The most popular enterprise search engine
Apache Solr is a search platform built on Apache Lucene, a Java library for full-text search. Solr was originally developed in 2004 for the CNET Networks news portal. Since 2007, the search platform has been a top-level project at the Apache Software Foundation, and it does far more than search through text. Today, Solr is regarded as the most popular search engine in the enterprise environment. Its users include heavyweights such as AT&T, eBay, Instagram and Netflix. We have been using Solr for years as the search component in complex web applications, for example in online shops and e-government platforms. The more recent Elasticsearch is an alternative. Because it is maintained by a private company, however, both the long-term development of that search engine and the future form of its license remain open questions.
Flexible indexing and fast searching
Indexing and searching are the two central steps in working with Solr. Indexing is conceptually comparable to compiling the index of a book, in which keywords refer to page numbers. When a new chapter is added, the index has to be updated. Once an index has been created, content can be found very quickly via search terms. For example, a project could involve implementing real-time search over a pool of 20 billion records, with various criteria or combinations of criteria, to make it possible to trace historical associations. Besides text, Solr supports many other data types, such as coordinate pairs or even geometric shapes.
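The book-index analogy can be made concrete with a toy inverted index. The following Python sketch is purely illustrative and is not Solr's or Lucene's implementation: the class InvertedIndex, the helper tokenize and the sample documents are hypothetical names chosen for this example. It maps each keyword to the set of documents that contain it, so a search only has to look up the query terms instead of scanning every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing step: register the document under each of its terms,
        # like adding page numbers to the keyword entries of a book index.
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Search step: look up every query term and keep only the
        # documents that contain all of them (AND semantics).
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Hypothetical sample documents, used only for illustration.
index = InvertedIndex()
index.add(1, "Solr is built on Lucene")
index.add(2, "Lucene is a Java library for full-text search")
index.add(3, "Hadoop processes data in a cluster")

print(sorted(index.search("lucene java")))
```

Adding a document only touches the entries for its own terms, which mirrors the point that the index has to be updated when a new chapter is added. A production engine additionally stores term positions, relevance statistics and typed fields, none of which this sketch attempts.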
Designed for distributed systems
One of the most important characteristics of Solr in the context of big data is that the technology scales well and is designed for distributed systems. If the index becomes very large, it can be distributed across several servers in so-called shards without much effort. A query is then split into several sub-queries, each of which runs on an individual shard, and the partial results are combined into one result list. These strengths are especially apparent in combination with Hadoop, which can also index the content in parallel across a cluster at high speed.
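The idea of splitting a query across shards can be sketched in a few lines of Python. This is a simplified model and not Solr's actual protocol: the functions shard_for, search_shard and distributed_search, the modulo routing rule, the term-count scoring and the sample documents are all assumptions made for this illustration. Each shard answers the sub-query from its own slice of the data, and the partial top hits are then merged into one ranked list.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_for(doc_id, num_shards):
    """Assumed routing rule: place a document on a shard by id modulo."""
    return doc_id % num_shards


def search_shard(shard, term, limit):
    """Sub-query on one shard: score by term count, return the local top hits."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(limit, hits)


def distributed_search(shards, term, limit=3):
    """Send the sub-query to all shards in parallel and merge the results."""
    term = term.lower()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, limit), shards))
    # Merge step: keep the best hits across all partial result lists.
    merged = heapq.nlargest(limit, (hit for part in partials for hit in part))
    return [doc_id for _score, doc_id in merged]


# Hypothetical sample data, distributed over two in-memory "shards".
documents = {
    0: "solr scales with shards",
    1: "hadoop cluster indexing",
    2: "solr and hadoop and solr",
    3: "lucene search library",
    4: "solr on hadoop",
}
NUM_SHARDS = 2
shards = [dict() for _ in range(NUM_SHARDS)]
for doc_id, text in documents.items():
    shards[shard_for(doc_id, NUM_SHARDS)][doc_id] = text

print(distributed_search(shards, "solr"))
```

Because every shard returns only its own best hits, the merge step handles at most limit entries per shard regardless of how large the whole index is. In this sketch the shards are plain dictionaries in one process. In a real deployment they would be separate servers reached over the network.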