Discovering the Art of Solr Search

Introduction :

“Why do we need a search server?”

Before answering this question, let’s look at a classic problem that hurts the performance of many real-time applications and products once their data grows beyond a certain size. Consider an application whose data is scattered across around 20 database tables, where every search starts from a main table and joins across all 20 of them. Assume, too, that the amount of data needed to build each search response is huge.

“How would we do this in a real time scenario?”

Most techies would answer: NoSQL. But is that really the purpose of NoSQL? The name is usually read as “Not only SQL” and refers to non-relational data stores. Products whose data does not fit a relational model benefit from NoSQL, whereas products with well-structured data need not migrate to NoSQL just to solve the problem of searching through a huge amount of data. The answer is a proven solution built specifically for search: a search server. The one we are going to discuss is the Solr search server.

Apache Solr (pronounced “solar”) is an open source, standalone, enterprise full-text search platform written in Java. Solr uses the Lucene Java search library at its core for full-text indexing and searching, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages. Solr’s powerful external configuration allows it to be tailored to many types of applications without Java coding.

Indexing :

We are treating Solr as a search server, but Solr cannot search through data sources (RDBMS, file systems) directly.

“How can we make the Solr Server search through the data in datasources?”

We need to bring the data into a form that Solr can work with. The process of adding or updating documents in Solr is called indexing. It is easy to add data to Solr as JSON, XML, or CSV over an HTTP request. Solr represents data as documents, which are similar to rows in a database table.
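As a rough illustration of adding a document as JSON over HTTP, here is a minimal Python sketch. It only builds the request and does not send it. The host, port, and the core name “articles” are assumptions made for this example, not values from any real setup.

```python
import json
import urllib.request

# Assumption: a local Solr instance with a core named "articles".
SOLR_UPDATE_URL = "http://localhost:8983/solr/articles/update?commit=true"


def build_update_request(docs):
    """Build (but do not send) a POST request that adds documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One document, similar to one row in a database table.
docs = [{"id": "1", "content": "Solr is a Search Server"}]
request = build_update_request(docs)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running Solr, uncomment the next line:
# urllib.request.urlopen(request)
```

The commit=true parameter asks Solr to make the new documents searchable right away, which is convenient for experiments.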

“What are the things we need to do to add documents in Solr?”

The first step is to configure the fields (similar to columns in database tables). To configure a field, we need to decide on its field type (similar to the data type in a column declaration). Solr provides default field types for strings, integers, booleans, and floats, among others. Based on our needs, we can create our own types by defining analyzers.

Analyzer (Tokenizer + Filters):

So, “What is an analyzer?”

An analyzer is a combination of a tokenizer and filters. The tokenizer decides how the whole content is split into chunks (tokens). Filters are then applied to those tokens in order, and each one keeps, changes, or drops tokens according to its own algorithm.

Experiment:

Imagine we have configured a field type text_english whose analyzer combines a WhitespaceTokenizer with a stop word filter (StopFilter; the stop words to be filtered can be loaded from an external text file), followed by a LowerCaseFilter.

Just as we created our own field type, let’s create a field named “content” that uses the new field type “text_english”.
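If the core uses a managed schema, a field type and field like these could be created through Solr’s Schema API by POSTing JSON commands to the core’s /schema endpoint. The sketch below only builds that payload in Python. The stop word file name “stopwords.txt” and the ignoreCase setting are assumptions for this example.

```python
import json

# Field type from the experiment: whitespace tokenizer, stop filter, lowercase filter.
add_field_type = {
    "add-field-type": {
        "name": "text_english",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
            "filters": [
                # Assumption: stop words live in stopwords.txt and ignore case.
                {
                    "class": "solr.StopFilterFactory",
                    "words": "stopwords.txt",
                    "ignoreCase": "true",
                },
                {"class": "solr.LowerCaseFilterFactory"},
            ],
        },
    }
}

# The "content" field that uses the new field type.
add_field = {
    "add-field": {
        "name": "content",
        "type": "text_english",
        "indexed": True,
        "stored": True,
    }
}

# Both commands can travel in a single JSON object.
schema_commands = {**add_field_type, **add_field}
print(json.dumps(schema_commands, indent=2))
```

The order of the entries in “filters” matters, because each filter receives the output of the previous one.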

Now you may wonder, “How does the analyzer react while Indexing and Searching?”

Take the sample text “Solr is a Search Server”. Let’s index this content into Solr and see how the analyzer reacts. First the configured WhitespaceTokenizer is applied, and its output is the token list [Solr, is, a, Search, Server]. Then the stop word filter is applied to this list, producing [Solr, Search, Server]. Finally the LowerCaseFilter is applied and produces the final output [solr, search, server].

We now have the terms that point to the newly created document id; let’s call it #1. Lucene internally maintains an inverted index whose keys are the identified terms and whose values are the ids of the documents containing them, along with each term’s position in the document. The original content is also stored under the corresponding field name (when the field is marked as stored).
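The walkthrough above can be imitated in a few lines of Python. This is only a toy model of the analyzer chain, not Lucene’s implementation; the stop word set and the case-insensitive matching are assumptions chosen for the example.

```python
# Assumed stop word list for this example.
STOPWORDS = {"is", "a", "what"}


def whitespace_tokenize(text):
    """Split on whitespace and remember each token's 1-based position."""
    return [(token, position) for position, token in enumerate(text.split(), start=1)]


def stop_filter(tokens, stopwords=STOPWORDS):
    """Drop stop words (matched case-insensitively); kept tokens retain their positions."""
    return [(token, position) for token, position in tokens if token.lower() not in stopwords]


def lowercase_filter(tokens):
    """Lowercase every remaining token."""
    return [(token.lower(), position) for token, position in tokens]


def analyze(text):
    """Run the tokenizer, then each filter in the configured order."""
    return lowercase_filter(stop_filter(whitespace_tokenize(text)))


print(analyze("Solr is a Search Server"))
# [('solr', 1), ('search', 4), ('server', 5)]
```

Note that the surviving terms keep their original positions (1, 4, 5), which is why the removed stop words leave gaps.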

ID	Term	Document:Position
1	solr	1:1
2	search	1:4
3	server	1:5

We have now indexed a document and seen what happens internally when an analyzer is used.
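To make the idea of an inverted index concrete, here is a small Python sketch that builds one from already analyzed terms. It is a simplified model for illustration; Lucene’s real index structures are far more compact and elaborate.

```python
from collections import defaultdict


def build_inverted_index(analyzed_docs):
    """Map each term to a list of (document id, position) postings."""
    index = defaultdict(list)
    for doc_id, terms in analyzed_docs.items():
        for term, position in terms:
            index[term].append((doc_id, position))
    return dict(index)


# Document #1 after analysis: lowercase terms with their positions.
analyzed_docs = {1: [("solr", 1), ("search", 4), ("server", 5)]}

index = build_inverted_index(analyzed_docs)
for term, postings in index.items():
    print(term, postings)
# solr [(1, 1)]
# search [(1, 4)]
# server [(1, 5)]

# Looking up a term is a single dictionary access.
print(index.get("solr", []))
# [(1, 1)]
```

Because the lookup goes from term to documents, finding matches does not require scanning the content of every document.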

Searching:

Now we have a document to search in Solr. Take the search query “What is Solr” as an example. We can use the same analyzer we defined above for indexing, or we can define a separate analyzer for querying. In our case, we use the same one. The analyzer processes the input and, assuming the stop word list also contains “what” and ignores case, reduces it to the single term [solr]. Solr looks this term up in the inverted index and finds document #1, with the term at position 1. Position information is what makes phrase queries possible, and it also helps with highlighting the matching terms in the results.
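A query like this is normally sent to the core’s /select handler. The Python sketch below only assembles the URL; the host, port, and the core name “articles” are assumptions for this example, while q, df, hl, hl.fl, and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "articles".
SOLR_SELECT_URL = "http://localhost:8983/solr/articles/select"


def build_select_url(user_query, field="content"):
    """Build a search URL that queries one field and highlights matches in it."""
    params = {
        "q": user_query,   # the raw user input; Solr runs the query analyzer on it
        "df": field,       # default field to search when the query names none
        "hl": "true",      # turn on highlighting
        "hl.fl": field,    # field in which to highlight the matching terms
        "wt": "json",      # response format
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_select_url("What is Solr")
print(url)
```

The query text is sent as typed; the analysis into terms happens on the Solr side, using the analyzer configured for the field.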

Solr provides the following features for fields:

Indexed: The value of the field can be used in queries to retrieve matching documents.
Stored: The actual value of the field can be retrieved by queries.
MultiValued: Indicates that a single document might contain multiple values for this field.
termVectors, termPositions, termOffsets, termPayloads: These options instruct Solr to maintain full term vectors for each document, optionally including position, offset and payload information for each term occurrence in those vectors.

Dynamic fields are a very powerful feature of Solr: they let you index fields that you did not explicitly define in your schema.

For example, suppose your schema includes a dynamic field with a name of *_name. If you attempt to index a document with a first_name field, but no explicit first_name field is defined in the schema, then the first_name field will have the field type and analysis defined for *_name. The field name is matched against the ‘*_name’ pattern to decide this. In the same way, we can index fields such as middle_name, last_name, and sur_name.

Now we have understood how to configure field types and fields, and how indexing and searching work in Solr using Lucene.
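The matching rule can be modelled in a few lines of Python. This is a simplified sketch of the idea, not Solr’s own resolution code, and the field types assigned below are assumptions for the example.

```python
from fnmatch import fnmatchcase

# Assumed schema for this example: two explicit fields and one dynamic field.
EXPLICIT_FIELDS = {"id": "string", "content": "text_english"}
DYNAMIC_FIELDS = {"*_name": "text_general"}


def resolve_field_type(field_name):
    """Prefer an explicit field; otherwise fall back to a matching dynamic field."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # no explicit or dynamic field matches


for name in ["content", "first_name", "last_name", "age"]:
    print(name, "->", resolve_field_type(name))
# content -> text_english
# first_name -> text_general
# last_name -> text_general
# age -> None
```

Here “age” matches nothing, which corresponds to a field that the schema does not know how to handle.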

DataImportHandler :

So, “What is next?” Solr provides a REST-like API to add documents, but using it for a bulk load means painstakingly writing a program that fetches millions of records from a data source such as an RDBMS or a NoSQL store, iterates through them, and indexes them into Solr. To avoid this kind of extra work, Solr provides another useful feature called DataImportHandler, where we only need to configure the datasource details (driver, URL, username, password) and the query that returns the records. We must take care that the column names returned by the query match the configured field names, or map them explicitly in the configuration.
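To give a feel for the hand-written import program that DataImportHandler saves us from, here is a small Python sketch. It uses an in-memory SQLite database as a stand-in for the real data source; the table, its columns, and the sample rows are made up for this example. The query aliases the columns so that they line up with the Solr field names.

```python
import sqlite3


def fetch_documents(connection, query):
    """Yield each row as a dict keyed by the column names the query returns."""
    cursor = connection.execute(query)
    column_names = [column[0] for column in cursor.description]
    for row in cursor:
        yield dict(zip(column_names, row))


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Stand-in data source with made-up sample rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE article (article_id INTEGER, body TEXT)")
connection.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [(1, "Solr is a Search Server"), (2, "Lucene powers Solr"), (3, "Indexing in batches")],
)

# Aliases make the column names match the Solr field names "id" and "content".
query = "SELECT article_id AS id, body AS content FROM article ORDER BY article_id"

for batch in batches(fetch_documents(connection, query), size=2):
    print(batch)  # each batch would be sent to Solr as one update request
```

Even this toy version has to deal with naming and batching; a real one would also need retries, error handling, and progress tracking.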

Incremental / Bulk Indexing :

Two questions may now come to mind: “What happens if we want to add partially updated or newly added records incrementally?” and “What happens if we have millions of records in our datasource? Is DataImportHandler designed to handle that many records?”

The answer to the first question is that Solr provides incremental (delta) indexing, which fetches only the records that changed since the last import, based on the updated_timestamp column referenced in the delta query configuration.

The answer to the second question is yes, absolutely: it can handle a large number of records. To make this work, add the batchSize=-1 attribute when configuring the datasource attributes; with the MySQL JDBC driver, for example, this makes the driver stream rows instead of loading the whole result set into memory.
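The idea behind a delta query can be sketched in Python with an in-memory SQLite table. The table layout, the sample rows, and the stored last-import time are all made up for this example; only the updated_timestamp comparison mirrors what the text describes.

```python
import sqlite3

# Stand-in data source with made-up sample rows.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE article (id INTEGER, content TEXT, updated_timestamp TEXT)"
)
connection.executemany(
    "INSERT INTO article VALUES (?, ?, ?)",
    [
        (1, "Solr is a Search Server", "2024-01-01 10:00:00"),
        (2, "Lucene powers Solr", "2024-01-03 09:30:00"),
        (3, "Indexing in batches", "2024-01-05 18:45:00"),
    ],
)


def changed_since(connection, last_index_time):
    """Return only the records modified after the previous import."""
    cursor = connection.execute(
        "SELECT id, content FROM article WHERE updated_timestamp > ? ORDER BY id",
        (last_index_time,),
    )
    return [{"id": row[0], "content": row[1]} for row in cursor]


# Assumed time of the previous import run.
last_index_time = "2024-01-02 00:00:00"

for doc in changed_since(connection, last_index_time):
    print(doc)
# {'id': 2, 'content': 'Lucene powers Solr'}
# {'id': 3, 'content': 'Indexing in batches'}
```

Only records 2 and 3 are picked up, so the unchanged record 1 is not fetched or indexed again.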

Other Cool Features :

“What else does Solr have?”

Solr supports distributed indexing and searching through a mode called SolrCloud.
Solr also supports advanced text analysis for various languages (Chinese, Japanese, German, French, etc.).
Solr also ships with an administrative user interface to manage cores, the schema, DataImportHandler, etc.
Solr also supports faceting, filtering, geospatial search, highlighting of search terms, spellchecking of the input, and typeahead suggestions.
Client libraries for Solr are available for Java, PHP, Python, Ruby, and JavaScript.

Conclusion :

Solr is a high-value search platform that delights with easily indexable content, filtered results, search results that can be drilled down stage by stage, autocomplete suggestions for search terms, correction of spelling errors, and a setup that is easily extendable through plugins. Being an enterprise search platform, it is built to keep search responsive as the data and site content volume grows.