Monday, November 19, 2012

Solr - Indexing data using SolrJ


I'm using Solr at work, so I've been experimenting at home with various ways to index data into Solr. The latest method I tried is SolrJ.

The Setup

I have Solr set up on my Windows 7 box - I just downloaded the Solr 4.0 zip from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.0.0. I'm using IntelliJ Community Edition, so I created a new project and then added references to the necessary jar files in the Project Settings | Libraries section. I clicked the + sign, picked "Java", and then selected all of the SolrJ-related jars. SolrJ is distributed with Solr, and the jar files needed to use it can be found in %SOLR_HOME%\dist and %SOLR_HOME%\dist\solrj-lib.

The Code

Using SolrJ to index data into Solr is amazingly simple. The sample code found at solrtutorial.com is almost usable as copy and paste. The tutorial targets a version of Solr older than 4.0, but everything works as long as you change the import for CommonsHttpSolrServer to HttpSolrServer. Of course you will also want to use field names that match your schema, but if you use the example Solr instance (and therefore the example schema.xml) to test your code, it will work fine.
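As a quick sketch of what that change looks like (the class name and URL here are just an illustration, assuming the example instance on the default port):

// The only change from the pre-4.0 tutorial code is the client class:
// old import: org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
// new import in Solr 4.0:
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SolrConnectionExample {
    public static void main(String[] args) {
        // Assumes the example Solr instance running on the default port
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        System.out.println("Client created for: " + server.getBaseURL());
    }
}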

First, I updated the %SOLR_HOME%\example\solr\collection1\conf\schema.xml file by adding a multi-valued int field called "lookupids".

<field name="lookupids" type="int" indexed="true" stored="true" multiValued="true"/>

I then reloaded the core using the Solr admin page to make sure that I didn't manage to screw up the schema (from the admin page, click Core Admin and then click the Reload button; it turns green after loading if the schema is valid).
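As a side note, the same reload can be done from code with SolrJ's CoreAdminRequest. Here is a minimal sketch, assuming the default core name "collection1" and the default port:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadCoreExample {
    public static void main(String[] args) throws Exception {
        // Core admin requests go against the root Solr URL, not a specific core
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        CoreAdminRequest.reloadCore("collection1", server);
    }
}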

Second, I created a new Java project using IntelliJ. I had the code read a CSV file to populate an array of Widget objects with lookup IDs (a rough sketch of what that might look like is below). I used "lookup" IDs for no particular reason other than I thought it made as much sense as any other arbitrary property name to search on.
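Here is a minimal sketch of the Widget class and the CSV loading; the class shape and file format are assumptions, since they aren't the interesting part:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Widget {
    private final List<Integer> lookUpIds = new ArrayList<Integer>();

    public List<Integer> getLookUpIds() {
        return lookUpIds;
    }

    // Assumes each CSV line is a comma-separated list of integer lookup IDs for one widget
    public static List<Widget> loadFromCsv(String path) throws IOException {
        List<Widget> widgets = new ArrayList<Widget>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Widget widget = new Widget();
                for (String token : line.split(",")) {
                    widget.getLookUpIds().add(Integer.parseInt(token.trim()));
                }
                widgets.add(widget);
            }
        } finally {
            reader.close();
        }
        return widgets;
    }
}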

After the code loads the array of Widgets, it calls a method named IndexValues. Here is the mostly copy-and-paste code from the solrtutorial site, with a counter added so the periodic commit works:

import java.io.IOException;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public static void IndexValues(String solrDocId, List<Widget> widgets) throws
    IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    int i = 0;
    for (Widget widget : widgets) {
        SolrInputDocument doc = new SolrInputDocument();
        // "id" is the uniqueKey in the example schema, so each document needs its own value
        doc.addField("id", solrDocId + "-" + i);
        for (Integer value : widget.getLookUpIds()) {
            doc.addField("lookupids", value);
        }
        server.add(doc);
        i++;
        if (i % 100 == 0) server.commit();  // periodically flush
    }
    server.commit();
}
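Tying it together, driver code along these lines is all that is needed (the CSV path and id prefix are made up, and loadFromCsv refers to the hypothetical Widget sketch above):

public static void main(String[] args) throws Exception {
    // Hypothetical CSV path and document id prefix, just for illustration
    List<Widget> widgets = Widget.loadFromCsv("widgets.csv");
    IndexValues("widget", widgets);
}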


It would be a good idea to make the Solr URL and the number of documents indexed between commits configurable, but I just wanted to get data indexed with as little work as possible. It was very simple thanks to the solrtutorial.com site.

Also, it might be a good idea to instantiate the HttpSolrServer through a singleton provider class. The provider class could have helper methods for pinging the Solr instance and for doing an optimize; a rough sketch of the idea is below. There might be well-known patterns to follow, so I would read the wiki and look for existing examples before creating a SolrJ-based utility.
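A rough sketch of that provider idea (the class name and hard-coded URL are assumptions, not an established SolrJ pattern):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SolrServerProvider {
    private static final HttpSolrServer SERVER =
        new HttpSolrServer("http://localhost:8983/solr");

    private SolrServerProvider() {}

    public static HttpSolrServer getServer() {
        return SERVER;
    }

    // Helper that pings the Solr instance and reports whether it responded OK
    public static boolean ping() {
        try {
            return SERVER.ping().getStatus() == 0;
        } catch (SolrServerException e) {
            return false;
        } catch (IOException e) {
            return false;
        }
    }

    // Helper that asks Solr to optimize the index
    public static void optimize() throws SolrServerException, IOException {
        SERVER.optimize();
    }
}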

Let me know if you come across any good practices to follow, or pitfalls to avoid, when using SolrJ.

I'm going to try using SolrJ to do searches next.  Let me know if there is anything in particular I should watch out for, or if there is anything that you would like me to write about regarding Solr, SolrJ, etc.
