Sunday, November 25, 2012

Solr - Solr .Net Client

I was looking for Solr .Net clients and found SolrNet.  The SolrNet page has links for downloading the binaries, and it also has a link to the git repository.  

I downloaded and built the SolrNet source code so I could compare the performance of indexing documents using the SolrNet and SolrJ clients.  Next, I created a simple console application and referenced the SolrNet library.  I then created a method that was basically a copy of the Java code I used (while using SolrJ) for reading in a CSV file and indexing batches of "documents".  The SolrJ version used POJOs (Plain Old Java Objects) with annotations specifying which Solr fields the properties map to.  The SolrNet version used POCOs (Plain Old CLR Objects) with attributes specifying which Solr fields the properties map to.

Here is an example of the POCO I used:


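using System.Collections.Generic;
using SolrNet.Attributes;

public class TestRecord
{
    // Maps to the schema's unique key field.
    [SolrUniqueKey("id")]
    public string ID { get; set; }

    // Maps to the multi-valued "lookupids" field.
    [SolrField("lookupids")]
    public List<int> LookupIDs { get; set; }
}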

Here is an example of the code that indexed the values of the test records in batches:

using System.Collections.Generic;
using Microsoft.Practices.ServiceLocation;
using SolrNet;

// The Solr server was initialized earlier in the code using
// the following line of code:
// Startup.Init<TestRecord>("http://localhost:8983/solr");

public static void AddValues(List<TestRecord> testRecords)
{
    var solr = ServiceLocator.Current.GetInstance<ISolrOperations<TestRecord>>();
    solr.AddRange(testRecords);  // add the whole batch
    solr.Commit();
}

It seems to index the data about as fast as the SolrJ code - which isn't terribly surprising.  It appeared that it was slightly slower, but I will need to run multiple tests of varying batch sizes to see how similar or different the results are between SolrNet and SolrJ.

It took ~2.5 minutes to index 100000 documents when using batches of 100 test records, ~50 seconds for batches of 1000 test records, and ~45 seconds for batches of 10000 test records.
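
To make the batch sizes easier to compare, the timings above can be converted into approximate documents per second.  This is only a rough sketch: the class and method names are made up for the example, and the elapsed times are the approximate figures from my runs, not precise measurements.

```java
// Sketch: convert the timings above into approximate documents per second.
// IndexingThroughput and docsPerSecond are hypothetical names for this example.
final class IndexingThroughput {

    // Documents indexed per second for one timed run.
    static double docsPerSecond(int documentCount, double elapsedSeconds) {
        return documentCount / elapsedSeconds;
    }

    public static void main(String[] args) {
        int documentCount = 100000;
        int[] batchSizes = {100, 1000, 10000};
        // ~2.5 minutes, ~50 seconds, and ~45 seconds from the runs above.
        double[] elapsedSeconds = {150.0, 50.0, 45.0};

        for (int i = 0; i < batchSizes.length; i++) {
            System.out.printf("batch size %d: about %.0f docs/sec%n",
                    batchSizes[i], docsPerSecond(documentCount, elapsedSeconds[i]));
        }
    }
}
```

That works out to roughly 667, 2000, and 2222 documents per second, so going from batches of 100 to batches of 1000 roughly tripled the throughput, while going from 1000 to 10000 added only about 10%.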

SolrNet makes it very easy to write code that queries, or indexes into, a Solr index.  I was very pleased!

Friday, November 23, 2012

Reading RSS Feeds with Java and C#


I wanted to read a few RSS feeds using Java or C#.  I started to write my own code for the RSS feed, and quickly realized that using the stubbed code generated by the various XSDs available for RSS is kind of a pain.  I started to rewrite the code to use XML annotations, and that seemed like a bit too much work when it dawned on me that I should have done a search for RSS-related code.  That's when I found Rome.

I was able to read an RSS feed with just two lines of code that I copy/pasted from the Rome tutorial page.  Very nice!

Here is a link to the tutorial:
http://wiki.java.net/twiki/bin/view/Javawsxml/Rome05TutorialFeedReader

I was also interested in seeing if there were any libraries for reading RSS feeds using C#.  I found RSS.Net.  http://www.rssdotnet.com/

It was really easy to get started by following the code examples that the author provided.

Whether you are using Java or C#, there is a handy library available for reading RSS feeds.


Solr - Indexing Data Using SolrJ and addBeans

So far it looks like indexing data using SolrJ is considerably slower than indexing data using the update handler and a local CSV file.  It took about 36 to 40 seconds to index 100000 documents using SolrServer.addBeans() compared to about 17 to 18 seconds using the update handler and a local CSV file.

The code using SolrJ, listed below, was running on the same machine as Solr.

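import java.io.IOException;
import java.util.List;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public static void IndexBeanValues(List<TestRecord> testRecords)
    throws IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    server.addBeans(testRecords);  // add the whole batch of annotated beans
    server.commit();
}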

I tried passing in an existing SolrServer instance instead of creating a new one on every call, but it didn't make any noticeable difference to the timing.  Instantiating a new SolrServer for each batch might make more of a difference if the Java code using SolrJ is running on a different machine than the Solr server being targeted.

Refer to this post for a more detailed code example using SolrJ and addBeans.
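
The code above indexes one batch per call, so the full list of records has to be split into batches first.  Here is a minimal sketch of that step using only the JDK.  BatchSplitter and split are hypothetical names, and the batch size of 1000 is just an example value.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a list of records into fixed-size batches.
// BatchSplitter is a hypothetical helper, not part of SolrJ.
final class BatchSplitter {

    // Splits items into consecutive sublists of at most batchSize elements.
    static <T> List<List<T>> split(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            records.add(i);
        }
        // Each batch would be handed to a call such as IndexBeanValues(batch).
        for (List<Integer> batch : split(records, 1000)) {
            System.out.println("batch of " + batch.size());
        }
    }
}
```

The last batch is simply smaller when the record count is not a multiple of the batch size.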

Solr - Indexing Data Using SolrJ

I think I found one of the slowest ways possible to index data into Solr.  I'm looking into various ways to index data into Solr:
  • indexing text files local to the server that Solr is running on using the update handler
  • indexing data using an app using SolrJ that is running on the same server as Solr
  • indexing data using an app using SolrJ that is on a different machine on the same network that the Solr server is on
I was able to index 100000 items of data into Solr using the update handler to process a CSV file in about 17 to 18 seconds.  Next I tried indexing the same data using SolrJ.  It took about 6 minutes!  I'm sure that the reason it took so long is the way that I wrote the method to index the data.  

The method looks like this:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public static void IndexValues(TestRecord[] testRecords)
        throws IOException, SolrServerException {

        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        for (int i = 0; i < testRecords.length; ++i) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", testRecords[i].getId());
            for (Integer value : testRecords[i].getLookupIds()) {
                doc.addField("lookupids", value);
            }
            server.add(doc);  // one add call per document
            if (i % 100 == 0) server.commit();  // periodically flush
        }
        server.commit();
    }

I'll have to try something similar, but using beans.  It seems like it could be a bit faster if I used the addBeans method to add multiple documents at once.
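
One plausible contributor to the slowness (my assumption, not something I measured) is the sheer number of calls the loop above makes to Solr.  The sketch below just counts them using plain arithmetic.  RequestCounter and its methods are made-up names for the example.

```java
// Sketch: count the add and commit calls made by the loop above,
// and compare with sending documents in batches.
// RequestCounter is a hypothetical helper for illustration only.
final class RequestCounter {

    // Commits issued by a loop that commits whenever i % interval == 0,
    // plus the final commit after the loop.
    static int commitsFor(int documentCount, int interval) {
        int commits = 0;
        for (int i = 0; i < documentCount; i++) {
            if (i % interval == 0) {
                commits++;
            }
        }
        return commits + 1;
    }

    // Add calls needed when documents are sent batchSize at a time.
    static int addRequestsFor(int documentCount, int batchSize) {
        return (documentCount + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        int documentCount = 100000;
        System.out.println("commits with interval 100: " + commitsFor(documentCount, 100));
        System.out.println("adds, one document per call: " + addRequestsFor(documentCount, 1));
        System.out.println("adds, 1000 documents per call: " + addRequestsFor(documentCount, 1000));
    }
}
```

For 100000 documents that is 1001 commits and 100000 add calls, compared with 100 add calls when sending 1000 documents at a time.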

Wednesday, November 21, 2012

Solr - Indexing Local CSV Files

As part of my investigation into all things Solr, and in particular the indexing of data into Solr cores, it occurred to me that indexing local files should be much faster than sending bits of data across a network to be indexed.  I am going to test the following:
  • indexing text files local to the server that Solr is running on using the update handler
  • indexing data using an app using SolrJ that is running on the same server as Solr
  • indexing data using an app using SolrJ that is on a different machine on the same network that the Solr server is on
My first step will be to get timings for indexing text files that are located on the Solr server.  I wrote an app that will generate random data with unique IDs (GUIDs), and one of the arguments is to specify how many records I should create.  I created a test file with 100000 records.

Here is an example of the output data:


The data is stored in a character-separated value file with just two columns.  The first line of the data file lists the fields that the data will be indexed into, and the field names are separated using the same separator that is used to divide the columns of data.  The first column is mapped to the schema's id field, and the second column is mapped to the schema's lookupids field.
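
A generator for this file format can be sketched in a few lines of Java.  This is not the app I actually used; TestDataGenerator is a hypothetical name, and the number of lookup IDs per row, their value range, and the choice to keep them sorted and unique are all assumptions made for the example.

```java
import java.util.Random;
import java.util.StringJoiner;
import java.util.TreeSet;
import java.util.UUID;

// Sketch: generate rows in the two-column, TAB-separated format described above.
// Counts and value ranges are assumptions, not the settings of the real generator.
final class TestDataGenerator {

    // Header row naming the Solr fields, separated by a TAB.
    static final String HEADER = "id\tlookupids";

    // Builds one data row: a GUID, a TAB, then sorted unique ints joined by commas.
    static String row(Random random, int maxIdCount, int maxValue) {
        if (maxIdCount < 1 || maxIdCount > maxValue) {
            throw new IllegalArgumentException("need 1 <= maxIdCount <= maxValue");
        }
        int count = 1 + random.nextInt(maxIdCount);
        TreeSet<Integer> lookupIds = new TreeSet<>();
        while (lookupIds.size() < count) {
            lookupIds.add(random.nextInt(maxValue));
        }
        StringJoiner joined = new StringJoiner(",");
        for (int id : lookupIds) {
            joined.add(Integer.toString(id));
        }
        return UUID.randomUUID() + "\t" + joined;
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(HEADER);
        for (int i = 0; i < 3; i++) {
            System.out.println(row(random, 10, 1000000));
        }
    }
}
```

Redirecting the output to a file and raising the loop count to 100000 would produce a file of the same shape as my test file.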

I modified the schema.xml to add a field named "lookupids", set type="int", and set multiValued="true".

I copied a file named testdata.txt to the exampledocs directory, and then imported the data using this URL:

http://localhost:8983/solr/update/csv?commit=true&separator=%09&f.lookupids.split=true&f.lookupids.separator=%2C&stream.file=exampledocs/testdata.txt

I found the information on what to use in the URL here: http://wiki.apache.org/solr/UpdateCSV

The parameters:

  • commit - The commit parameter being set to true will tell Solr to commit the changes after all the records in the request have been indexed.
  • separator - The separator is set to the TAB character (%09 is the URL-encoded form of the TAB character, ASCII value 9).
  • f.lookupids.split - The "f" is shorthand for field, and the field that is referenced is the "lookupids" field.  This parameter tells Solr to split the specified field into multiple values.
  • f.lookupids.separator - The f.lookupids.separator parameter tells Solr to split the lookupids values on commas (%2C is the URL-encoded comma).
  • stream.file - The stream.file tells Solr to stream the file contents from the local file found at "exampledocs/testdata.txt".
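
The URL above can also be assembled from its parameters in code, which avoids hand-encoding the TAB and comma.  This is a minimal sketch using only the JDK (Java 10 or later for this URLEncoder overload); CsvUpdateUrl and its methods are hypothetical names.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: build the CSV update URL from its parameters.
// CsvUpdateUrl is a hypothetical helper, not part of Solr or SolrJ.
final class CsvUpdateUrl {

    // Joins the parameters into a query string, URL-encoding each value.
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(param.getKey() + "="
                    + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "?" + query;
    }

    // The same parameters as the URL above, in the same order.
    static String exampleUrl() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("commit", "true");
        params.put("separator", "\t");             // encoded as %09
        params.put("f.lookupids.split", "true");
        params.put("f.lookupids.separator", ",");  // encoded as %2C
        params.put("stream.file", "exampledocs/testdata.txt");
        return build("http://localhost:8983/solr/update/csv", params);
    }

    public static void main(String[] args) {
        System.out.println(exampleUrl());
    }
}
```

The only difference from the hand-written URL is that the slash in the stream.file value comes out as %2F, which the server should decode back to the same path.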
The testdata.txt file was indexed into Solr in 17 to 18 seconds.

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">17033</int>
</lst>
</response>

Next I'll try indexing the same file using SolrJ on the same machine that is running Solr.

Tuesday, November 20, 2012

Solr - Querying with SolrJ



I added a method, cleverly named FindIds, to my test code that will find the unique IDs by doing a search on lookup IDs that are in a range from 0 to 10000.  The query string looks like this:

"lookupids:[0 TO 10000]"

Even though the query is incredibly simple, I used the Solr admin page to test the query first.  That way I would know what to expect from SolrJ.  If I got different results from SolrJ, then I would know that I needed to investigate why there was a difference.

Here is the code used:

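import java.util.ArrayList;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// Both methods below live in the SolrIndexer class.
public static void main(String[] args)  {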
  ArrayList<String> ids = SolrIndexer.FindIds("lookupids:[0 TO 10000]");

  for (String id : ids) {
    System.out.printf("%s%n", id);
  }
}

public static ArrayList<String> FindIds(String searchString) {
  ArrayList<String> ids = new ArrayList<String>();
  int startPosition = 0;

  try {

    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery();
    query.setQuery(searchString);
    query.setStart(startPosition);
    query.setRows(20);
    QueryResponse response = server.query(query);
    SolrDocumentList docs = response.getResults();
    
    while (docs.size() > 0) {
      startPosition += 20;

      for(SolrDocument doc : docs) {
        ids.add(doc.get("id").toString());
      }

      query.setStart(startPosition);
      response = server.query(query);
      docs = response.getResults();
    }

  } catch (SolrServerException e) {
    e.printStackTrace();  
  }
  
  return ids;
}

If you don't set the number of rows to be returned using query.setRows(), then the default number of rows to be returned will be used.  The default used in the Solr example config is 10.  

If the query results in nothing being found, then the SolrDocumentList is instantiated with 0 items.  If there is an error with the query string, then an Exception will be thrown.

A sort field is a nice addition, e.g. query.setSortField("id", SolrQuery.ORDER.asc);

There are still some areas that I want to explore with SolrJ, so I will be posting more in the near future.

Monday, November 19, 2012

Solr - Indexing data using SolrJ


I'm using Solr at work, so I've been experimenting at home with various ways to index data into Solr.  The latest method I tried using is SolrJ.

The Setup

I have Solr set up on my Windows 7 box - I just downloaded the Solr 4.0 zip from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.0.0.   I'm using IntelliJ Community Edition, so I created a new project and then added references to the necessary jar files by going to the Project Settings | Libraries section.  I clicked the + sign, and picked "Java", and then selected all of the SolrJ related jars.  SolrJ is distributed with Solr, and the related jar files for using SolrJ can be found in %SOLR_HOME%\dist and %SOLR_HOME%\dist\solrj-lib.

The Code

Using SolrJ to index data into Solr is amazingly simple.  The sample code found at solrtutorial.com is almost usable as copy and paste.  The sample targets a version of Solr older than 4.0, but everything will work as long as you change CommonsHttpSolrServer (and its import) to HttpSolrServer.  Of course you will also want to use field names that match your schema, but if you use the example Solr instance (and therefore the example schema.xml) to test your code then it will work fine.

First, I updated the %SOLR_HOME%\example\solr\collection1\conf\schema.xml file by adding a multi-valued int field called "lookupids".

<field name="lookupids" type="int" indexed="true" stored="true" multiValued="true"/>

I reloaded the core using the Solr admin page to make sure that I didn't manage to screw up the schema: click Core Admin, and then click the Reload button.  The button should turn green after loading if the schema is valid.

Second, I created a new Java project using IntelliJ.  I had the code read a CSV file to populate an array of objects (Widgets), each holding a list of lookup IDs.  I used "lookup" IDs for no particular reason other than I thought it made as much sense as using any other arbitrary property name to search on.

After the code loads an array of Widgets, I had the code call a method called IndexValues.  Here is the mostly copy and paste code from the solrtutorial site:

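import java.io.IOException;
import java.util.List;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public static void IndexValues(String solrDocId, List<Widget> widgets)
    throws IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    int i = 0;  // documents added so far
    for (Widget widget : widgets) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", solrDocId);  // note: every widget in this call shares the same id
        for (Integer value : widget.getLookUpIds()) {
            doc.addField("lookupids", value);
        }
        server.add(doc);
        if (++i % 100 == 0) server.commit();  // periodically flush
    }
    server.commit();
}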


It would be a good idea to have the URL and number of items to index between commits configurable, but I just wanted to get data indexed with as little work as possible.  It was very simple thanks to the solrtutorial.com site.

Also, it might be a good idea to obtain the HttpSolrServer from a singleton provider class rather than instantiating it in each method.  The provider class could have helper methods for pinging the Solr instance, and for doing an optimize.  There might be well-known patterns to follow, so I would read the wiki and look for existing examples first before creating a SolrJ-based utility.

Let me know if you come across any good practices to follow, or pitfalls to avoid, when using SolrJ.

I'm going to try using SolrJ to do searches next.  Let me know if there is anything in particular I should watch out for, or if there is anything that you would like me to write about regarding Solr, SolrJ, etc.

Sunday, November 18, 2012

Solr - Indexing documents

We're using Solr 4.0 at work, so I decided that I should spend some time messing around with the gears and levers to make sure that I really understand what I'm doing.

I made a schema that included a field called "id" and a multi-valued field called "lookupids".  I created a file that had a header row of "id<tab>lookupids", and data rows that had a GUID followed by random ints separated by commas, e.g.,

269d8a33-0fd6-4877-b631-dccc4146cf90<tab>11507,25964,118430,306825,315793,348797,349191

The file contained 100000 entries, and I was able to index the file using a URL like this:

http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=exampledocs/test_with_lookupids.txt

One thing that I was expecting to happen was for the results to return the lookupids as an array.  Instead the lookupids field values are returned the same way they were stored in the source file.


<result name="response" numFound="1" start="0">
<doc>
<str name="id">e09d8f38-c1ef-4a97-a832-a4bdc0b18bc5</str>
<str name="lookupids">2,16481,38485,50205,101885,107642,110903,142770,174184,193689,204770,223341,225669,242335,253654,278519,284132,333735,352163,372383,377816,401338,420851,443967,500899,575204,593052,645555,667294,742558,757738,804361,826200,828540,839016,859782,875115,877853,893658,915890,945398,954502,969859,971992,989172</str>
<long name="_version_">1419020904549056527</long>
</doc>
</result>


The reason I was expecting the lookupids to be returned as an array is that the lookupids field was defined as follows:

<field name="lookupids" type="commaDelimited" indexed="true" stored="true" multivalued="true"/>

<fieldType name="commaDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*" />
  </analyzer>
</fieldType>

I figured that having the field defined as multivalued, and having the commaDelimited type set to use the PatternTokenizer with a pattern that separates using the comma to identify tokens, would give the array response.

I'll update this post once I figure out how to get the results as an array.
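
In the meantime, the single string can be split on the client side.  This is only a workaround sketch, assuming the lookupids value arrives as one comma-separated string like the one in the response above.  LookupIdParser is a hypothetical name, and the split pattern simply mirrors the one used in the fieldType definition.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn the comma-separated lookupids string into a list of ints.
// LookupIdParser is a hypothetical client-side helper, not part of Solr.
final class LookupIdParser {

    // Splits on a comma followed by optional whitespace and parses each token.
    static List<Integer> parse(String storedValue) {
        List<Integer> ids = new ArrayList<>();
        if (storedValue == null || storedValue.trim().isEmpty()) {
            return ids;
        }
        for (String token : storedValue.trim().split(",\\s*")) {
            ids.add(Integer.parseInt(token));
        }
        return ids;
    }

    public static void main(String[] args) {
        // The first few values from the response shown above.
        System.out.println(parse("2,16481,38485,50205"));
    }
}
```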