Wednesday, November 21, 2012

Solr - Indexing Local CSV Files

As part of my investigation into all things Solr, and in particular the indexing of data into Solr cores, it occurred to me that indexing local files should be much faster than sending bits of data across a network to be indexed.  I am going to test the following:
  • indexing text files local to the Solr server using the CSV update handler
  • indexing data with a SolrJ app running on the same server as Solr
  • indexing data with a SolrJ app running on a different machine on the same network as the Solr server
My first step will be to get timings for indexing text files that are located on the Solr server.  I wrote an app that generates random data with unique IDs (GUIDs); one of its arguments specifies how many records to create.  I created a test file with 100,000 records.

Here is an example of the output data:


The data is stored in a character-separated values file with just two columns.  The first line of the file lists the fields that the data will be indexed into, and the field names are separated by the same separator that divides the columns of data.  The first column is mapped to the schema's id field, and the second column is mapped to the schema's lookupids field.
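
The generator itself is not shown in this post, so here is a minimal Python sketch of what such a tool might look like. The function name write_test_file, the number of lookup IDs per record, and the range of the ID values are all assumptions made for illustration. Only the file layout follows the post: a header row, a TAB between the two columns, a GUID in the first column, and comma-separated integers in the second.

```python
import random
import uuid


def write_test_file(path, num_records, max_ids_per_record=5, seed=None):
    """Write a tab-separated test file with a header row and GUID ids."""
    # The seed only controls the lookup IDs; the GUIDs come from uuid4().
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        # Header row: field names, split by the same separator as the data.
        out.write("id\tlookupids\n")
        for _ in range(num_records):
            doc_id = str(uuid.uuid4())
            count = rng.randint(1, max_ids_per_record)
            # Hypothetical value range; the post does not state one.
            lookup_ids = [str(rng.randint(1, 1_000_000)) for _ in range(count)]
            out.write(doc_id + "\t" + ",".join(lookup_ids) + "\n")
```

Calling write_test_file("testdata.txt", 100000) would produce a file of the size used in this test.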

I modified schema.xml to add a field named "lookupids", with type="int" and multiValued="true".

I copied a file named testdata.txt to the exampledocs directory, and then imported the data using this URL:

http://localhost:8983/solr/update/csv?commit=true&separator=%09&f.lookupids.split=true&f.lookupids.separator=%2C&stream.file=exampledocs/testdata.txt

I found the information on what to use in the URL here: http://wiki.apache.org/solr/UpdateCSV

The parameters:

  • commit - Setting commit to true tells Solr to commit the changes after all the records in the request have been indexed.
  • separator - The column separator is set to the TAB character (%09 is the URL-encoded ASCII TAB).
  • f.lookupids.split - The "f" is shorthand for field, and the field that is referenced is the "lookupids" field.  This parameter tells Solr to split the specified field into multiple values.
  • f.lookupids.separator - This parameter tells Solr to split the lookupids values on the comma (%2C is the URL-encoded comma).
  • stream.file - This parameter tells Solr to stream the file contents from the local file found at "exampledocs/testdata.txt".
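
To see where the %09 and %2C come from, the same URL can be assembled from plain parameter values. This is only a sketch using Python's standard library; the helper name build_csv_update_url is made up for illustration, and the base URL is the one from this post.

```python
from urllib.parse import urlencode

# Base URL of the CSV update handler, as used in this post.
BASE_URL = "http://localhost:8983/solr/update/csv"


def build_csv_update_url(stream_file, multi_valued_field="lookupids"):
    """Build the update URL; urlencode turns TAB into %09 and ',' into %2C."""
    params = {
        "commit": "true",
        "separator": "\t",
        f"f.{multi_valued_field}.split": "true",
        f"f.{multi_valued_field}.separator": ",",
        "stream.file": stream_file,
    }
    # safe="/" leaves the slash in the file path unescaped.
    return BASE_URL + "?" + urlencode(params, safe="/")


url = build_csv_update_url("exampledocs/testdata.txt")
print(url)
```

Building the query string this way avoids hand-encoding the separators if a different delimiter is used later.
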
The testdata.txt file was indexed into Solr in 17 to 18 seconds.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">17033</int>
  </lst>
</response>
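
QTime is reported in milliseconds, so the response above can be turned into a rough throughput figure. The following is a small sketch, not part of the original test: the function name read_header is an assumption, and the record count is the 100,000 used for the test file.

```python
import xml.etree.ElementTree as ET

# The response from this post, pasted in as a string.
RESPONSE_XML = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">17033</int>
  </lst>
</response>"""

RECORD_COUNT = 100_000  # size of the test file described earlier


def read_header(xml_text):
    """Return the responseHeader <int> values as a dict of name -> int."""
    root = ET.fromstring(xml_text)
    header = root.find("lst[@name='responseHeader']")
    # strip() tolerates whitespace or line breaks around the numbers.
    return {el.get("name"): int(el.text.strip()) for el in header.findall("int")}


header = read_header(RESPONSE_XML)
seconds = header["QTime"] / 1000.0  # QTime is in milliseconds
docs_per_second = RECORD_COUNT / seconds
print(f"status={header['status']} time={seconds:.3f}s rate={docs_per_second:.0f} docs/s")
```

For this run that works out to roughly 5,900 records per second, with the commit included in the measured time because commit=true was part of the request.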

Next I'll try indexing the same file using SolrJ on the same machine that is running Solr.
