Tuesday, December 11, 2012

Windows service frustration...

I've started a personal project that uses Solr.  We're using Solr at work, and I figured a personal project would be a great way to learn as much as I can about how to set up and use Solr.

My project currently consists of a Windows service that fetches data in one thread, and uses another thread to periodically index the fetched data into Solr.  

This is where the frustration comes in.  I have been debugging the Windows service code while updating the database, the Solr schema, etc.  When I find something I want to change in the service code, I uninstall the service, recompile the code, and re-install the service.  Occasionally, when I attempt to uninstall the service, Windows displays a message stating that the service is marked for deletion:
The specified service has been marked for deletion
I can't install the service until it is uninstalled, and Windows won't finish uninstalling the service until the machine is rebooted.  I tried the "sc delete <service name>" method for deleting a service, and I got the same message about the service being marked for deletion.  I ended up rebooting, and the service was gone when Windows restarted.

I did a search for "remove service without rebooting" and found this blog post.  The post says to uninstall the service while the Services window is closed.  I'll definitely follow that advice the next time I have this issue.

Has anyone else experienced this problem and found a different way to avoid the problem?

Update: Following the advice from the blog post mentioned above definitely works.  I haven't had an issue as long as I make sure the Services window is closed when uninstalling.  Perhaps it is coincidence, and I might as well have thrown chicken bones at the computer, but I followed the advice and I haven't seen the problem again.

Update (2013-01-09): I've now seen the "marked for deletion" issue at work while services are being deployed, and the Services window is most likely not open on the target machines.  My only guess is that there are multiple ways for Windows to decide that a service is still in use, so it marks the service for deletion to prevent additional processes from querying its state.  For now I will keep making sure the Services window is closed, and suffer through reboots when a service refuses to be uninstalled.

Wednesday, December 5, 2012

Heat Maps

I saw the following article on dev zone: Real-Time Twitter Heat Map with MongoDB


What?!?  People on the coasts and in Western Europe use Twitter?!?

It was funny to see how closely the heat map of Twitter users matches the heat maps in the xkcd.org heat map comic.

Sunday, November 25, 2012

Solr - Solr .Net Client

I was looking for Solr .Net clients and found SolrNet.  The SolrNet page has links for downloading the binaries, and it also has a link to the git repository.  

I downloaded the SolrNet source code so I could compare the performance of indexing documents using the SolrNet and SolrJ clients, and then built the code.  Next, I created a simple console application and referenced the SolrNet library.  I then created a method that was basically a copy of the Java code I used with SolrJ for reading in a CSV file and indexing batches of "documents".  The SolrJ version used POJOs (Plain Old Java Objects) with annotations specifying which Solr fields the properties map to.  The SolrNet version used POCOs (Plain Old CLR Objects) with the equivalent annotations.

Here is an example of the POCO I used:


using System.Collections.Generic;
using SolrNet.Attributes;

public class TestRecord
{
    [SolrUniqueKey("id")]
    public string ID { get; set; }

    [SolrField("lookupids")]
    public List<int> LookupIDs { get; set; }
}

Here is an example of the code that indexed the values of the test records in batches:

using System.Collections.Generic;
using Microsoft.Practices.ServiceLocation;
using SolrNet;

// The Solr server was initialized earlier in the code using
// the following line of code:
// Startup.Init<TestRecord>("http://localhost:8983/solr");

public static void AddValues(List<TestRecord> testRecords)
{
    // Resolve the ISolrOperations<TestRecord> instance that Startup.Init registered.
    var solr = ServiceLocator.Current.GetInstance<ISolrOperations<TestRecord>>();
    solr.AddRange(testRecords);  // send the whole batch in one request
    solr.Commit();
}

It seems to index the data about as fast as the SolrJ code - which isn't terribly surprising.  It appeared slightly slower, but I will need to run multiple tests with varying batch sizes to see how similar the results are between SolrNet and SolrJ.

It took ~2.5 minutes to index 100000 documents when using batches of 100 test records, ~50 seconds for batches of 1000 test records, and ~45 seconds for batches of 10000 test records.

SolrNet makes it very easy to write code for querying against, or indexing into, a Solr index.  I was very pleased!

Friday, November 23, 2012

Reading RSS Feeds with Java and C#


I wanted to read a few RSS feeds using Java or C#.  I started to write my own code for reading the RSS feed, and quickly realized that using the stubbed code generated by the various XSDs available for RSS is kind of a pain.  I started to rewrite the code to use XML annotations, and that seemed like a bit too much work.  Then it dawned on me that I should just search for existing RSS libraries, and that's when I found Rome.

I was able to read an RSS feed with just two lines of code that I copy/pasted from the Rome tutorial page.  Very nice!

Here is a link to the tutorial:
http://wiki.java.net/twiki/bin/view/Javawsxml/Rome05TutorialFeedReader
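
From what I remember, the core of that tutorial boils down to something like the following sketch (the feed URL is just a placeholder):

import java.net.URL;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

// Roughly the two lines from the Rome tutorial, wrapped in a runnable class.
public class FeedReader {
    public static void main(String[] args) throws Exception {
        URL feedUrl = new URL("http://example.com/rss.xml");  // placeholder feed
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedUrl));
        System.out.println(feed.getTitle());
    }
}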

I was also interested in seeing if there were any libraries for reading RSS feeds using C#.  I found RSS.Net.  http://www.rssdotnet.com/

It was really easy to get started by following the code examples that the author provided.

So whether you are using Java or C#, there is a handy library available for reading RSS feeds.


Solr - Indexing Data Using SolrJ and addBeans

So far it looks like indexing data using SolrJ is considerably slower than indexing data using the update handler and a local CSV file.  It took about 36 to 40 seconds to index 100000 documents using SolrServer.addBeans() compared to about 17 to 18 seconds using the update handler and a local CSV file.

The code using SolrJ, listed below, was running on the same machine as Solr.

public static void IndexBeanValues(List<TestRecord> testRecords)
    throws IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    server.addBeans(testRecords);  // one request for the whole batch
    server.commit();
}
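
For addBeans() to work, the bean needs annotations telling SolrJ which Solr fields its members map to.  Here is a sketch of the TestRecord bean (the member and getter names are from my test code, so treat them as an example):

import java.util.List;
import org.apache.solr.client.solrj.beans.Field;

// The @Field annotation maps each member to a Solr schema field.
public class TestRecord {
    @Field("id")
    private String id;

    @Field("lookupids")
    private List<Integer> lookupIds;

    public String getId() { return id; }
    public List<Integer> getLookupIds() { return lookupIds; }
}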

I tried passing in a shared SolrServer instance rather than constructing one per call, but it didn't make any noticeable difference in timing.  Instantiating a new SolrServer for each batch might matter more when the Java code using SolrJ is running on a different machine than the Solr server being targeted.

Refer to this post for a more detailed code example using SolrJ and addBeans.

Solr - Indexing Data Using SolrJ

I think I found one of the slowest ways possible to index data into Solr.  I'm looking into several indexing approaches:
  • indexing text files local to the server that Solr is running on using the update handler
  • indexing data using an app using SolrJ that is running on the same server as Solr
  • indexing data using an app using SolrJ that is on a different machine on the same network that the Solr server is on
I was able to index 100000 items of data into Solr using the update handler to process a CSV file in about 17 to 18 seconds.  Next I tried indexing the same data using SolrJ.  It took about 6 minutes!  I'm sure that the reason it took so long is the way that I wrote the method to index the data.  

The method looks like this:

public static void IndexValues(TestRecord[] testRecords)
    throws IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    for (int i = 0; i < testRecords.length; ++i) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", testRecords[i].getId());
        for (Integer value : testRecords[i].getLookupIds()) {
            doc.addField("lookupids", value);
        }
        server.add(doc);  // one HTTP request per document - the likely bottleneck
        if (i % 100 == 0) server.commit();  // periodically flush
    }
    server.commit();
}

I'll have to try something similar, but using beans.  It seems like it could be a bit faster if I used the addBeans method to add multiple documents at once.
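
Something like this sketch is what I have in mind (the method name and batchSize parameter are my own):

public static void IndexBeansInBatches(List<TestRecord> testRecords, int batchSize)
    throws IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    for (int start = 0; start < testRecords.size(); start += batchSize) {
        int end = Math.min(start + batchSize, testRecords.size());
        // One HTTP request per batch instead of one request per document.
        server.addBeans(testRecords.subList(start, end));
    }
    server.commit();
}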

Wednesday, November 21, 2012

Solr - Indexing Local CSV Files

As part of my investigation into all things Solr, and in particular the indexing of data into Solr cores, it occurred to me that indexing local files should be much faster than sending bits of data across a network to be indexed.  I am going to test the following:
  • indexing text files local to the server that Solr is running on using the update handler
  • indexing data using an app using SolrJ that is running on the same server as Solr
  • indexing data using an app using SolrJ that is on a different machine on the same network that the Solr server is on
My first step will be to get timings for indexing text files that are located on the Solr server.  I wrote an app that generates random data with unique IDs (GUIDs); one of its arguments specifies how many records to create.  I created a test file with 100000 records.
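
The generator is nothing fancy; here is a sketch of the approach (the default record count, lookup-id ranges, and file name reflect my test setup):

import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;

// Writes a header row followed by rows of GUID<TAB>comma-separated ints.
public class TestDataGenerator {
    public static void main(String[] args) throws Exception {
        int recordCount = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        Random random = new Random();
        PrintWriter out = new PrintWriter("testdata.txt", "UTF-8");
        out.println("id\tlookupids");  // the first line names the target fields
        for (int i = 0; i < recordCount; i++) {
            StringBuilder lookupIds = new StringBuilder();
            int count = 1 + random.nextInt(10);  // 1 to 10 lookup ids per record
            for (int j = 0; j < count; j++) {
                if (j > 0) lookupIds.append(',');
                lookupIds.append(random.nextInt(1000000));
            }
            out.println(UUID.randomUUID() + "\t" + lookupIds);
        }
        out.close();
    }
}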

Here is an example of the output data, where <tab> stands for a literal TAB character:

269d8a33-0fd6-4877-b631-dccc4146cf90<tab>11507,25964,118430,306825,315793,348797,349191

The data is stored in a character-separated value file with just two columns.  The first line of the data file lists the fields that the data will be indexed into, with the field names separated by the same separator used to divide the columns of data.  The first column is mapped to the schema's id field, and the second column is mapped to the schema's lookupids field.

I modified the schema.xml to add a field named "lookupids" with type="int" and multiValued="true".

I copied a file named testdata.txt to the exampledocs directory, and then imported the data using this URL:

http://localhost:8983/solr/update/csv?commit=true&separator=%09&f.lookupids.split=true&f.lookupids.separator=%2C&stream.file=exampledocs/testdata.txt

I found the information on what to use in the URL here: http://wiki.apache.org/solr/UpdateCSV

The parameters:

  • commit - Setting commit to true tells Solr to commit the changes after all the records in the request have been indexed.
  • separator - The separator is set to the TAB character (%09 is the URL-encoded TAB).
  • f.lookupids.split - The "f" is shorthand for field, and the field being referenced is "lookupids".  This parameter tells Solr to split the specified field into multiple values.
  • f.lookupids.separator - This tells Solr to split the lookupids values on the comma (%2C is the URL-encoded comma).
  • stream.file - This tells Solr to stream the file contents from the local file at "exampledocs/testdata.txt".
The testdata.txt file was indexed into Solr in 17 to 18 seconds.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">17033</int>
  </lst>
</response>
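
Since the update handler is just an HTTP endpoint, the same import can also be kicked off from code.  A minimal sketch (same URL as above, error handling omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Requests the CSV update URL and prints the <response> XML shown above.
public class CsvImport {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8983/solr/update/csv?commit=true"
            + "&separator=%09&f.lookupids.split=true&f.lookupids.separator=%2C"
            + "&stream.file=exampledocs/testdata.txt";
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}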

Next I'll try indexing the same file using SolrJ on the same machine that is running Solr.

Tuesday, November 20, 2012

Solr - Querying with SolrJ

I added a method, cleverly named FindIds, to my test code that will find the unique IDs by doing a search on lookup IDs that are in a range from 0 to 10000.  The query string looks like this:

"lookupids:[0 TO 10000]"

Even though the query is incredibly simple, I used the Solr admin page to test the query first.  That way I knew what to expect to see from SolrJ.  If I got different results from SolrJ, then I would know that I needed to investigate why there was a difference.

Here is the code used:

public static void main(String[] args)  {
  ArrayList<String> ids = SolrIndexer.FindIds("lookupids:[0 TO 10000]");

  for (String id : ids) {
    System.out.printf("%s%n", id);
  }
}

public static ArrayList<String> FindIds(String searchString) {
  ArrayList<String> ids = new ArrayList<String>();
  int startPosition = 0;

  try {

    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery();
    query.setQuery(searchString);
    query.setStart(startPosition);
    query.setRows(20);
    QueryResponse response = server.query(query);
    SolrDocumentList docs = response.getResults();
    
    while (docs.size() > 0) {
      startPosition += 20;

      for(SolrDocument doc : docs) {
        ids.add(doc.get("id").toString());
      }

      query.setStart(startPosition);
      response = server.query(query);
      docs = response.getResults();
    }

  } catch (SolrServerException e) {
    e.printStackTrace();  
  }
  
  return ids;
}

If you don't set the number of rows to return using query.setRows(), then Solr uses its default row count, which is 10 in the Solr example config.

If the query results in nothing being found, then the SolrDocumentList is instantiated with 0 items.  If there is an error with the query string, then an Exception will be thrown.

Something nice to add is a sort field, e.g., query.setSortField("id", SolrQuery.ORDER.asc);
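
In the FindIds method above, the sort would slot in right where the query is built; a stable sort also makes the paging order deterministic:

SolrQuery query = new SolrQuery();
query.setQuery(searchString);
query.setSortField("id", SolrQuery.ORDER.asc);  // stable ordering across pages
query.setStart(startPosition);
query.setRows(20);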

There are still some areas that I want to explore with SolrJ, so I will be posting more in the near future.

Monday, November 19, 2012

Solr - Indexing data using SolrJ


I'm using Solr at work, so I've been experimenting at home with various ways to index data into Solr.  The latest method I tried using is SolrJ.

The Setup

I have Solr set up on my Windows 7 box - I just downloaded the Solr 4.0 zip from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.0.0.   I'm using IntelliJ Community Edition, so I created a new project and then added references to the necessary jar files by going to the Project Settings | Libraries section.  I clicked the + sign, and picked "Java", and then selected all of the SolrJ related jars.  SolrJ is distributed with Solr, and the related jar files for using SolrJ can be found in %SOLR_HOME%\dist and %SOLR_HOME%\dist\solrj-lib.

The Code

Using SolrJ to index data into Solr is amazingly simple.  The sample code found at solrtutorial.com is almost usable as straight copy and paste.  The site targets a version of Solr older than 4.0, but everything will work as long as you change the import for CommonsHttpSolrServer to HttpSolrServer.  Of course, you will also want to use field names that match your schema, but if you use the example Solr instance (and therefore the example schema.xml) to test your code, then it will work fine.
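
For reference, the only code change needed is the import:

// The pre-4.0 tutorial code imports this class, which was removed in Solr 4.0:
// import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
// Use this instead:
import org.apache.solr.client.solrj.impl.HttpSolrServer;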

First, I updated the %SOLR_HOME%\example\solr\collection1\conf\schema.xml file by adding a multi-valued int field called "lookupids".

<field name="lookupids" type="int" indexed="true" stored="true" multiValued="true"/>

I reloaded the core using the Solr admin page to make sure that I didn't manage to screw up the schema.  (From the Solr admin page, click Core Admin, and then click the Reload button; the core should turn green after loading if the schema is valid.)

Second, I created a new Java project using IntelliJ.  I had the code read a CSV file to populate an array of objects with lookup IDs.  I used "lookup" IDs for no particular reason other than I thought it made as much sense as any other arbitrary property name to search on.

After the code loads an array of Widgets, it calls a method named IndexValues.  Here is the mostly copy-and-paste code from the solrtutorial site:

public static void IndexValues(String solrDocId, List<Widget> widgets) throws
    IOException, SolrServerException {

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    int i = 0;  // added: the tutorial snippet referenced an undeclared i
    for (Widget widget : widgets) {
        SolrInputDocument doc = new SolrInputDocument();
        // Each Solr document needs a unique id, so the caller-supplied id is
        // suffixed with a counter; otherwise every add would overwrite the last.
        doc.addField("id", solrDocId + "-" + i);
        for (Integer value : widget.getLookUpIds()) {
            doc.addField("lookupids", value);
        }
        server.add(doc);
        if (i % 100 == 0) server.commit();  // periodically flush
        ++i;
    }
    server.commit();
}


It would be a good idea to have the URL and number of items to index between commits configurable, but I just wanted to get data indexed with as little work as possible.  It was very simple thanks to the solrtutorial.com site.

Also, it might be a good idea to have the HttpSolrServer instantiated by a singleton provider class.  The provider class could have helper methods for pinging the Solr instance, and for doing an optimize.  There might be well-known patterns to follow, so I would read the wiki and look for existing examples before creating a SolrJ based utility.
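
Here is a sketch of the provider idea (the class and method names are my own invention):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Lazily creates a single shared HttpSolrServer and adds a couple of helpers.
public final class SolrServerProvider {
    private static final String SOLR_URL = "http://localhost:8983/solr";
    private static SolrServer instance;

    private SolrServerProvider() {}

    public static synchronized SolrServer get() {
        if (instance == null) {
            instance = new HttpSolrServer(SOLR_URL);
        }
        return instance;
    }

    // Helper for pinging the Solr instance to verify it is reachable.
    public static void ping() throws IOException, SolrServerException {
        get().ping();
    }

    // Helper for asking Solr to optimize the index.
    public static void optimize() throws IOException, SolrServerException {
        get().optimize();
    }
}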

Let me know if you come across any good practices to follow, or pitfalls to avoid, when using SolrJ.

I'm going to try using SolrJ to do searches next.  Let me know if there is anything in particular I should watch out for, or if there is anything that you would like me to write about regarding Solr, SolrJ, etc.

Sunday, November 18, 2012

Solr - Indexing documents

We're using Solr 4.0 at work, so I decided that I should spend some time messing around with the gears and levers to make sure that I really understand what I'm doing.

I made a schema that included a field called "id" and a multi-valued field called "lookupids".  I created a file that had a header row of "id<tab>lookupids", and data rows that had a GUID followed by random ints separated by commas.  e.g.,

269d8a33-0fd6-4877-b631-dccc4146cf90<tab>11507,25964,118430,306825,315793,348797,349191

The file contained 100000 entries, and I was able to index the file using a URL like this:

http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=exampledocs/test_with_lookupids.txt

One thing that I was expecting to happen was for the results to return the lookupids as an array.  Instead the lookupids field values are returned the same way they were stored in the source file.


<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">e09d8f38-c1ef-4a97-a832-a4bdc0b18bc5</str>
    <str name="lookupids">
      2,16481,38485,50205,101885,107642,110903,142770,174184,193689,204770,223341,225669,242335,253654,278519,284132,333735,352163,372383,377816,401338,420851,443967,500899,575204,593052,645555,667294,742558,757738,804361,826200,828540,839016,859782,875115,877853,893658,915890,945398,954502,969859,971992,989172
    </str>
    <long name="_version_">1419020904549056527</long>
  </doc>
</result>


The reason I was expecting the lookupids to be returned as an array is that the lookupids field was defined as follows:

<field name="lookupids" type="commaDelimited" indexed="true" stored="true" multivalued="true"/>

<fieldType name="commaDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*" />
  </analyzer>
</fieldType>

I figured that defining the field as multiValued, with the commaDelimited type using the PatternTokenizer to split tokens on commas, would produce the array response.

I'll update this post once I figure out how to get the results as an array.


Tuesday, October 30, 2012

Legacy App Nightmare

I'm working on a project for work that includes an update to a legacy application.  The application was written by contractors about six years ago, and it appears that they weren't given any time or incentive to refactor any of their work.

The legacy app is a Windows Forms application written in C#.  It probably had a decent design to start - some of the framework of the application makes it easy to add new functionality.  Other decisions they made are completely perplexing to me, and unfortunately there isn't anyone around to ask why things were implemented the way they were.

I had to start digging - stepping through the code, and making notes along the way.  One thing that occurred to me is that I never want someone to look at code I've written and feel as frustrated as I've felt when slogging through this particular legacy application code.  I feel like I follow good coding practices, but it is nice to have opportunities to see what things are like when good practices are not followed.

1. Comments are not bad - even in today's "agile" world.  Good coding principles dictate that we should name methods so that we know what the method is doing, but how do we tell people why we implemented the method the way we did?  We can use comments!  If there is no ambiguity as to why a method is doing what it is doing, then no comment is necessary.  Just remember to consider whether or not the method will make sense to someone who was not involved in the original design when they look at it six years later.

2. Use informative and accurate method names.  This is not a new idea, but that doesn't mean it is always easy, or that it isn't overlooked.  The code I need to modify has a method called ProcessList.  That's great.  If I were to see that in a call stack, how would I have any idea of what was being done?  I now need to read the method to see whether the list's contents are being modified, whether the contents are keys for some other action that works on some other bit of data, or something else entirely.  Another thing to watch out for is that your design might shift a bit, and your initial method names might not stay as informative as they were when you chose them.

3. Use informative variable names.  It is incredibly frustrating to see methods that have variables like this:


List<int> list = new List<int>();

I want to know what the list is being used for, not just that something is a list of ints.  A descriptive name (failedOrderIds, for example) might even help you recognize a non-fatal logic problem in the code.

Refactoring is important to do on your project prior to releasing to production.  If you wait to refactor in a follow-up project, then there is a good chance that the time to do the refactoring will never be available.  That makes sense, because refactoring is potentially more expensive as a follow-up project/story than it is as part of your original project/story: you could be taking resources away from other projects that could be generating new revenue.  Also, it doesn't take much time before non-refactored code becomes confusing when the whys aren't commented and the method names are poorly chosen.