Wednesday, November 27, 2013

AWS S3 - Example of searching files in S3 using regex and the ResponseStream

There have been times when I've needed to inspect the contents of text files that were created as MapReduce output and stored in S3. I had been downloading the files, but there were hundreds of them and each one was very big (around 360 MB). It was a hassle: downloading every file took a long time and wasted a lot of disk space. I wanted a way to search for certain data and then cancel the search, so that I could stop downloading data I didn't need.

The solution I chose was to run a Regex against the ResponseStream that is available when you make a GetObject call. That way the data is still downloaded, but it is read as it streams and is never stored on my computer.

Here is the main bit of code for searching the object's contents:

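// Requires: using System.IO; using System.Text.RegularExpressions;
//           using System.Windows.Forms; using Amazon.S3; using Amazon.S3.Model;
private void SearchObjectForString(AmazonS3 amazonS3, string bucketName, string key, string searchString)
{
    Cursor.Current = Cursors.WaitCursor;

    // Build the GetObject request for the object to search
    var request = new GetObjectRequest();
    request.BucketName = bucketName;
    request.Key = key;

    // Read the response stream line by line so nothing is written to disk
    using (var response = amazonS3.GetObject(request))
    using (var reader = new StreamReader(response.ResponseStream))
    {
        string line;
        var rgx = new Regex(searchString, RegexOptions.IgnoreCase);

        while ((line = reader.ReadLine()) != null)
        {
            // cancelled is a flag declared elsewhere in the form;
            // stop reading (and downloading) as soon as it is set
            if (cancelled)
            {
                Cursor.Current = Cursors.Default;
                return;
            }

            if (rgx.IsMatch(line))
            {
                lstResults.Items.Add(string.Format("{0}/{1}:{2}", request.BucketName, request.Key, line));
            }
        }
    }

    Cursor.Current = Cursors.Default;
}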
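
The same streaming idea is not tied to the AWS SDK. As a rough sketch in Java, assuming you already have an InputStream for the object's contents (here a ByteArrayInputStream stands in for the S3 response stream), the search loop could look like the following. The class name StreamRegexSearch, the search method, and the AtomicBoolean cancel flag are my own illustrative names, not part of any SDK.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.regex.Pattern;

public class StreamRegexSearch {

    /**
     * Reads the stream line by line and returns every line that contains a
     * case-insensitive match for the regex. Stops reading as soon as the
     * cancelled flag is set, so the rest of the stream is never consumed.
     */
    public static List<String> search(InputStream in, String regex, AtomicBoolean cancelled)
            throws IOException {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (cancelled.get()) {
                    break; // caller asked to stop; skip the remaining data
                }
                if (pattern.matcher(line).find()) {
                    hits.add(line);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a tab separated object body streamed from storage
        String sample = "id\tname\n1\tAlpha\n2\tbeta\n3\talphabet\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));

        for (String hit : search(in, "alpha", new AtomicBoolean(false))) {
            System.out.println(hit);
        }
    }
}
```

Only one line is held in memory at a time, and setting the flag from another thread (a Cancel button handler, for example) ends the loop at the next line.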

Here is a screenshot of the far-from-perfect regex search tool I made:
[Screenshot: Regex Search Tool]

As can be seen in the screenshot, the file I'm searching is stored as an object with the key "TestData/newpath/p-00000", and is a tab-separated values file.

The code for RegexSearchS3 can be found here.

Wednesday, November 20, 2013

Index POJOs into Solr using SolrJ and SolrServer.addBeans()

SolrJ makes indexing and querying data from Solr instances very easy. One way that SolrJ helps make indexing data easier is that you can create a POJO (Plain Old Java Object) with annotations that map its fields to your schema.

Note: I used Solr 4.4 (and SolrJ 4.4) when creating this example.

I created a POJO named SampleDoc that maps a number of fields to the example Solr schema:

package org.codecognition;

import org.apache.solr.client.solrj.beans.Field;

public class SampleDoc {
    // A bare @Field maps the member to the schema field with the same name
    @Field
    private String id;

    @Field
    private String sku;

    @Field
    private String name;

    @Field
    private String title;

    @Field
    private float price;

    @Field
    private boolean inStock;

    public String getId() { return id; }

    public void setId(String id) { this.id = id; }

    public String getSku() { return sku; }

    public void setSku(String sku) { this.sku = sku; }

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }

    public String getTitle() { return title; }

    public void setTitle(String title) { this.title = title; }

    public float getPrice() { return price; }

    public void setPrice(float price) { this.price = price; }

    public boolean isInStock() { return inStock; }

    public void setInStock(boolean inStock) { this.inStock = inStock; }
}
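
To make the annotation mapping idea more concrete, here is a minimal sketch of how annotation-driven binding can work in principle, using plain Java reflection. It is not SolrJ's actual implementation: the SchemaField annotation, the Product bean, and the toDocument method are all hypothetical names I made up for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

public class AnnotationBindingSketch {

    /** Hypothetical stand-in for a schema-mapping annotation; not SolrJ's @Field. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface SchemaField {
        String value() default "";
    }

    /** Small bean used only for this illustration. */
    public static class Product {
        @SchemaField("id")
        private String id = "abc-123";

        @SchemaField // no name given: fall back to the Java field name
        private float price = 9.5f;

        private String notIndexed = "ignored"; // no annotation, so it is skipped
    }

    /** Collects the annotated fields of a bean into a schema-name to value map. */
    public static Map<String, Object> toDocument(Object bean) throws IllegalAccessException {
        Map<String, Object> doc = new TreeMap<>(); // sorted keys for stable output
        for (Field f : bean.getClass().getDeclaredFields()) {
            SchemaField mapping = f.getAnnotation(SchemaField.class);
            if (mapping == null) {
                continue;
            }
            f.setAccessible(true); // allow reading private fields
            String name = mapping.value().isEmpty() ? f.getName() : mapping.value();
            doc.put(name, f.get(bean));
        }
        return doc;
    }

    public static void main(String[] args) throws IllegalAccessException {
        System.out.println(toDocument(new Product())); // prints {id=abc-123, price=9.5}
    }
}
```

The point is only that the annotation carries the schema field name, so the bean itself never needs to know how it is turned into a document.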

The POJO can then be indexed using the HttpSolrServer.addBeans() method:

package org.codecognition;

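import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

public class SolrJWithAddBeansExample {

    // Illustrative value only; pick whatever record count suits your test
    private static final int NUM_RECORDS = 10;

    public static void main(String[] args) {
        try {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

            // Index the annotated beans, then commit so they become searchable
            server.addBeans(makeTestRecords(NUM_RECORDS));
            server.commit();
        } catch (SolrServerException | IOException e) {
            e.printStackTrace();
        }
    }

    private static List<SampleDoc> makeTestRecords(int numRecords) {

        List<SampleDoc> sampleDocs = new ArrayList<SampleDoc>(numRecords);
        Random random = new Random();

        for (int i = 0; i < numRecords; i++) {
            SampleDoc sampleDoc = new SampleDoc();

            // Every document needs a unique id; sku, price and inStock keep their defaults here
            sampleDoc.setId(UUID.randomUUID().toString());
            sampleDoc.setName(String.format("test%s", random.nextInt(10000)));
            sampleDoc.setTitle(String.format("test title %s", random.nextInt(10000)));

            sampleDocs.add(sampleDoc);
        }

        return sampleDocs;
    }
}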

You can find the code here.