Wednesday, November 27, 2013

AWS S3 - Example of searching files in S3 using regex and the ResponseStream

There have been times when I've needed to inspect the contents of text files that were created as MapReduce output and stored in S3. I had been downloading the files, but there were hundreds of them and each one was large (around 360 MB). Downloading every file took a long time and wasted a lot of disk space. I wanted a way to search for certain data and then cancel the search, so I could stop downloading data I didn't need.

The solution I chose was to run a Regex against the ResponseStream that is available when you make a GetObject call. That way the data is still downloaded, but it is read line by line and never stored on my computer.
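The idea can be tried without S3 at all. The following is a minimal sketch, not the tool's actual code: the class name `StreamSearch` and the method `MatchingLines` are hypothetical, and a `MemoryStream` stands in for the S3 `ResponseStream`. It reads any stream line by line, yields only the lines that match, and lets the caller stop early.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class StreamSearch
{
    // Lazily yields each line of the stream that matches the pattern.
    // Lines are read one at a time, so the whole content is never held at once.
    public static IEnumerable<string> MatchingLines(Stream stream, string pattern)
    {
        var rgx = new Regex(pattern, RegexOptions.IgnoreCase);
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (rgx.IsMatch(line))
                {
                    yield return line;
                }
            }
        }
    }

    public static void Main()
    {
        // Assumption: this in-memory data stands in for response.ResponseStream.
        byte[] data = Encoding.UTF8.GetBytes("id\tname\n1\tAlice\n2\tBob\n3\talice cooper\n");

        foreach (var line in MatchingLines(new MemoryStream(data), "alice"))
        {
            Console.WriteLine(line);
            break; // stop after the first hit; no further lines are requested
        }
    }
}
```

Because the method is an iterator, breaking out of the `foreach` disposes the reader and the underlying stream, which is roughly what cancelling a search does in the S3 case.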

Here is the main bit of code for searching the object's contents:


// Requires: System.IO, System.Text.RegularExpressions, System.Windows.Forms,
// Amazon.S3 and Amazon.S3.Model.
private void SearchObjectForString(AmazonS3 amazonS3, string bucketName, string key, string searchString)
{
    Cursor.Current = Cursors.WaitCursor;
    try
    {
        // Build the regex once, before making the request.
        var rgx = new Regex(searchString, RegexOptions.IgnoreCase);

        var request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = key
        };

        // Read the object line by line straight from the response stream;
        // nothing is written to disk.
        using (var response = amazonS3.GetObject(request))
        using (var reader = new StreamReader(response.ResponseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Let the UI process pending events so the 'cancelled'
                // field can be set while the search is running.
                Application.DoEvents();
                if (cancelled)
                {
                    return;
                }

                if (rgx.IsMatch(line))
                {
                    lstResults.Items.Add(string.Format("{0}/{1}:{2}", bucketName, key, line));
                }
            }
        }
    }
    finally
    {
        // Restore the cursor on every exit path, including exceptions.
        Cursor.Current = Cursors.Default;
    }
}
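
The method above checks a `cancelled` field and calls `Application.DoEvents()` to keep the UI responsive. An alternative I could have used is a `CancellationToken`. The following is only a sketch under that assumption, with the hypothetical names `CancellableSearch` and `Search`, and a `StringReader` standing in for the reader over the S3 response stream:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading;

public static class CancellableSearch
{
    // Reads lines until the reader is exhausted or cancellation is requested,
    // and returns the matching lines found up to that point.
    public static List<string> Search(TextReader reader, Regex rgx, CancellationToken token)
    {
        var results = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (token.IsCancellationRequested)
            {
                break; // stop reading; the caller disposes the reader
            }

            if (rgx.IsMatch(line))
            {
                results.Add(line);
            }
        }
        return results;
    }

    public static void Main()
    {
        var rgx = new Regex("error", RegexOptions.IgnoreCase);
        const string text = "ok\nERROR one\nok\nerror two\n";

        using (var cts = new CancellationTokenSource())
        {
            // Not cancelled: the whole input is searched.
            using (var reader = new StringReader(text))
            {
                Console.WriteLine(Search(reader, rgx, cts.Token).Count);
            }

            // Cancelled before the search starts: no lines are examined.
            cts.Cancel();
            using (var reader = new StringReader(text))
            {
                Console.WriteLine(Search(reader, rgx, cts.Token).Count);
            }
        }
    }
}
```

In a WinForms tool, a method like this would typically run on a background thread (for example via `Task.Run`), with a Cancel button calling `cts.Cancel()`, and any results would need to be marshalled back to the UI thread before being added to a list box.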

Here is a screenshot of the far-from-perfect regex search tool I made:
[Screenshot: Regex Search Tool]

As can be seen in the screenshot, the file I'm searching is stored as an object with the key "TestData/newpath/p-00000" and is a tab-separated values file.

The code for RegexSearchS3 can be found here.
