Wednesday, November 27, 2013

AWS S3 - Example of searching files in S3 using regex and the ResponseStream

There have been times when I've needed to inspect the contents of text files that were created as map reduce output and stored in S3. I had been downloading the files, but there were hundreds of them and they were all very big (around 360 MB each). It was a hassle since it took a long time to download every file, and it wasted a lot of disk space. I wanted a way to search for certain data, and then cancel my search so I could stop downloading so much data.

The solution I chose was to run a Regex against the ResponseStream that is available when you do a GetObject call. That way the data is still downloaded, but it is never stored on my computer.

Here is the main bit of code for searching the object's contents:


private void SearchObjectForString(AmazonS3 amazonS3, string bucketName, string key, string searchString)
{
    Cursor.Current = Cursors.WaitCursor;
    try
    {
        // Issue call
        var request = new GetObjectRequest();
        request.BucketName = bucketName;
        request.Key = key;

        // Build the regex once, before any data is read.
        var rgx = new Regex(searchString, RegexOptions.IgnoreCase);

        // The response stream is read line by line, so nothing is written to disk.
        using (var response = amazonS3.GetObject(request))
        using (var reader = new StreamReader(response.ResponseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Let the UI process events so a cancel request can set the cancelled field.
                Application.DoEvents();
                if (cancelled)
                {
                    return;
                }

                if (rgx.IsMatch(line))
                {
                    lstResults.Items.Add(string.Format("{0}/{1}:{2}", request.BucketName, request.Key, line));
                }
            }
        }
    }
    finally
    {
        // Restore the cursor on every exit path: completion, cancel, or exception.
        Cursor.Current = Cursors.Default;
    }
}
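
The read-a-line, test-a-line, stop-when-cancelled idea does not depend on the AWS SDK. Here is a rough sketch of the same loop in Java rather than C#, run against an in-memory string instead of an S3 response stream. The class name, the method name, and the sample data are made up for illustration and are not part of the RegexSearchS3 tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

public class StreamRegexSearch {

    // Reads the source one line at a time and keeps the lines that contain a match.
    // Stops reading as soon as isCancelled reports true.
    public static List<String> search(Reader source, String regex, BooleanSupplier isCancelled)
            throws IOException {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (isCancelled.getAsBoolean()) {
                    break;
                }
                if (pattern.matcher(line).find()) {
                    hits.add(line);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for an object's contents.
        String data = "alpha\t1\nBravo\t2\ncharlie\t3\n";
        List<String> hits = search(new StringReader(data), "^b", () -> false);
        System.out.println(hits);
    }
}
```

Only the matching lines are kept in memory, so the size of the source does not matter much as long as individual lines are a reasonable length.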

Here is a screen shot of the far from perfect regex search tool I made:
[Screenshot: Regex Search Tool]

As can be seen in the screen shot, the file I'm searching is stored as an object with the key "TestData/newpath/p-00000", and is a tab separated value file.  

The code for RegexSearchS3 can be found here.

Wednesday, November 20, 2013

Index POJOs into Solr using SolrJ and SolrServer.addBeans()

SolrJ makes indexing and querying data from Solr instances very easy.  One way that SolrJ helps make indexing data easier is that you can create a POJO (Plain Old Java Object) with annotations that map its fields to your schema.

Note: I used Solr 4.4 (and SolrJ 4.4) when creating this example.

I created a POJO named SampleDoc that maps a number of fields to the example Solr schema:

package org.codecognition;

import org.apache.solr.client.solrj.beans.Field;

public class SampleDoc {
    @Field("id")
    private String id;

    @Field("sku")
    private String sku;

    @Field("name")
    private String name;

    @Field("title")
    private String title;

    @Field("price")
    private float price;

    @Field("inStock")
    private boolean inStock;

    public String getId() { return id; }

    public void setId(String id) { this.id = id; }

    public String getSku() { return sku; }

    public void setSku(String sku) { this.sku = sku; }

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }

    public String getTitle() { return title; }

    public void setTitle(String title) { this.title = title; }

    public float getPrice() { return price; }

    public void setPrice(float price) { this.price = price; }

    public boolean isInStock() { return inStock; }

    public void setInStock(boolean inStock) { this.inStock = inStock; }
}


The POJO can then be indexed using the HttpSolrServer.addBeans() method:

package org.codecognition;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

public class SolrJWithAddBeansExample {

    public static void main(String[] args) {
        try {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

            server.addBeans(makeTestRecords(100));

            server.commit();
        } catch (SolrServerException | IOException e) {
            e.printStackTrace();
        }
    }

    private static List<SampleDoc> makeTestRecords(int numRecords) {

        List<SampleDoc> sampleDocs = new ArrayList<SampleDoc>(numRecords);
        Random random = new Random();

        for (int i = 0; i < numRecords; i++) {
            SampleDoc sampleDoc = new SampleDoc();

            sampleDoc.setId((UUID.randomUUID().toString()));
            sampleDoc.setInStock(true);
            sampleDoc.setName(String.format("test%s", random.nextInt(10000)));
            sampleDoc.setPrice(random.nextFloat());
            sampleDoc.setSku("somesku");
            sampleDoc.setTitle(String.format("test title %s", random.nextInt(10000)));
            sampleDocs.add(sampleDoc);
        }

        return sampleDocs;
    }
}

You can find the code here.

Friday, August 23, 2013

Using custom matchers for POJOs with Mockito...

I was trying to do the "right" thing by using TDD while working on a project, and I hit what looks to be a common problem: figuring out how to tell a mock object what to return based on the argument value that was passed in to the mock's method. It is really easy to specify which values to look for if you are dealing with a simple data type, but it isn't as straightforward if the argument is a complex object. The solution I used is the one I found on StackOverflow.

Here is the code I used:
// member of test class
private MyWorker mockMyWorker = mock(MyWorker.class);
private List<String> listOfOutputValues;
private final String theValueIWant = "some test value";

The class I am testing will use a helper class named MyWorker in this example. The class being tested will call the method MyWorker.listOutputValues(SomeObject) to get a list of output values. The SomeObject has an attribute named someValue. I only want the MyWorker class to return the output listing if the someValue matches a specific value.

I added a "when" statement in the @Before section of the test class. The "when" statement uses the custom matcher to check the value of SomeObject.getSomeValue().

@Before
public void setUp() {
 // JUnit 4 only runs @Before methods that are public.
 listOfOutputValues = createTestOutputValues();

 when(mockMyWorker.
                listOutputValues(argThat(hasValidAttributeValue()))).
                   thenReturn(listOfOutputValues);
}

When the mockMyWorker.listOutputValues(SomeObject) method is called in the test code, then it will return the list of Strings if SomeObject.getSomeValue() equals theValueIWant. Here is how the custom matcher is defined in the test class:
// private method in test class
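private Matcher<SomeObject> hasValidAttributeValue() {
 return new BaseMatcher<SomeObject>() {
  @Override
  public boolean matches(Object o) {
   // Guard against null and unexpected types before casting.
   return o instanceof SomeObject
     && theValueIWant.equals(((SomeObject) o).getSomeValue());
  }

  @Override
  public void describeTo(Description description) {
   // Used in failure messages to describe what the matcher expects.
   description.appendText("SomeObject with someValue equal to ")
     .appendValue(theValueIWant);
  }
 };
}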

Monday, August 19, 2013

Ask the rubber duck...

Have you ever been moving along on a project and - BAM! - you hit a brick wall of a problem where it isn't obvious how you should handle it?

I had that happen last night while I was working on a personal project. I had just read a post on codinghorror.com that mentioned a problem solving technique which involves asking a rubber duck whatever question you might have about a problem before you go to a teammate or boss. The idea is that if you take the time to formulate the question in an understandable way, and ask the question out loud, then you have a good chance of figuring out a solution to your problem by yourself.

I've had numerous times where I will start talking to a teammate about a problem I'm having and the solution will come to me while I'm explaining the issue.  While it is really helpful to be able to talk to a teammate to work through the problem, it is a lot less distracting to others, and much more rewarding, to work through the problem on my own.

I thought it might look a bit awkward if I started talking to inanimate objects at the house with my wife and kids there, so I decided to talk through the problem with my wife instead.  Of course it worked great!  I spent 5 minutes (maybe less) talking about the problem, and the solution came to me.  Next time I think I will go in the garage and talk out the problem.

I've also had similar success when going for walks.  I will specifically not think about the problem at first. About half way through the walk I will start to think about the problem I'm having, and the problem will usually seem a lot less confusing.

In any case, rubber duck or no rubber duck - talking through problems is extremely helpful.  You might look a bit crazy if you start talking to rubber ducks, but at least you will be able to get your work done.


Thursday, August 15, 2013

Great idea from Effective Programming book...

I was reading Jeff Atwood's book Effective Programming, which is based on his excellent Coding Horror blog, and came across a great suggestion. 

Git clone projects that you use.  Reason - if you encounter a problem while using some third party project, then you will better understand how to debug the problem.  You might be able to find and fix bugs in the third party project, but you might be just as likely to learn how you are using some other code incorrectly.

One step further - git clone interesting projects.  Reason - see how other people code and how it differs from what you do. Think about why you like some bit of code or why you don't like it.  The important thing is to have an opinion.  You can always change your mind later.

I've sort of been doing this, but I'm going to try to be more methodical about it now.  It seems like a great way to learn, and get faster at diagnosing root causes of bugs.

Sunday, August 11, 2013

Using git with Google Code projects on Windows...

I decided to host a project on Google Code, and chose git as the source repository since that is what I am familiar with. The Google Code project page presents an auto generated password when you first create your project for use when pushing changes. You can regenerate this password if you need to.

One drawback of an auto generated password is that it is purposely hard to remember, and therefore not as easy to type accurately on the command line. The project's source section suggests the following:

Add the following to your .netrc. 
machine code.google.com login <your google email address> password [generated googlecode.com password] 

I had no idea what a .netrc file was, so I looked around. At least on *nix machines, the .netrc file contains auto login information that can be used by ftp and rexec commands.

From what I read it sounded like I just needed to create the .netrc file in my user's home directory with the line as it is listed above.  I created the .netrc file, but I was still being prompted for my username and password when I attempted to push changes to the repository. I searched a little more and I found a comment by someone named Dave Borowitz saying that the file should be named _netrc on Windows machines.

I changed my file name to _netrc and then I was able to push changes without entering my password.  

I didn't spend a great deal of time looking for the answer, but I feel like I was a bit lucky as well. I think there should be a mention about naming the file _netrc on the Google project source page.

Thursday, July 25, 2013

Alias often...

It's often very frustrating to retype shell commands over and over.  One thing that I sometimes forget to do is to create aliases for common commands.  I just added an alias for something I do quite often - search the shell history for some command I typed in either the previous or current session.

Edit your config file of choice (I updated .profile) and add the following:

alias hgrep='history | grep'

Now I can just type hgrep along with the search phrase of choice.  For example:

somemachine:someuser:/opt/somedir/home/username: hgrep 'start'
  235  ../bin/startup.sh
  250  ../bin/startup.sh
  256  ../bin/startup.sh
 1023  hgrep 'startserv'
 1024  hgrep 'start'

Wednesday, July 17, 2013

Toyota Tacoma compass display fix...

The compass and temperature overhead display in my 2007 Toyota Tacoma had stopped working a couple of years ago.  I called the dealership and found out that they don't fix that issue - they just replace the unit. A new compass unit, and the replacement time, costs about $300 (at least it did a couple years ago). I didn't want to  pay that much, so I just left it alone.  

Recently I found a YouTube video that shows someone repairing the unit.  The issue that the repair video was addressing is that some of the resistors have weak solder connections, and re-soldering the resistors with poor contact will fix the issue. It turned out to be that easy for me, and I was happy to have a video as a guide.  The person who made the video warned against putting too much stress on some of the plastic tabs, and that warning kept me from breaking tabs.

I removed the overhead display, removed  the display from the housing, and inspected the resistors. There were a number of surface mount resistors, but only two resistors (the ones marked 510) looked like they might not be making good contact.  I used a multimeter to check all of the resistors on the board to make sure they were okay.  Perhaps that was overkill - I wouldn't expect that the resistors would be bad - but it gave me an excuse to use the multimeter and that was good enough for me. Next, I heated up the existing solder with my soldering iron. I figured that would  restore good contact between the board and the resistors.  Then I added a tiny amount of solder to ensure that the connection was good.  I rechecked the resistors with the multimeter, and then reinstalled the overhead display.  I started the truck, and nothing displayed. I was pretty disappointed for a second, but I pushed the button on the display and there it was - temperature and compass direction!  After two years or so of not having a functioning display I had forgotten that the button cycles between Celsius, Fahrenheit, and turning the display off.

Everything worked great, and I saved $300!


Tuesday, July 16, 2013

AWS S3 Sample Using TransferUtility.DownloadDirectory...

I need to write a utility for a project that I'm working on, and the utility will need to download all of the files in an S3 directory. Luckily the AWSSDK provides an easy way to do this with the TransferUtility.DownloadDirectory method.

The following is a simple example usage of the DownloadDirectory method.  


public class S3Downloader
{

   public void DownloadS3Directory(string bucketName, string s3Directory, 
                                   string localDirectory)
   {
      var s3Config = new AmazonS3Config
      {
         ServiceURL = "s3-us-west-2.amazonaws.com",
         CommunicationProtocol = Protocol.HTTP
      };

      using (var s3Client = new AmazonS3Client(
                                   new EnvironmentAWSCredentials(), 
                                   s3Config))
      {
         using (var transferUtility = new TransferUtility(s3Client))
         {
            var ddr = new TransferUtilityDownloadDirectoryRequest
            {
               BucketName = bucketName,
               LocalDirectory = localDirectory,
               S3Directory = s3Directory
            };

            ddr.DownloadedDirectoryProgressEvent += DisplayProgress;
            transferUtility.DownloadDirectory(ddr);                
         }
      }
   }

   private void DisplayProgress(object sender, 
                                DownloadDirectoryProgressArgs args)
   {
      Console.WriteLine(args);
   }
}

public class Program
{
   public static void Main(string[] args)
   {
      string bucketName = "mybucket";
      string s3Directory = "/archived/files/2013-07";
      string localDirectory = @"C:\Temp\s3test";

      var s3Downloader = new S3Downloader();
      s3Downloader.DownloadS3Directory(bucketName, 
                                       s3Directory, 
                                       localDirectory);
   }
}

Here is an example of what is written to the console:

Total Files: 14, Downloaded Files 0, Total Bytes: 57390654, Transferred Bytes: 8192
Total Files: 14, Downloaded Files 0, Total Bytes: 57390654, Transferred Bytes: 16384
Total Files: 14, Downloaded Files 0, Total Bytes: 57390654, Transferred Bytes: 24576
Total Files: 14, Downloaded Files 0, Total Bytes: 57390654, Transferred Bytes: 32768
Total Files: 14, Downloaded Files 0, Total Bytes: 57390654, Transferred Bytes: 40960
Total Files: 14, Downloaded Files 0, Total Bytes: 57390654, Transferred Bytes: 49152

Thursday, July 4, 2013

Line endings and views, although unrelated, can both cause confusion...

Line Endings

Boy how I wish there weren't a difference between Unix and Windows line endings. I recently needed to search through a bunch of map reduce output (tons of reduce files in text format), so I decided to bulk insert them into a SQL database table.

I wrote a bulk insert statement for each of the reduce files (well, some SQL code wrote the bulk insert statements based on a list of file names), and I quickly encountered errors when attempting to execute the statements. The bulk insert statements looked something like this:

BULK INSERT ReduceData_View  FROM 'path\part-r-00000' 
WITH (FIRSTROW = 1,FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');

The reduce files are rows of data that are tab separated. The frustrating part is that I had failed to remember that the map reduce job was done on a Linux machine, and therefore each file contained Unix style line endings.  I had been viewing the files in Notepad++ (which will display different file formats without complaint), so it was confusing to me when I would see errors. I changed the ROWTERMINATOR value to '\r\n', and everything worked fine.  

Lesson - pay attention to the source system type when processing data files.  I didn't spend a long time on this, but it was still frustrating.
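
One way to avoid guessing is to check which line terminators a file actually contains before writing the load statement. The following is a minimal Java sketch of that check. The class and method names are my own, and the sample bytes in main are made up. For very large files you would probably only want to pass in the first few kilobytes rather than the whole file.

```java
import java.nio.charset.StandardCharsets;

public class LineEndingCheck {

    // Counts CRLF pairs and bare LF characters and reports which style the bytes use.
    public static String detect(byte[] content) {
        int crlf = 0;
        int lf = 0;
        for (int i = 0; i < content.length; i++) {
            if (content[i] == '\n') {
                if (i > 0 && content[i - 1] == '\r') {
                    crlf++;
                } else {
                    lf++;
                }
            }
        }
        if (crlf == 0 && lf == 0) {
            return "none";
        }
        if (crlf > 0 && lf > 0) {
            return "mixed";
        }
        return crlf > 0 ? "windows" : "unix";
    }

    public static void main(String[] args) {
        // Hypothetical two-row, tab separated sample with Unix line endings.
        byte[] sample = "a\tb\nc\td\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```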

SQL Views

I had created a view to insert into, because I had an identity column on my target table.  It seemed like an easy way to avoid the problem of bulk insert trying to insert into the identity column.  The problem occurred when I ran the SQL to do the inserts, and some of the columns from the source files had values that were too wide for the table columns.  I changed the target table columns to use varchar(max), since some of the source data had values that were larger than 4000 chars.  I ran the bulk insert again and I still had errors due to the data being too large for one of the target columns.  What the heck?  What I hadn't realized is that the view I created had information for the columns in the target table from when the target table was first created.  The table had the column width changed to varchar(max), but the view thought the table column width was varchar(4000).  I ran the SQL to recreate the view and then the bulk insert worked just fine.




Sunday, May 12, 2013

Paired programming as a way to share knowledge can sometimes be dangerous...

Paired programming as a way to share knowledge is a great idea - especially if it is for work on legacy code. It can help someone who needs to update code avoid wasting time trying to understand the structure of the code when a person familiar with the code can quickly give a tour and point out all the important bits. 

However, pairing as a way to share knowledge can sometimes be dangerous. I recently had a learning experience for how not to do paired programming.

I was given the task to update a web service that was created by a sibling development team. The story was split into multiple tasks. 

I was paired with a person from the other dev team for one of the tasks. I'll call him Adam. We made our changes, and our unit tests gave coverage for what we were able to test. Everything worked as we expected. 

I was paired with another person for the next task. I'll call him Bob. This is where the paired programming started to transform into pear shaped programming.

Lesson 1: Strive for consistency, or at least a common vision.

Try to avoid pairing with multiple people on one story as a way to learn about a new system unless each person has a good idea of what the overall user story is covering.

The problem that occurred was that Adam and Bob only knew about the work in the specific task where they paired with me. There was a dependency on code from the first task where I paired with Adam that wasn't completely obvious when working on the code for the second task where I paired with Bob. All of the unit tests that we created for both tasks passed, and the bit of manual integration testing we did appeared to pass. However, there was a bit of code that needed to be updated when working on the task with Bob that was missed. This probably would have been noticed by Bob had he been more aware of what the changes were for the task I worked on with Adam. 


Lesson 2: Make your pairing partner accountable.

Make sure that the person you are pairing with attends your scrum.  

Bob treated the situation as though he was doing a bit of side work, and would pull other tasks to work on. We should have both been solely focused on the one task until it was accepted as done and ready to ship. Bob might have been more likely to stick with the task and treat it as work that has his name attached to it if he had attended our scrums. I should have said something, but I also treated the situation as though Bob was just helping out instead of being an equal partner.

The result of the misguided pairing was that we shipped a bug to production. We spotted the bug and were able to fix it before it could impact customers, but it took time from us being able to work on other tasks.   

Lesson 3: Do your homework. 

If your name is attached to some work, then make sure you understand the code you are touching well enough to explain it to someone else. Don't assume that the person you are pairing with is not going to miss some bit of code that needs to be updated just because they are familiar with the code.

I should have made sure that I knew how every bit of code worked that I was touching, and how the code was being called. If I had, then I would have caught the missing code change. Instead I accepted quick explanations from people already familiar with the code, and assumed that I was "learning" what was important. What I had done was basically the same as listening to a teacher talk about a topic, but not bothering to do any homework to make sure that I understood what was being said. 

Monday, May 6, 2013

Things I learned while using AWS SQS...

Updated 03-20-2017

Amazon's Simple Queue Service (SQS) provides an easy to use mechanism for sending and receiving messages between various applications/processes. Here are a few things that I learned while using the AWS Java SDK to use SQS.


SQS can be FIFO

It used to be that AWS SQS didn't guarantee FIFO ordering. Now you can create a standard queue or a FIFO queue. However, there are some differences between standard and FIFO queues that are worth being aware of. The differences can be read about here. Here are some of the key differences:


Standard Queues - available in all regions, nearly unlimited transactions per second, messages will be delivered at least once but might be delivered more than once, messages might be delivered out of order.

FIFO Queues - available in US West (Oregon) and US East (Ohio), 300 transactions per second, messages are delivered exactly once, order of messages is preserved (as the queue type suggests).

SQS Free Usage Tier

The SQS free usage tier is determined by the number of requests you make per month.  You can make up to 1 million requests per month for free.  The current fee is $0.50 per million requests after the first million requests. The cost is pretty low, but it would be easy to start racking up millions of requests. Luckily, there are batch operations that can be done, and each batch operation is considered one request.
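
To get a feel for how much batching helps, here is a small Java sketch of the arithmetic. The free tier size and the price come from the numbers above. The batch size of 10 and the monthly message volume are assumptions I picked for the example, and the sketch only counts send requests (receives and deletes would add to the total).

```java
public class SqsRequestCost {

    static final long FREE_REQUESTS = 1_000_000L;   // free requests per month
    static final double PRICE_PER_MILLION = 0.50;   // dollars per million requests after that

    // Number of requests needed to send the messages, rounding up for a partial batch.
    public static long requestsNeeded(long messages, int batchSize) {
        return (messages + batchSize - 1) / batchSize;
    }

    // Cost of the requests beyond the free tier.
    public static double monthlyCost(long requests) {
        long billable = Math.max(0L, requests - FREE_REQUESTS);
        return billable / 1_000_000.0 * PRICE_PER_MILLION;
    }

    public static void main(String[] args) {
        long messages = 50_000_000L; // hypothetical monthly volume
        int batchSize = 10;          // assumed number of messages per batch request

        System.out.printf("unbatched: $%.2f%n", monthlyCost(requestsNeeded(messages, 1)));
        System.out.printf("batched:   $%.2f%n", monthlyCost(requestsNeeded(messages, batchSize)));
    }
}
```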


Short Polling/Long Polling

You can set a time limit to wait when polling queues for messages. Short polling is when you make a request to receive messages without setting the ReceiveMessageWaitTimeSeconds property for the queue. Setting the ReceiveMessageWaitTimeSeconds property (20 seconds is the maximum wait time) will cause your call to wait up to that long for a message to appear on the queue before returning.  If there is a message on the queue, then the call will return immediately with the message.  The advantage to using long polling is that you will make fewer requests that return without any messages.


One thing to remember is that if you have only one thread being used to poll multiple queues, then you will have unnecessary wait times when only some of the queues have messages waiting.  A solution to that problem is to use one thread for each queue being polled.
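
Here is a rough Java sketch of the one-thread-per-queue idea. It does not use the AWS SDK. In-memory BlockingQueue instances stand in for SQS queues, and a timed poll() stands in for a long poll, so all of the names here are hypothetical. The sketch assumes at least one queue is passed in.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PerQueuePoller {

    // Starts one worker per queue. Each worker waits only on its own queue,
    // so an empty queue never delays messages sitting on a busy one.
    public static List<String> drain(Map<String, BlockingQueue<String>> queues, long waitMillis)
            throws InterruptedException {
        List<String> received = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(queues.size());
        for (Map.Entry<String, BlockingQueue<String>> entry : queues.entrySet()) {
            pool.submit(() -> {
                try {
                    String msg;
                    // poll() returns null when the wait expires without a message.
                    while ((msg = entry.getValue().poll(waitMillis, TimeUnit.MILLISECONDS)) != null) {
                        received.add(entry.getKey() + ":" + msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, BlockingQueue<String>> queues = new LinkedHashMap<>();
        queues.put("busy", new LinkedBlockingQueue<>(Arrays.asList("m1", "m2")));
        queues.put("idle", new LinkedBlockingQueue<>());
        System.out.println(drain(queues, 200));
    }
}
```

In this sketch each worker stops after its first empty wait so that the example terminates. A real poller would keep looping.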


Something that seemed a bit contradictory is that queues created through the web console have the ReceiveMessageWaitTimeSeconds set to 0 seconds (meaning it is going to use short polling). However, the FAQ mentions that the AWS SDK uses 20 second wait times by default. I created a queue using the AWS SDK, and the wait time was listed as 0 seconds in the web console. I shouldn't have to specifically set the wait time property to 20 seconds if the default wait time is 20 seconds.  Perhaps the documentation just hasn't been updated yet.


Message Size


The message size can be up to 256 KB in size. If you plan on using SQS as a way to manage a data process flow then you might want to consider how easy it is to reach the 256 KB limit.  Avoid putting data into the queue messages.  Instead, use the messages as notifications for work that needs to be done, and include information that identifies which data is ready to be processed. This is especially important to remember since the messages in the queue can be out of order, and you don't want to count on the data embedded in a message as being the latest version of the data. 
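
A message that only points at the data can be very small. Here is a minimal Java sketch of what such a notification might look like. The class, the field names, and the pipe-delimited body format are all made up for illustration. The consumer would use the bucket and key to fetch the current data itself instead of trusting a copy embedded in the message.

```java
import java.nio.charset.StandardCharsets;

public class WorkNotification {

    private final String bucket;
    private final String key;

    public WorkNotification(String bucket, String key) {
        this.bucket = bucket;
        this.key = key;
    }

    public String getBucket() { return bucket; }

    public String getKey() { return key; }

    // Serializes the pointer, not the data it points to.
    public String toMessageBody() {
        return bucket + "|" + key;
    }

    // Rebuilds the notification from a message body produced by toMessageBody().
    public static WorkNotification parse(String body) {
        String[] parts = body.split("\\|", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Unexpected message body: " + body);
        }
        return new WorkNotification(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        // Hypothetical bucket and key names.
        String body = new WorkNotification("mybucket", "TestData/newpath/p-00000").toMessageBody();
        int sizeInBytes = body.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(body + " (" + sizeInBytes + " bytes)");
    }
}
```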


Message TTL On Queues


Messages have a default life span of 4 days on queues, but can be set to be kept for 1 minute to 2 weeks. 


Amazon May Delete Unused Queues


Amazon's FAQ mentions that queues may be deleted if no activity has occurred for 30 days.


JARs Used By AWS Java SDK


There are certain jar files that you will need to reference when using the AWS Java SDK.  They are located in the SDK's "third-party" folder. Here are the jar files I referenced while using the SQS APIs:

  • third-party/commons-logging-1.1.1/commons-logging-1.1.1.jar
  • third-party/httpcomponents-client-4.1.1/httpclient-4.1.1.jar
  • third-party/httpcomponents-client-4.1.1/httpcore-4.1.jar

Elastic Load Balancers in AWS have a pretty confusing message...

I had an issue the other day with an AWS Elastic Load Balancer (ELB) that said the instances I had assigned to the load balancer were "Out of Service".  There was a link that was displayed as "(why?)", and it would display the hint text of "Instance is in stopped state."  This was particularly confusing, because the EC2 console displayed the instances as running.

It turns out that the problem was with the load balancer settings.  Doing a search revealed that the error message "Instance is in stopped state." is displayed when the health check fails.  In my case the health check ping target was pointing to the wrong location (a web page that didn't exist).

I wish that the AWS console would have listed a suggestion of "Please confirm that the health check ping target is correct." instead of just listing an invalid assumption that the instance was in a stopped state.  Or, have the "(why?)" anchor display a page of possible troubleshooting steps. One of the suggested steps could still mention the possibility that the instance is stopped.

In the end it was resolved somewhat quickly, but it could have been a lot less stressful if the information provided was more accurate and more helpful.

Tuesday, April 16, 2013

Command line chaining sure is nice...

Every now and then I will need to validate code changes that will update a bunch of data files (200 or so) where each data file is fairly large (over 100 MB).  I could use vi/vim to open one of the files to do searching, but that takes quite a while.  Most of my text editors won't handle such large files.  It just so happens that the files are tab delimited, and therefore easy to parse and read with awk and grep.

If the data looks like this (but with millions of permutations of something similar across a couple hundred files):

datavalue1<TAB>datavalue2<TAB>SpecialFieldValue1<TAB>1<TAB>datavalue3
datavalue4<TAB>datavalue5<TAB>SpecialFieldValue2<TAB>2<TAB>datavalue6
datavalue7<TAB>datavalue8<TAB>SpecialFieldValue1<TAB>3<TAB>datavalue9

And I want to only see the values for the fourth column for all rows that have the SpecialFieldValue2, then I will use a command similar to this:

grep -P '\tSpecialFieldValue2\t' * | awk -F'\t' '{print $4}' > SpecialFieldValue2_values.txt

The -P tells grep to use Perl style regular expressions, so I can use '\t' to represent tab characters. The * is the file name mask, so this will grep every file in the current directory. The -F'\t' tells awk to split fields on tabs only, so a data value that contains spaces still counts as a single column.

I can then look through the SpecialFieldValue2_values.txt file to see that the data is what I expected.


Friday, March 22, 2013

Regex in C# to get UTC date...

I recently needed to find a UTC date in a string of text, and I thought it might be handy to pull the various values from the dates that were found by getting the values from the returned Groups for each Match. Groups are identified in your regular expression string by surrounding sections with parentheses. The first Group of every match is the string value that the Match found. Every other group in the Match is identified by the parentheses going from left to right. So, if you have a regular expression that looks like this:
"((\d\d\d\d)-(\d\d)-(\d\d))T((\d\d):(\d\d):(\d\d))Z"
Imagine that someone used the above regular expression on the following string.
"This is a UTC date : 1999-12-31T23:59:59Z. Get ready to party!"
The very first group (after the matched date/time string) will be the entire date, because the first set of parentheses completely wraps the date portion of the string.
"1999-12-31"
The next group would be the year portion of the date, since the next set of parentheses completely wraps the year.
"1999"
That pattern is repeated for the rest of the regular expression string. If no parentheses (groupings) are specified, then there will only be the one group and it will contain the string that the regular expression matched. Here is an example of how to do this in code:
static void Main(string[] args)
{
    string input = "this\tis\ta test 2013-03-21T12:34:56Z\tand\tanother date\t2013-03-21T23:45:01Z";
    string regexString = @"((\d\d\d\d)-(\d\d)-(\d\d))T((\d\d):(\d\d):(\d\d))Z";
    TestRegex(input, regexString);
}

private static void TestRegex(string input, string regexString)
{
    int matchCount = 0;
    foreach (Match match in Regex.Matches(input, regexString))
    {                
        int groupCount = 0;
        foreach (Group group in match.Groups)
        {
            Console.WriteLine("Match {0}, Group {1} : {2}", 
                                matchCount, 
                                groupCount++, 
                                group.Value);    
        }
        matchCount++;
    }
}
Here is the output:
Match 0, Group 0 : 2013-03-21T12:34:56Z
Match 0, Group 1 : 2013-03-21
Match 0, Group 2 : 2013
Match 0, Group 3 : 03
Match 0, Group 4 : 21
Match 0, Group 5 : 12:34:56
Match 0, Group 6 : 12
Match 0, Group 7 : 34
Match 0, Group 8 : 56
Match 1, Group 0 : 2013-03-21T23:45:01Z
Match 1, Group 1 : 2013-03-21
Match 1, Group 2 : 2013
Match 1, Group 3 : 03
Match 1, Group 4 : 21
Match 1, Group 5 : 23:45:01
Match 1, Group 6 : 23
Match 1, Group 7 : 45
Match 1, Group 8 : 01
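
The same left-to-right numbering applies in other regex engines. As a rough sketch only (this is Java rather than the C# used above, and the class name RegexGroupsDemo and the method describeGroups are made up for illustration), here is the earlier "Get ready to party!" example run through java.util.regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroupsDemo {

    // Same pattern as in the post. Groups are numbered by the position of
    // their opening parenthesis, counting from left to right.
    static final Pattern UTC_DATE =
            Pattern.compile("((\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d))T((\\d\\d):(\\d\\d):(\\d\\d))Z");

    // Returns one formatted line per group for every match found in the input.
    static List<String> describeGroups(String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = UTC_DATE.matcher(input);
        int matchCount = 0;
        while (matcher.find()) {
            // group(0) is the whole match; groups 1..groupCount() are the parenthesized parts
            for (int g = 0; g <= matcher.groupCount(); g++) {
                lines.add(String.format("Match %d, Group %d : %s", matchCount, g, matcher.group(g)));
            }
            matchCount++;
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "This is a UTC date : 1999-12-31T23:59:59Z. Get ready to party!";
        describeGroups(input).forEach(System.out::println);
    }
}
```

Group 1 comes back as "1999-12-31" and group 2 as "1999", matching the walkthrough above. Note that Java's groupCount() does not count group 0, which is why the loop runs up to and including groupCount().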

Thursday, March 21, 2013

The return of Super Sed and Wonder Awk...

I needed to compare the tab-separated data of a file (file A) to expected data (file B).  However, the contents of file A contained the processing date in each line of output in the file.

To handle the issue of non-matching dates, the "processing date" in file B was updated to be the string "PROCESSING_DATE".  That just left the date in file A to contend with.

Here is where sed and awk came to the rescue.  I used head -n1 to get the first line of file A, and used awk to get the processing date (which appeared in the 11th column).  The processing date was stored in a variable named target_date. Next, I used sed to do a replacement on all instances of target_date in file A.  After which I was able to do a diff on the two files to see if the output was as expected.

Here is how it looked in the shell script:

# get the target date (11th column of the first line)
target_date=$(head -n1 fileA.txt | awk '{print $11}')
# build the sed expression using the target date
sed_args="s/$target_date/PROCESSING_DATE/g"

# do an in-place replacement on the target file
sed -i "$sed_args" fileA.txt

# check for differences
diffs=$(diff fileA.txt fileB.txt)

# the quotes matter: unquoted, an empty $diffs makes the test always succeed
if [ -n "$diffs" ]; then
    echo "There were differences between the files."
else
    echo "No differences were found."
fi
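
For comparison, the same normalize-then-compare idea can be sketched outside the shell. This is only an illustration in Java, not what I ran: the class name ProcessingDateNormalizer, the method normalize, and the two-column sample data are all made up, and the real file used the 11th column.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ProcessingDateNormalizer {

    // Reads the value in the given 1-based, tab-separated column of the first line
    // and replaces every literal occurrence of it, in every line, with the placeholder.
    static List<String> normalize(List<String> lines, int column, String placeholder) {
        if (lines.isEmpty()) {
            return lines;
        }
        String[] fields = lines.get(0).split("\t", -1);
        String targetDate = fields[column - 1];
        if (targetDate.isEmpty()) {
            return lines; // nothing sensible to replace
        }
        return lines.stream()
                .map(line -> line.replace(targetDate, placeholder))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data with the date in column 2
        List<String> fileA = Arrays.asList("a\t2013-03-21", "b\t2013-03-21");
        List<String> fileB = Arrays.asList("a\tPROCESSING_DATE", "b\tPROCESSING_DATE");

        List<String> normalized = normalize(fileA, 2, "PROCESSING_DATE");
        System.out.println(normalized.equals(fileB)
                ? "No differences were found."
                : "There were differences between the files.");
    }
}
```

Unlike sed, String.replace treats the date as literal text, so characters that are special in a regular expression need no escaping.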

Wednesday, March 20, 2013

Blog title image disappeared...

I have no idea what happened, but my blog title image disappeared.  I'm not sure how long it was missing, but perhaps a couple of days. I ended up uploading the image to a website, and then using the URL to reference where the image was located.  

I found some recent information from others where their images disappeared, but nothing that seemed identical to my situation. The other mentions of missing images all pointed to Picasa web albums being deleted, and that was why the images were missing. I didn't delete anything recently, so I don't think that was my issue. However, I assumed that when the layout control gives you an option to select an image from your computer that it was copying the image to whatever disk space is used to host your blog entries.  Perhaps that was a bad assumption.

Has anyone else experienced the problem of blogger title images disappearing?

Wednesday, March 13, 2013

Using Amazon's AWS S3 via the AWS .Net SDK

Amazon's AWS S3 (Simple Storage Service) is incredibly easy to use via the AWS .Net SDK, but depending on your usage of S3 you might have to pay. S3 has a free usage tier option, but the amount of space allowed for use is pretty small by today's standards (5GB). The upside is that even if you end up going outside of the parameters for the free usage tier it is still cheap to use.

Here is some information from Amazon regarding the free usage tier limits for S3:

  • 5 GB of Amazon S3 standard storage, 20,000 Get Requests, and 2,000 Put Requests
  • These free tiers are only available to existing AWS customers who have signed-up for Free Tier after October 20, 2010 and new AWS customers, and are available for 12 months following your AWS sign-up date. When your free usage expires or if your application use exceeds the free usage tiers, you simply pay standard, pay-as-you-go service rates (see each service page for full pricing details). Restrictions apply; see offer terms for more details.


Sign Up To Use AWS

You need to create an account in order to use the Amazon Web Services. Make sure you read the pricing for any service you use so you don't end up with surprise charges. In any case, go to http://aws.amazon.com/ to sign up for an account if you haven't done so already.

Install or Reference the AWS .Net SDK

To start using the AWS .Net SDK to access S3 you will want to either download the SDK from Amazon or use NuGet via Visual Studio. Start Visual Studio (this example is using Visual Studio 2010), and do the following to use NuGet to fetch the AWS SDK:


  • Select the menu item "Tools | Library Package Manager | Manage NuGet Packages For Solution..."
  • Type "AWS" in the "Search Online" search text box
  • Select "AWS SDK for .Net" and click the "Install" button
  • Click "OK" on the "Select Projects" dialog

Create a Project and Use the AWS S3 API

Create a project in Visual Studio, and add the following code:


string key = "theawskeythatyougetwhenyousignuptousetheapis";
string secretKey = "thesecretkeyyougetwhenyousignuptousetheapis";

// create an instance of the S3 TransferUtility using the API key, and the secret key
var tu = new TransferUtility(key, secretKey);

// try listing any buckets you might have
var response = tu.S3Client.ListBuckets();

foreach(var bucket in response.Buckets)
{
   Console.WriteLine("{0} - {1}", bucket.BucketName, bucket.CreationDate);

   // list any objects that might be in the buckets
   var objResponse = tu.S3Client.ListObjects(
      new ListObjectsRequest 
      {
         BucketName = bucket.BucketName   // list the current bucket, not always the first one
      }
   );

   foreach (var s3obj in objResponse.S3Objects)
   {
      Console.WriteLine("\t{0} - {1} - {2} - {3}", s3obj.ETag, s3obj.Key, s3obj.Size, s3obj.StorageClass);
   }
}

// create a new bucket
string bucketName = Guid.NewGuid().ToString();
var bucketResponse = tu.S3Client.PutBucket(new PutBucketRequest
   {
      BucketName = bucketName
   }
);

// add something to the new bucket
tu.S3Client.PutObject(new PutObjectRequest
   {
      BucketName = bucketName,
      AutoCloseStream = true,
      Key = "codecog.png",
      FilePath = "C:\\Temp\\codecog.png"
   }
);

// now list what is in the new bucket (which should only have the one item)
var bucketObjResponse = tu.S3Client.ListObjects(
   new ListObjectsRequest
   {
      BucketName = bucketName
   }
);

foreach (var s3obj in bucketObjResponse.S3Objects)
{
   Console.WriteLine("{0} - {1} - {2} - {3}", s3obj.ETag, s3obj.Key, s3obj.Size, s3obj.StorageClass);
}

Thursday, March 7, 2013

Micro ORM Review - FluentData

Who doesn't love tools that make your life easier? Make a database connection and populate an object in just a few lines of code and one config setting? That's what FluentData can offer. Sign me up! 

Some Features:
  • Supports a wide variety of RDBMSs - MS SQL Server, MS SQL Azure, Oracle, MySQL, SQLite, etc.
  • Auto map, or use custom mappers, for your POCOs (or dynamic type).
  • Use SQL, or SQL builders, to insert, update, or delete data.
  • Supports stored procedures.
  • Uses indexed or named parameters.
  • Supports paging.
  • Available as an assembly (download the DLL or use NuGet) and as a single source code file.
  • Supports transactions, multiple resultsets, custom return collections, etc.

Pros:
  • Setting up connection strings in a config file, and then passing the key value to a DbContext to establish a connection, is such an easy way to do things.  It made it very easy to have generic code point to various databases.  I'm sure that is the intent. Needing to declare a connection object, set its connection string, and then call its "Open" method seems undignified. :D It's really not that big of a deal, but I like that it seemed much more straightforward using FluentData.
  • It's very easy to start using FluentData to select, add, update, or delete data from your database.
  • It is easy to use stored procedures.
  • Populating objects from selects, or creating objects and using them to insert new data into your database is almost seamless.
  • Populating more complex objects from selects is fairly easy using custom mapper methods.
  • The exceptions that are thrown by FluentData are actually helpful. The contributors to/creators of FluentData have been very thoughtful in how they return error information.


Cons:
  • I had some slight difficulty setting a parameter for a SQL select when the parameter was used as part of a "like" for a varchar column.  The string value in the SQL looked like this: '@DbName%'.  I worked around the issue by changing the code to use this instead: '@DbName', and then set the value so that it included the %.

I originally thought that I couldn't automap when the resultsets return columns that don't map to properties of the objects (or are missing columns for properties in the target object) without using a custom mapping method. However, there is a way - you can call a method on the DB context to say that automapping failures should be ignored:

Important configurations
  • IgnoreIfAutoMapFails - Calling this prevents automapper from throwing an exception if a column cannot be mapped to a corresponding property due to a name mismatch.
Example Usage:

First, I created a MySQL database to use as a test. I created a database called ormtest, and then created a couple of tables for holding book information:

create table if not exists `authors` (
  `authorid` int not null auto_increment,
  `firstname` varchar(100) not null,
  `middlename` varchar(100),
  `lastname` varchar(100) not null,
  primary key (`authorid` asc)
);

create table if not exists `books` (
 `bookid` int not null auto_increment,
 `title` varchar(200) not null,
 `authorid` int,
 `isbn` varchar(30),
 primary key (`bookid` asc)
);

Next, I created a Visual Studio console app, added an application configuration file, and added a connection string for my database:


  
    
  


Then I created my entity types:

public class Author
{
 public int AuthorID { get; set; }
 public string FirstName { get; set; }
 public string MiddleName { get; set; }
 public string LastName { get; set; }

 public override string ToString()
 {
  if (string.IsNullOrEmpty(MiddleName))
  {
   return string.Format("{0} - {1} {2}", AuthorID, FirstName, LastName);
  }
  return string.Format("{0} - {1} {2} {3}", AuthorID, FirstName, MiddleName, LastName);
 }
}
public class Book
{
 public int BookID { get; set; }
 public string Title { get; set; }
 public string ISBN { get; set; }
 public Author Author { get; set; }

 public override string ToString()
 {
  if (Author != null)
  {
   return string.Format("{0} - {1} \n\t({2} - {3})", BookID, Title, ISBN, Author.ToString());
  }
  return string.Format("{0} - {1} \n\t({2})", BookID, Title, ISBN);
 }
}
I was then able to populate a list of books by selecting rows from the books table:
public static void PrintBooks()
{
 IDbContext dbcontext = new DbContext().ConnectionStringName("mysql-inventory", new MySqlProvider());
 const string sql = @"select b.bookid, b.title, b.isbn
         from books as b;";
   
 List<Book> books = dbcontext.Sql(sql).QueryMany<Book>();

 Console.WriteLine("Books");
 Console.WriteLine("------------------");
 foreach (Book book in books)
 {
  Console.WriteLine(book.ToString());
 }
}
Unfortunately, I wasn't able to select columns from the table that didn't have matching properties in the entity type (at least not without the IgnoreIfAutoMapFails setting mentioned above). Otherwise, you'll need to create a custom mapping method in order to select extra columns that don't map to any properties in the entity type. You can also use custom mapping methods to populate entity types that contain properties of other entity types.

Here is an example:
public static void PrintBooksWithAuthors()
{
 IDbContext dbcontext = new DbContext().ConnectionStringName("mysql-inventory", new MySqlProvider());

 const string sql = @"select b.bookid, b.title, b.isbn, b.authorid, a.firstname, a.middlename, a.lastname, a.authorid 
         from authors as a 
        inner join books as b 
        on b.authorid = a.authorid 
        order by b.title asc, a.lastname asc;";

 var books = new List<Book>();
 dbcontext.Sql(sql).QueryComplexMany<Book>(books, MapComplexBook);

 Console.WriteLine("Books with Authors");
 Console.WriteLine("------------------");
 foreach (Book book in books)
 {
  Console.WriteLine(book.ToString());
 }
}

private static void MapComplexBook(IList<Book> books, IDataReader reader)
{
 var book = new Book
 {
  BookID = reader.GetInt32("BookID"),
  Title = reader.GetString("Title"),
  ISBN = reader.GetString("ISBN"),
  Author = new Author
  {
   AuthorID = reader.GetInt32("AuthorID"),
   FirstName = reader.GetString("FirstName"),
   MiddleName = reader.GetString("MiddleName"),
   LastName = reader.GetString("LastName")
  }
 };
 books.Add(book);
}


And here is an example of an insert, update, and delete:
public static void InsertBook(string title, string ISBN)
{
 IDbContext dbcontext = new DbContext().ConnectionStringName("mysql-inventory", new MySqlProvider());

 Book book = new Book
 {
  Title = title,
  ISBN = ISBN
 };

 book.BookID = dbcontext.Insert("books")
         .Column("Title", book.Title)
         .Column("ISBN", book.ISBN)
         .ExecuteReturnLastId<int>();

 Console.WriteLine("Book ID : {0}", book.BookID);
 
}

public static void UpdateBook(Book book)
{
 IDbContext dbcontext = new DbContext().ConnectionStringName("mysql-inventory", new MySqlProvider());
 book.Title = string.Format("new - {0}", book.Title);

 int rowsAffected = dbcontext.Update("books")
        .Column("Title", book.Title)
        .Where("BookId", book.BookID)
        .Execute();

 Console.WriteLine("{0} rows updated.", rowsAffected);
}

public static void DeleteBook(Book book)
{
 IDbContext dbcontext = new DbContext().ConnectionStringName("mysql-inventory", new MySqlProvider());

 int rowsAffected = dbcontext.Delete("books")
        .Where("BookId", book.BookID)
        .Execute();

 Console.WriteLine("{0} rows deleted.", rowsAffected);
}


Summary:
FluentData has been fairly easy to use and there appears to be a way to accomplish whatever I want to do. If FluentData's documentation had more examples of how to populate entity types (POCOs), then it would have saved me a little bit of time. As it is, the documentation listed multiple ways to accomplish tasks, so it never took long to find a method that would work.

Thursday, February 28, 2013

sed to the rescue...

I had one of those dreaded tasks at work today - update a bazillion (okay, actually just 24, but it felt like a lot) servers with a config file that needs to be modified to reference the name of each of the servers. I either had to update each server in turn, or I could write a script to do it for me.  

I decided to write a script, and I will pretend that it took less time to write and run the script than doing it manually.  It wasn't perfect - I had to copy/paste the password for each server because I didn't want to mess around with setting up SSH keys using ssh-keygen. I also didn't want to install software like expect.  Expect is a software utility that allows you to write scripts that provide interactive programs with the parameters they might...well...expect.  In this case it would have been handy to use since it could provide scp with the user's password. The script I wrote did what I needed it to, and I'm sure I'll be able to use something like this script in the future.

Assuming that the source file is on server01, and the source file to be updated and copied is named somefile.conf, then this is what the script looked like:

#!/bin/sh

servers="server02 server03 server04 server05 server06 server07 server08 server09 server10 server11 server12 server13 server14 server15 server16 server17 server18 server19 server20 server21 server22 server23 server24"
filepath="/some/file/path/somefile.conf"

for server in $servers
do
      echo "Working on $server."
      SEDARG="s/server01/$server/g"
      sed "$SEDARG" "$filepath" > "$filepath.$server"
      scp "$filepath.$server" "someuser@$server:$filepath"
      rm "$filepath.$server"
done
echo "Done."

Tuesday, February 19, 2013

Solr - HTMLStripCharFilter...

I am attempting to store a bit of data that I fetch from a website in Solr.  The data sometimes has HTML markup, so I decided to use the HTMLStripCharFilterFactory in the field type's analyzer.

Here is an example of the field type that I created:

<fieldType name="strippedHtml" class="solr.TextField">
   <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>      
      <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
</fieldType>


I used the field type of strippedHtml in a field called itemDescription, and when I do a search after indexing some data I can see that the itemDescription contains data that still has HTML markup.  I used the analyzer tab in Solr to see what would happen on index of HTML data, and I could see that none of the markup appears to be stripped out.

It turns out that most of the HTML was encoded, so the angle brackets had been replaced with their escaped values (&lt; and &gt;). Since the markup no longer looks like tags, the filter leaves it alone.  I will need to find a way to remove the escaped markup.
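
One possible approach, sketched here under assumptions rather than tested against my Solr setup, is to clean the text in the code that fetches it, before it is sent to Solr. The class name EscapedHtmlCleaner and the method stripEscapedHtml are hypothetical, only a handful of entities are handled, and the tag-stripping regex is deliberately naive:

```java
class EscapedHtmlCleaner {

    // Unescapes the most common HTML entities, then removes anything that looks
    // like a tag. "&amp;" is handled last so that text such as "&amp;lt;" is not
    // unescaped twice.
    static String stripEscapedHtml(String text) {
        String unescaped = text
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return unescaped
                .replaceAll("<[^>]*>", " ")   // naive: drops anything between < and >
                .replaceAll("\\s+", " ")      // collapse the whitespace left behind
                .trim();
    }

    public static void main(String[] args) {
        String raw = "&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;";
        System.out.println(stripEscapedHtml(raw)); // prints "Hello world"
    }
}
```

Another option would be to do only the unescaping step in code and leave the tag stripping to the HTMLStripCharFilterFactory, since the markup would then look like regular tags again.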

Friday, February 8, 2013

Code puzzler on DZone and optimizations...

Something that I've really enjoyed about DZone.com is that they have recurring themes for certain posts. One recurring set of posts is the Thursday Code Puzzler. Now and then there will be a really interesting solution. For example, one puzzler was to count the number of times the digit 1 occurs in a set of integers, i.e., {1, 2, 11, 14} = 4, since the digit 1 occurs 4 times in that set. One of the solutions used map reduce to come up with the answer. I thought that was particularly neat.

I decided to try out the recent code puzzler for finding the largest palindrome in a string. Here was my first method:

public static int getLargestPalindrome(String palindromes) {

        String[] palindromeArray = palindromes.split(" ");
        int largestPalindrome = -1;

        for(String potentialPalindrome : palindromeArray) {
            if (potentialPalindrome.equals(new StringBuffer(potentialPalindrome).reverse().toString()) && potentialPalindrome.length() > largestPalindrome)
                largestPalindrome = potentialPalindrome.length();
        }

        return largestPalindrome;

    }

It works, but it felt like cheating to use the StringBuffer.reverse() method. Here is my second method:
    public static int getLargestPalindrome(String palindromes) {

        int largestPalindrome = -1;

        for(String potentialPalindrome : palindromes.split(" ")) {
            int start = 0;
            int end = potentialPalindrome.length() - 1;
            boolean isPalindrome = true;

            while (start <= end) {
                if (potentialPalindrome.charAt(start++) != potentialPalindrome.charAt(end--)) {
                    isPalindrome = false;
                    break;
                }
            }

            if (isPalindrome && potentialPalindrome.length() > largestPalindrome) {
                largestPalindrome = potentialPalindrome.length();
            }

        }

        return largestPalindrome;

    }

It works faster than the first method. I'm sure the performance improvement is mainly due to the first method's creation of new StringBuffer objects (and also new Strings for the reversed value) for each potential palindrome, but accessing locations in an array for half the length (worst case in 2nd version) is bound to be less work than (worst case in 1st version) comparing every character in a string to another string.

Saturday, February 2, 2013

Thread pool example using Java and ExecutorService...

Thread pools are very easy to implement using Java's ExecutorService. The ExecutorService interface lets you submit asynchronous tasks, and the Executors factory methods let you specify how many threads should process them. Here is an example of an ExecutorService being created, where numThreads is an integer specifying the number of threads to create for the thread pool:
ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
You pass a Runnable to the ExecutorService's execute method like this:
executorService.execute(someRunnableObject);
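
Putting those two calls together, a minimal self-contained sketch might look like the following. The class name FixedPoolDemo, the method runTasks, and the 10 second timeout are made up for illustration; the tasks just bump a counter.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FixedPoolDemo {

    // Runs taskCount trivial tasks on a pool of numThreads threads and
    // returns how many of them completed before the timeout.
    static int runTasks(int numThreads, int taskCount) {
        ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < taskCount; i++) {
            // each Runnable is queued and picked up by the next free thread
            executorService.execute(completed::incrementAndGet);
        }

        // stop accepting new tasks, then wait for the queued ones to finish
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(10, TimeUnit.SECONDS)) {
                executorService.shutdownNow(); // give up on anything still running
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("Completed tasks: " + runTasks(4, 20));
    }
}
```

The number of tasks can be much larger than the number of threads; the pool simply works through the queue with the threads it has.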
I created a sample method that uses an ExecutorService to simulate working on text files. It will move files from the source path to an archive path unless there is a "lock" file found. The "lock" file is an empty file that has the same name as the file being "worked" on, plus a ".lock" extension. The lock file is used to ensure that multiple threads don't attempt to work on the same file. I made this sample because I figured this might be a nice way to handle indexing data in csv files to a Solr server. Here is the method that does the work (which would be very poorly named if it weren't sample code):
public static void DoWorkOnFiles(String sourcePath, String archivePath, int numThreads) throws IOException {

    Random random = new Random();
    ExecutorService executorService = Executors.newFixedThreadPool(numThreads);

    File sourceFilePath = new File(sourcePath);
    if (sourceFilePath.exists()) {
        Collection<java.io.File> sourceFiles = FileUtils.listFiles(sourceFilePath, new String[]{"txt"}, false);

        for (File sourceFile : sourceFiles) {
            File lockFile = new File(sourceFile.getPath() + ".lock");
            if (!lockFile.exists()) {
                executorService.execute(new SampleFileWorker(sourceFile.getPath(), archivePath, random.nextInt(10000)));
            }
        }
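        // shutdown() makes the executor accept no new tasks,
        // while the tasks that were already submitted keep running
        executorService.shutdown();
        try {
            // wait up to 10 seconds; awaitTermination returns false on timeout
            if (executorService.awaitTermination(10000, TimeUnit.MILLISECONDS)) {
                System.out.printf("%nFinished all threads.%n");
            } else {
                System.out.printf("%nTimed out waiting for threads to finish.%n");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }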
    }
    else {
        System.out.printf("%s doesn't exist. No work to do.%n", sourceFilePath);
    }
}
The SampleFileWorker class looks like this:
import org.apache.commons.io.FileUtils;
import java.io.File;
import java.io.IOException;

public class SampleFileWorker implements Runnable {

    private final String sourcePath;
    private final String archivePath;
    private final int testDelay;

    public SampleFileWorker(String sourcePath, String archivePath, int testDelay) {
        this.sourcePath = sourcePath;
        this.archivePath = archivePath;
        this.testDelay = testDelay;
    }

    @Override
    public void run() {

        try {
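            File lockFile = new File(sourcePath + ".lock");
            // createNewFile() atomically creates the file and returns false
            // if it already exists, so only one thread can claim the lock
            if (!lockFile.createNewFile()) {
                return;
            }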

            File sourceFile = new File(sourcePath);
            String archiveFilePath = archivePath.concat(File.separator + sourceFile.getName());
            File archiveFile = new File(archiveFilePath);

            System.out.printf("Simulating work on file %s.%n", sourcePath);
            System.out.printf("Starting: %s%n", sourcePath);
            System.out.printf("Delay:    %s%n", testDelay);

            try {
                Thread.sleep(testDelay);
            } catch (InterruptedException ignored) {
            }

            System.out.printf("Done with: %s%n", sourcePath);
            System.out.printf("Archiving %s to %s.%n", sourceFile, archivePath);

            FileUtils.moveFile(sourceFile, archiveFile);
            sourceFile.delete();
            lockFile.delete();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
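
One thing the worker above doesn't handle is cleaning up the lock file if the work throws an exception. A rough sketch of one way to structure that is shown below; the class name LockFileDemo and the methods tryLock, unlock, and runWithLock are hypothetical helpers, not part of the sample code. It relies on File.createNewFile() creating the file only if it doesn't already exist, and it uses try/finally so the lock is always released.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

class LockFileDemo {

    private static File lockFileFor(File sourceFile) {
        return new File(sourceFile.getPath() + ".lock");
    }

    // Returns true only for the caller that actually created the lock file.
    static boolean tryLock(File sourceFile) {
        try {
            return lockFileFor(sourceFile).createNewFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void unlock(File sourceFile) {
        lockFileFor(sourceFile).delete();
    }

    // Runs the work only if the lock could be claimed, and always releases it.
    static boolean runWithLock(File sourceFile, Runnable work) {
        if (!tryLock(sourceFile)) {
            return false; // someone else is working on this file
        }
        try {
            work.run();
            return true;
        } finally {
            unlock(sourceFile);
        }
    }

    public static void main(String[] args) {
        File source = new File(System.getProperty("java.io.tmpdir"), "lockdemo-sample.txt");
        boolean ran = runWithLock(source, () -> System.out.println("Working on " + source.getName()));
        System.out.println("Work ran: " + ran);
    }
}
```

The source file doesn't need to exist for the demo to run; only the directory that will hold the lock file does.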

Monday, January 28, 2013

Follow-up for Connection Pooling...

I had mentioned in one of my recent posts that I was told we are not using connection pooling in our C# web service code. I had done some reading, and it seemed pretty clear that connection pooling is enabled by default in the .Net framework, and that you would need to explicitly disable pooling.

I looked through our code and I couldn't find any instance where we set pooling to false for our connection strings. I also wrote some simple code to test if the classes we are using (from Microsoft.Practices.EnterpriseLibrary.Data) might do something odd that would prevent connection pooling, or require us to explicitly state that we want to use connection pooling.

The result of my test code was that the code where I explicitly set "pooling=false" took a longer time to execute than the code without "pooling=false". That shouldn't be surprising, but I was told by multiple people that the code was not using connection pooling. It is understandable why people thought the code was not using pooling, though. For one, there is sometimes a feeling that the quality of a design degrades as code gets old. Also, there are lots of coding styles and myths that get passed around as fact. This has been a concrete example for me that everyone should try things out before just accepting something as a fact when it comes to writing software. In addition, it pays to read the documentation!

The test code looks like this:

private const string pooled =
    @"Data Source=someserver;Initial Catalog=dbname;Integrated Security=SSPI;";
private const string notpooled =
    @"Data Source=someserver;Initial Catalog=dbname;Integrated Security=SSPI;Pooling=false;";

public static void Main(string[] args)
{
    int testCount = 10000;
    
    GenericDatabase db = new GenericDatabase(pooled, SqlClientFactory.Instance);
    Console.WriteLine("    pooled : {0}", RunConnectionTest(db, testCount));

    db = new GenericDatabase(notpooled, SqlClientFactory.Instance);
    Console.WriteLine("not pooled : {0}", RunConnectionTest(db, testCount));
}

private static TimeSpan RunConnectionTest(GenericDatabase db, int testCount)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    DbConnection conn;
    for (int i = 0; i < testCount; i++)
    {
        conn = db.CreateConnection();
        conn.Open();
        var schema = conn.GetSchema(); // just to make the code do something
        conn.Close();
    }
    sw.Stop();
    return sw.Elapsed;
}


I specifically did not use using statements for the connection. I wanted to mimic what the service code is doing.

The result of the test was that the loop using the connection string that implicitly uses connection pooling finished in just over 1 second. The loop for the connection string that explicitly says to not use connection pooling took about 30 seconds.

I'm pretty happy to know that we don't need to do any updates to our code to make it use connection pooling!