Tuesday, April 16, 2013

Command line chaining sure is nice...

Every now and then I need to validate code changes that update a bunch of data files (200 or so), each of which is fairly large (over 100 MB). I could open one of the files in vi/vim to search through it, but that takes quite a while, and most of my other text editors won't handle files that size at all. It just so happens that the files are tab delimited, which makes them easy to parse and read with awk and grep.
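For example, to get a feel for one of these files without opening an editor at all, a quick peek with head and awk does the trick (somefile.dat is just a placeholder name here):

head -1 somefile.dat | awk -F'\t' '{print NF}'

That prints the number of tab-separated columns in the first row, which is a handy way to confirm the layout before writing anything fancier.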

If the data looks like this (but with millions of permutations of something similar across a couple hundred files):

datavalue1<TAB>datavalue2<TAB>SpecialFieldValue1<TAB>1<TAB>datavalue3
datavalue4<TAB>datavalue5<TAB>SpecialFieldValue2<TAB>2<TAB>datavalue6
datavalue7<TAB>datavalue8<TAB>SpecialFieldValue1<TAB>3<TAB>datavalue9

And I want to see only the values in the fourth column for every row that contains SpecialFieldValue2, then I will use a command similar to this:

grep -P '\tSpecialFieldValue2\t' * | awk '{print $4}' > SpecialFieldValue2_values.txt

The -P tells grep to use Perl-style regular expressions, so I can use '\t' to represent tab characters. The * is the filename mask, so this will grep every file in the current directory. Because grep is searching multiple files, each matching line comes back prefixed with the filename and a colon; that prefix sticks to the first whitespace-separated field, so awk's $4 is still the fourth data column, and the results get redirected into SpecialFieldValue2_values.txt.
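If I wanted to skip grep entirely, roughly the same filter can be written in awk alone, assuming the special field is always in the third column like it is in the sample rows above:

awk -F'\t' '$3 == "SpecialFieldValue2" {print $4}' * > SpecialFieldValue2_values.txt

Here -F'\t' makes awk split strictly on tabs, and since awk is reading the files directly there is no filename prefix to deal with.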

I can then look through the SpecialFieldValue2_values.txt file to see that the data is what I expected.
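And for a quick sanity check on that output, counting the distinct values is usually enough to spot anything odd:

sort SpecialFieldValue2_values.txt | uniq -c | sort -rn | head

That shows the most common values and how many times each one appears, so an unexpected value tends to jump out right away.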