Thursday, August 27, 2015

Simple gremlin queries in Titan...

Titan is an open source graph database and, even though it isn't as easy to set up as Neo4J, it is easy enough to start using in just a few minutes. Here is an example of using Titan with HBase as the backend storage.

Setup

I'm using HBase 0.98.13. I downloaded HBase, untarred the files, changed to the HBase directory, and ran "bin/start-hbase.sh".

I cloned Titan from the GitHub repository and built it using the Maven package command.

git clone https://github.com/thinkaurelius/titan.git
cd titan
mvn clean package

I started gremlin using "bin/gremlin.sh".

I followed the Titan HBase instructions for initializing Titan for use with HBase from within the gremlin console. You can also put the same settings in a properties file and load that from the gremlin console (example below).

Gremlin

conf = new BaseConfiguration();
conf.setProperty("storage.backend","hbase");
conf.setProperty("storage.hbase.table", "test") 

g = TitanFactory.open(conf);
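
As an alternative to building the configuration inline, the same settings can go in a properties file (the path below is just an example) and be passed to TitanFactory.open:

# conf/titan-hbase.properties
storage.backend=hbase
storage.hbase.table=test

g = TitanFactory.open('conf/titan-hbase.properties')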

Now let's add some vertices.

alice = g.addVertexWithLabel('human')
alice.setProperty('name', 'alice')
alice.setProperty('age',25)
bob = g.addVertexWithLabel('human')
bob.setProperty('name', 'bob')
bob.setProperty('age',21)
clark = g.addVertexWithLabel('human')
clark.setProperty('name', 'clark')
clark.setProperty('age',93)
darwin = g.addVertexWithLabel('human')
darwin.setProperty('name', 'darwin')
darwin.setProperty('age',206)
ernie = g.addVertexWithLabel('android')
ernie.setProperty('name', 'ernie')


Let's list the vertices and their properties.

g.V().map()
==>{name=ernie}
==>{name=alice, age=25}
==>{name=darwin, age=206}
==>{name=clark, age=93}
==>{name=bob, age=21}

And now let's make these humans friends with each other.

alice.addEdge('friend', bob)
alice.addEdge('friend', darwin)
bob.addEdge('friend', alice)
bob.addEdge('friend', darwin)
clark.addEdge('friend', darwin)
darwin.addEdge('friend',alice)
darwin.addEdge('friend', bob)
darwin.addEdge('friend', clark)



Now let's remove ernie from the graph.

g.V.has('name', 'ernie').remove()
==>null

Now we can see that ernie is gone.

g.V.has('name', 'ernie').map()

(no results displayed, just the gremlin prompt)

Let's add ernie back, but this time he's a human.

ernie = g.addVertexWithLabel('human')
ernie.setProperty('name', 'ernie')



Let's try finding out who has friends.

g.V().outE('friend').outV().name
==>darwin
==>darwin
==>darwin
==>alice
==>alice
==>bob
==>bob
==>clark


Wait - what happened? We see an entry for every friend edge, which is exactly what our gremlin query was asking for, but that doesn't look very nice.

Let's try the dedup method.

g.V().outE('friend').outV().dedup().name
==>darwin
==>alice
==>bob
==>clark


Ahh! That's more like it! But how else can we get that list?

g.V.filter{it.outE('friend').hasNext()}.toList()._().name
==>darwin
==>alice
==>bob
==>clark


Nice! We have two ways to get a distinct list.

Friday, August 14, 2015

Add and remove fields from Solr schema using Schema API...

Okay - I know what you're thinking. "How can I quickly update my Solr schema without having to go into the schema config file - perhaps using a REST API?" You can use the Solr Schema API, that's how!

I recently noticed that there is a Schema API in Solr 5.x that can be used to update the Solr schema. You need to have the schemaFactory set to "ManagedIndexSchemaFactory" and the mutable property set to true. If you want to stop allowing schema updates via the API, you can change the mutable property back to false.
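
In solrconfig.xml that typically looks something like the following ("managed-schema" is just the default resource name):

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>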

Here are a few of the things that you can do with the schema API:

View the schema for a collection:

http://localhost:8983/solr/yourcollectionname/schema

View all of the fields in the schema:

http://localhost:8983/solr/yourcollectionname/schema/fields

Example output:

{
  "responseHeader":{
    "status":0,
    "QTime":101
  },
  "fields":[
  {
      "name":"_text_",
      "type":"text_general",
      "multiValued":true,
      "indexed":true,
      "stored":false},
    {
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"somefieldname",
      "type":"lowercase",
      "indexed":true,
      "stored":true},
    {
      "name":"title",
      "type":"strings"
    }
  ]
}


View a specific field in the schema:

http://localhost:8983/solr/yourcollectionname/schema/fields/somefieldname

Example output:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "field":{
    "name":"somefieldname",
    "type":"lowercase",
    "indexed":false,
    "stored":true
  }
}

Now add a new field called "anotherfield" that is of type "text_en", stored, and indexed:

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"anotherfield","type":"text_en","stored":true,"indexed":true }}' http://localhost:8983/solr/yourcollectionname/schema


Now let's see that the field exists:

http://localhost:8983/solr/yourcollectionname/schema/fields/anotherfield

{
  "responseHeader":
  {
    "status":0,
    "QTime":1
  },
  "field":
  {
    "name":"anotherfield",
    "type":"text_en",
    "indexed":true,
    "stored":true
    }
}

Now let's delete the field:

curl -X POST -H 'Content-type:application/json' --data-binary '{"delete-field" : { "name":"anotherfield" }}' http://localhost:8983/solr/yourcollectionname/schema

And check to see that it is deleted:

http://localhost:8983/solr/yourcollectionname/schema/fields/anotherfield

{
  "responseHeader":
  {
    "status":404,
    "QTime":2
  },
  "error":
  {
    "msg":"Field 'anotherfield' not found.",
    "code":404
  }
}

There are other actions that you can perform with the Schema API. Here are a few of them:
- replace a field (see the example below)
- add and remove dynamic field patterns
- view dynamic fields
- add and remove field types
- view field types
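
For example, replacing a field definition uses the same pattern as adding one - here the "anotherfield" example from above is redefined so that it is no longer indexed:

curl -X POST -H 'Content-type:application/json' --data-binary '{"replace-field":{"name":"anotherfield","type":"text_en","stored":true,"indexed":false}}' http://localhost:8983/solr/yourcollectionname/schema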

Wednesday, August 12, 2015

Kill process using a specific port...

I've been debugging an app that sometimes gets into an unresponsive state. The app uses the Play framework, and I've been starting activator with the -jvm-debug option (using port 9999). Sometimes when I try to terminate activator in the terminal, CTRL+C is ignored. I had been using lsof -i :9999 to find the pid to kill so that I could make changes and restart activator.

A quick search led me to this posting on askubuntu.com, and it shows that you can kill a list of pids returned by lsof by using this form:

kill -9 $(lsof -ti :9999)

The "t" tells lsof to print terse info, and the info that is printed is just the pid.

Tuesday, August 11, 2015

Check for processes across multiple machines...

My group at work needed to know whether certain processes were running on a set of machines, and we didn't want to have to check manually. The processes run on Linux machines, and we use Splunk to capture environment data, so I created a simple script that does the check and writes the results to a text file that Splunk consumes.

The minor issues I had were:

1. I needed to run the script as a specific user, because it uses ssh as that user to reach the other machines.

Solution: Create a crontab entry for that user, i.e., crontab -u myuser -e

Then add the crontab entry:

*/15 * * * * /opt/myscripts/checkForProcesses.sh

2. I wanted to check for multiple processes, but not all in one command string, so I created a separate "check" string for each unique process. The issue I had was that the check string was interpreted as multiple arguments: "ps x | grep stuff | grep -v grep" was treated as "ps", "x", etc.

Solution: I passed the check string in as the last argument to the function. If the check string starts at the 3rd argument, then I can use it like this: ${@:3}

The @ means all of the positional parameters, and the 3 says to start at the 3rd one.
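
Here is a tiny, hypothetical illustration of that expansion:

demo()
{
    # ${@:3} expands to every argument from the 3rd one onward
    echo "${@:3}"
}

demo one two three four five   # prints: three four five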


Here is a short version of the script with all details stripped out:

#!/bin/bash
# (bash is needed for the ${@:3} expansion used in checkStatus)

SCRIPT_DIR=$(dirname $0)
LOG_DIR=/opt/logs
STATUS_FILE=/opt/logs/status/process_status.txt

PROC1_CHECK="ps x | grep myproc1 | grep -v grep | grep -v less | grep java"
PROC2_CHECK="ps x | grep myproc2 | grep -v grep | grep -v less | grep java"
WORKER_LIST=$(cat $SCRIPT_DIR/worker.list)

# get the date/time for Splunk
DATE_VAL=`date`

rm -f $STATUS_FILE

checkStatus()
{
    echo "Checking $1 for $2."
    STATUS=`ssh myuser@$1 ${@:3}`
    if [ "" == "$STATUS" ];
    then
        echo "$DATE_VAL : ERROR : $2 not running on $1." >> $STATUS_FILE
    else
        echo "$DATE_VAL : STATUS : $2 running on $1." >> $STATUS_FILE
    fi

}

#worker.list is a set of machine names to check for certain processes
for worker in $WORKER_LIST; do
    echo "Worker being checked is $worker"
    checkStatus $worker "MyProc1" $PROC1_CHECK
    checkStatus $worker "MyProc2" $PROC2_CHECK
done

Wednesday, August 5, 2015

Shell script fun - pull, compile, and display errors...

I am working on a project at work that has multiple project dependencies, with people in other time zones constantly updating those projects, so it is helpful to stay in sync and avoid merge issues.

I wrote a shell script to pull and build the projects. We are using Maven for our projects, so a nice side effect is that the Maven build outputs "SUCCESS" or "FAILURE" for each build. I just grep for "FAILURE" (or "error"), pipe the result to a text file, and then check whether each text file is empty. If the file has data, I print a build error message and cat the file in red; if it is empty, I just delete it.

Here is a simple version of the script:

#!/bin/bash
source ~/.bash_profile

export WORK_HOME=~/dev/source

red=`tput setaf 1`
green=`tput setaf 2`
reset=`tput sgr0`
#echo "${red}red text ${green}green text${reset}"

grep_file()
{
  if [ -s "$WORK_HOME/$2" ]
  then
    echo "${red}******** $1 had errors!${reset}"
    echo ""
    echo "${red}"
    cat $WORK_HOME/$2
    echo "${reset}"
  else
    rm -f $WORK_HOME/$2
  fi
}

do_build()
{
  echo ""
  echo "${green}Updating $2.${reset}"
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
  cd $WORK_HOME/$2
  git pull
  if [ "$1" == "mci" ]
  then
    mvn clean install 2>&1 | grep FAILURE | grep -iv failures | grep -v grep > $WORK_HOME/$3
  else
    mvn clean package 2>&1 | grep FAILURE | grep -iv failures | grep -v grep > $WORK_HOME/$3
  fi
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
}

do_activator_build()
{
  echo ""
  echo "${green}Running activator build for $1.${reset}"
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
  cd $WORK_HOME/$1
  git pull
  activator update
  activator compile | grep -i error | grep -v grep > $WORK_HOME/$2
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
}

# project 1
do_build "mci" "project1" "proj1_failed.txt"

# project 2
do_activator_build "project2" "proj2_failed.txt"

echo ""
echo "Checking for failures..."
echo ""

grep_file "project1" "proj1_failed.txt"
grep_file "project2" "proj2_failed.txt"
echo "Done."


Here is the output:


Updating project1.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Already up-to-date.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Running activator build for project2.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Already up-to-date.
[info] Loading project definition from /Users/user/dev/source/project2
[info] Updating {file:/Users/user/dev/source/project2/}project2...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[success] Total time: 16 s, completed Aug 5, 2015 11:06:25 PM
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Checking for failures...

Done.

Thursday, July 23, 2015

One line shell command to kill processes that match a search string...

I had a bunch of Java processes running on a server, and I needed to kill all of them. It's easy enough to write a shell script to kill them, but I wanted to do it all in one command. It's a common enough task, so I did a search on Stack Overflow and found what I needed:

for pid in `ps x | grep java | grep -v grep | awk '{print $1}'` ; do kill -9 $pid ; done

The key is the command in the backticks. The shell executes that first and uses the results in the for loop, reading each result into the variable $pid.

Sunday, June 14, 2015

Simple Neo4J example...

I've been using a graph database at work named Titan (https://github.com/thinkaurelius/titan). It's open source and has some pretty nice features, such as the ability to choose between a number of different databases for storage (Cassandra, HBase, Berkeley DB, etc.) and to use Solr or Elasticsearch for external indexing of data. However, I wanted to try out Neo4J for a home project. Here are a few of the things that I learned.


1. Neo4J is incredibly easy to add to your project using Maven. Just add a dependency like this:

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.2.2</version>
</dependency>

2. I used Neo4J as an embedded database like this:

GraphDatabaseService graphDb;
graphDb = new GraphDatabaseFactory().newEmbeddedDatabase( DB_PATH );

3. The Neo4J documentation recommends using try-with-resources for transactions:

try (Transaction tx = graphDb.beginTx()) {
  populateDb(graphDb, someDataToAdd);
  tx.success();
} catch (Exception e) {
  e.printStackTrace();
}

4. Adding nodes and relationships is very straightforward:

// Add nodes using createNode, then 
// use setProperty for each property
someDataToAdd.forEach(d -> {
  Node node = graphDb.createNode();
  node.setProperty("name", d.getName());
  node.setProperty("email", d.getEmail());
  node.setProperty("id", d.getId());
  node.addLabel(DynamicLabel.label("Person"));
});
// Add relationships by calling createRelationshipTo
// on the source node ('node' below is a previously
// created or looked-up Node).
// Also, search for nodes by calling findNodes.
// RelTypes is a user-defined enum that implements RelationshipType.
Node friend = graphDb.findNodes(
    DynamicLabel.label("Person"), "id", id).next();
node.createRelationshipTo(friend, RelTypes.FRIEND);

5. Get all relationships and nodes by using GlobalGraphOperations (here used to delete everything in the graph - relationships first, then nodes):

GlobalGraphOperations.at(graphDb)
    .getAllRelationships().forEach(n -> n.delete());
GlobalGraphOperations.at(graphDb)
    .getAllNodes().forEach(n -> n.delete());

It's possible to query the graph using Cypher - I just found it easier and more intuitive to use the Java API.
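
For reference, here is a rough sketch of what an equivalent query could look like with Cypher through the embedded API (GraphDatabaseService.execute() is available as of Neo4J 2.2; the Person label and FRIEND relationship type are the ones used above):

// requires: java.util.Map, org.neo4j.graphdb.Result, org.neo4j.graphdb.Transaction
try (Transaction tx = graphDb.beginTx();
     Result result = graphDb.execute(
         "MATCH (p:Person)-[:FRIEND]->(f:Person) RETURN p.name AS person, f.name AS friend")) {
  while (result.hasNext()) {
    Map<String, Object> row = result.next();
    System.out.println(row.get("person") + " is friends with " + row.get("friend"));
  }
  tx.success();
}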


Monday, April 20, 2015

Vagrant + VirtualBox = Awesome

Have you ever wanted to create a test environment for some code, but you didn't have the time or patience to create VMware or VirtualBox instances for your target OS? Let Vagrant do the tedious work for you!

I first used Vagrant when going through Udacity's Apache Storm course. The course has you install Git Bash, Vagrant, and VirtualBox. It then has you clone a repo that includes a Vagrantfile pointing to their Vagrant box for the course. Running "vagrant up" in the directory with the Vagrantfile causes Vagrant to download the box (or image). Once the box is downloaded, "vagrant up" starts it in VirtualBox, and you can "connect" to it using "vagrant ssh".

I used to use AWS to spin up instances for testing code, but that can be kind of expensive. Now I can get similar virtual machines in about the same amount of time, but for freeeeeeee!!!! 

The Vagrant website's "Getting Started" guide has you downloading and running boxes right away. The guide also points you to the site where you can discover other vagrant boxes that are free to download and use. 
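
Outside of a course repo, the basic flow is only a few commands (the box name here is just the example box from Vagrant's guide - any box from the catalog works):

vagrant init hashicorp/precise64   # writes a Vagrantfile that points at the box
vagrant up                         # downloads the box on the first run, then boots it
vagrant ssh                        # opens a shell inside the running box
vagrant destroy                    # tears the box down when you're done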

I've seen references to vagrant numerous times, but never bothered to check it out before going through the Udacity course. I can't wait to see what I can hit with my vagrant hammer!

Saturday, March 21, 2015

Using CursorMark for deep paging in Solr

Have you ever had to search through 100 million XML blobs to identify which ones had some specific data (or, in my case, were missing some data)? It takes forever, and it's not very fun. I had a situation at work where it looked like I was going to have to do just that. Fortunately, we have the same data (more or less) in Solr cores, so I just needed to do a NOT query on a specific field.

The Solr documentation suggests that you use a CursorMark parameter in your query if you need to page through more than a thousand records. 

The following is an example of using cursorMark with SolrJ. The query returns all documents that do not have data for targetField, and the code loops through the matches using the cursorMark parameter to tell Solr where to retrieve the next set of matches. One condition of using cursorMark to page through results is that you need to sort on a unique field; the schema we are using has a unique field named "uid".

Something to note is that I had to explicitly set "timeAllowed" to 0 to say that I didn't want any time restrictions on the query.

Yonik Seeley made an interesting point on his Solr 'n Stuff blog about using cursorMark: you can change the returned fields, the facet fields, and the number of rows returned for a specific cursorMark, since the cursorMark contains the state of the current search - the state isn't stored server-side. This makes it very easy for a client to let a user vary their experience while paging through results, and it feels very similar to paging through results with MySQL and other databases.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CursorMarkParams;
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class FindMissingData {

  private static final int MAX_ALLOWED_ROWS=1000;
  private static final String URL_PREFIX="http://somesolrserver:8983/solr/thecore";
  private static final String ZIP_ENTRY_NAME="TheData.txt";
  private static final String UNIQUE_ID_FIELD="uid";
  private static final String SOLR_QUERY="NOT targetField:[* TO *]";

  public static void main(String[] args) {

    if (args.length != 2) {
      System.out.println("java FindMissingData <rows per batch - max is 1000> <output zip file>");
      return;
    }

    int maxRows = Integer.parseInt(args[0]);
    String outputFile = args[1];

    // Delete zip file if it already exists - going to recreate it anyway
    File f = new File(outputFile);
    if(f.exists() && !f.isDirectory()) {
      f.delete();
    }

    FindMissingData.writeIdsForMissingData(outputFile, maxRows);
  }

  public static void writeIdsForMissingData(String outputFile, int maxRowCount) {

    if (maxRowCount > MAX_ALLOWED_ROWS) 
      maxRowCount = MAX_ALLOWED_ROWS;

    FileOutputStream fos = null;
    ZipOutputStream zos = null;

    try {
      fos = new FileOutputStream(outputFile, true);
      zos = new ZipOutputStream(fos);
      zos.setLevel(9);

      queryForMissingData(maxRowCount, zos);
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      if (zos != null) {
        try {
          zos.flush();
          zos.close();
        } catch (Exception e) {}
      }
      if (fos != null) {
        try {
          fos.flush();
          fos.close();
        } catch (Exception e) {}
      }
    }
  }

  private static void queryForMissingData(int maxRowCount, ZipOutputStream zos) throws IOException {
    ZipEntry zipEntry = new ZipEntry(ZIP_ENTRY_NAME);
    zos.putNextEntry(zipEntry);


    SolrServer server = new HttpSolrServer(URL_PREFIX);

    SolrQuery q = new SolrQuery(SOLR_QUERY);
    q.setFields(UNIQUE_ID_FIELD);
    q.setRows(maxRowCount);
    q.setSort(SolrQuery.SortClause.desc(UNIQUE_ID_FIELD));

    // You can't use "TimeAllowed" with "CursorMark"
    // The documentation says "Values <= 0 mean 
    // no time restriction", so setting to 0.
    q.setTimeAllowed(0);

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    QueryResponse rsp = null;
       
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      try {
        rsp = server.query(q);
      } catch (SolrServerException e) {
        e.printStackTrace();
        return;
      }

      writeOutUniqueIds(rsp, zos);

      String nextCursorMark = rsp.getNextCursorMark();
      if (cursorMark.equals(nextCursorMark)) {
        done = true;
      } else {
        cursorMark = nextCursorMark;
      }
    }
  }

  private static void writeOutUniqueIds(QueryResponse rsp, ZipOutputStream zos) throws IOException {
    SolrDocumentList docs = rsp.getResults();

    for(SolrDocument doc : docs) {
      zos.write(
        String.format("%s%n",
          doc.get("uid").toString()).getBytes()
      );
    }
  }
}

Tuesday, March 3, 2015

Using Mockito to help test retry code...

There are times when you want to make sure that your code can retry an operation that might occasionally fail due to environment issues. For example, while using the AWS SDK to get an object from S3, you might sometimes get an AmazonClientException - perhaps because an endpoint is temporarily unreachable. You can simulate this situation by using Mockito to mock the AmazonS3Client.

Mockito provides a fluent-style way to define how your mock objects respond. Here is an example:

    AmazonS3Client mockClient = mock(AmazonS3Client.class);
    S3Object mockObject = mock(S3Object.class);

    when(mockClient.getObject(anyString(), anyString()))
        .thenThrow(new AmazonClientException("Something bad happened."))
        .thenReturn(mockObject);

This will cause the code to throw an AmazonClientException on the first call to getObject(), but then return the mocked S3Object on the second call to getObject().

This means that you can have happy-path tests where the call eventually succeeds, as well as a test that shows how your code handles every retry failing, just by setting up your mock with the desired sequence of "thenThrow" and "thenReturn" entries.
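
As a rough sketch (the retry helper and the class/method names below are hypothetical - just enough code to exercise the mock), a "fails once, then succeeds" test could look like this:

import static org.junit.Assert.assertSame;
import static org.mockito.Mockito.*;

import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;
import org.junit.Test;

public class S3RetryTest {

  // Hypothetical retry helper: keeps calling getObject until it
  // succeeds or maxAttempts is reached (assumes maxAttempts >= 1).
  static S3Object getObjectWithRetries(AmazonS3Client client,
      String bucket, String key, int maxAttempts) {
    AmazonClientException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return client.getObject(bucket, key);
      } catch (AmazonClientException e) {
        lastFailure = e; // a real implementation might back off before retrying
      }
    }
    throw lastFailure;
  }

  @Test
  public void secondAttemptSucceeds() {
    AmazonS3Client mockClient = mock(AmazonS3Client.class);
    S3Object mockObject = mock(S3Object.class);

    when(mockClient.getObject(anyString(), anyString()))
        .thenThrow(new AmazonClientException("Something bad happened."))
        .thenReturn(mockObject);

    // First call throws, second call returns the mocked object.
    assertSame(mockObject, getObjectWithRetries(mockClient, "bucket", "key", 2));
    verify(mockClient, times(2)).getObject(anyString(), anyString());
  }
}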