Thursday, August 27, 2015

Simple gremlin queries in Titan...

Titan is an open source graph database and, even though it isn't as easy to set up as Neo4j, it is easy enough to start using in just a few minutes. Here is an example of using Titan with HBase as the backend storage.

Setup

I'm using HBase 0.98.13. I downloaded HBase, untarred the files, changed to the HBase directory, and ran "bin/start-hbase.sh".
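Here is a sketch of those steps (the mirror URL and the -hadoop2 artifact name are assumptions; use whichever 0.98.13 build matches your Hadoop version):

# download, unpack, and start HBase in standalone mode
wget https://archive.apache.org/dist/hbase/hbase-0.98.13/hbase-0.98.13-hadoop2-bin.tar.gz
tar xzf hbase-0.98.13-hadoop2-bin.tar.gz
cd hbase-0.98.13-hadoop2
bin/start-hbase.sh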

I cloned Titan from the GitHub repository and built it using the Maven package command.

git clone https://github.com/thinkaurelius/titan.git
cd titan
mvn clean package

I started gremlin using "bin/gremlin.sh".

I followed the Titan HBase instructions for initializing Titan for use with HBase within the gremlin console. You can also create a properties file that contains the same settings and load it from the gremlin console.
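For example, something like this should work (titan-hbase.properties is a hypothetical file name; it holds the same keys set programmatically below):

# titan-hbase.properties:
storage.backend=hbase
storage.hbase.table=test

Then, in the gremlin console:

g = TitanFactory.open('titan-hbase.properties')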

Gremlin

conf = new BaseConfiguration()
conf.setProperty("storage.backend", "hbase")
conf.setProperty("storage.hbase.table", "test")

g = TitanFactory.open(conf)

Now let's add some vertices.

alice = g.addVertexWithLabel('human')
alice.setProperty('name', 'alice')
alice.setProperty('age',25)
bob = g.addVertexWithLabel('human')
bob.setProperty('name', 'bob')
bob.setProperty('age',21)
clark = g.addVertexWithLabel('human')
clark.setProperty('name', 'clark')
clark.setProperty('age',93)
darwin = g.addVertexWithLabel('human')
darwin.setProperty('name', 'darwin')
darwin.setProperty('age',206)
ernie = g.addVertexWithLabel('android')
ernie.setProperty('name', 'ernie')


Let's list the vertices and their properties.

g.V().map()
==>{name=ernie}
==>{name=alice, age=25}
==>{name=darwin, age=206}
==>{name=clark, age=93}
==>{name=bob, age=21}

And now let's make these humans be friends with each other.

alice.addEdge('friend', bob)
alice.addEdge('friend', darwin)
bob.addEdge('friend', alice)
bob.addEdge('friend', darwin)
clark.addEdge('friend', darwin)
darwin.addEdge('friend',alice)
darwin.addEdge('friend', bob)
darwin.addEdge('friend', clark)
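A hedged note: Titan buffers these mutations in an open transaction, so commit before you exit the console if you want them persisted to HBase:

g.commit()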



Now let's remove ernie from the graph.

g.V.has('name', 'ernie').remove()
==>null

Now we can see that ernie is gone:

g.V.has('name', 'ernie').map()

(no results displayed, just the gremlin prompt)

Let's add ernie back, but this time he's a human.

ernie = g.addVertexWithLabel('human')
ernie.setProperty('name', 'ernie')



Let's try finding out who has friends:

g.V().outE('friend').outV().name
==>darwin
==>darwin
==>darwin
==>alice
==>alice
==>bob
==>bob
==>clark


Wait - what happened? We see an entry for every friend edge, which is exactly what our gremlin query was asking for, but that doesn't look very nice.

Let's try the dedup method.

g.V().outE('friend').outV().dedup().name
==>darwin
==>alice
==>bob
==>clark


Ahh! That's more like it! But how else can we get that list?

g.V.filter{it.outE('friend').hasNext()}.toList()._().name
==>darwin
==>alice
==>bob
==>clark


Nice! We have two ways to get a distinct list.

Friday, August 14, 2015

Add and remove fields from Solr schema using Schema API...

Okay - I know what you're thinking. "How can I quickly update my Solr schema without having to edit the schema config file - perhaps using a REST API?" You can use the Solr Schema API, that's how!

I recently noticed that there is a Schema API in Solr 5.x that can be used to update the Solr schema. You need to have the schemaFactory set to "ManagedIndexSchemaFactory" and the mutable property set to true. If you want to stop allowing the schema to be updated via the API, change the mutable property to false.
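That setup lives in solrconfig.xml and looks something like this minimal sketch (managed-schema is the default resource name):

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>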

Here are a few of the things that you can do with the schema API:

View the schema for a collection:

http://localhost:8983/solr/yourcollectionname/schema

View all of the fields in the schema:

http://localhost:8983/solr/yourcollectionname/schema/fields

Example output:

{
  "responseHeader":{
    "status":0,
    "QTime":101
  },
  "fields":[
  {
      "name":"_text_",
      "type":"text_general",
      "multiValued":true,
      "indexed":true,
      "stored":false},
    {
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"somefieldname",
      "type":"lowercase",
      "indexed":true,
      "stored":true},
    {
      "name":"title",
      "type":"strings"
    }
  ]
}


View a specific field in the schema:

http://localhost:8983/solr/yourcollectionname/schema/fields/somefieldname

Example output:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "field":{
    "name":"somefieldname",
    "type":"lowercase",
    "indexed":false,
    "stored":true
  }
}

Now add a new field called "anotherfield" that is of type "text_en", stored, and indexed:

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"anotherfield","type":"text_en","stored":true,"indexed":true }}' http://localhost:8983/solr/yourcollectionname/schema
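A successful add returns just a response header with status 0 (the QTime value will vary):

{
  "responseHeader":{
    "status":0,
    "QTime":52}}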


Now let's see that the field exists:

http://localhost:8983/solr/yourcollectionname/schema/fields/anotherfield

{
  "responseHeader":
  {
    "status":0,
    "QTime":1
  },
  "field":
  {
    "name":"anotherfield",
    "type":"text_en",
    "indexed":true,
    "stored":true
  }
}

Now let's delete the field:

curl -X POST -H 'Content-type:application/json' --data-binary '{"delete-field" : { "name":"anotherfield" }}' http://localhost:8983/solr/yourcollectionname/schema

And check to see that it is deleted:

http://localhost:8983/solr/yourcollectionname/schema/fields/anotherfield

{
  "responseHeader":
  {
    "status":404,
    "QTime":2
  },
  "error":
  {
    "msg":"Field 'anotherfield' not found.",
    "code":404
  }
}

There are a few other things that you can do using the Schema API:
- replace a field
- add and remove dynamic field patterns
- view dynamic fields
- add and remove field types
- view field types
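For example, replacing a field uses the same shape as add-field (this sketch reuses the "somefieldname" field from above and just changes its type):

curl -X POST -H 'Content-type:application/json' --data-binary '{"replace-field":{"name":"somefieldname","type":"text_general","stored":true,"indexed":true}}' http://localhost:8983/solr/yourcollectionname/schema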

Wednesday, August 12, 2015

Kill process using a specific port...

I've been debugging an app that sometimes gets into an unresponsive state. The app uses the Play Framework, and I've been starting activator with the -jvm-debug option (using port 9999). Sometimes when I try to terminate activator in the terminal, CTRL+C is ignored. I used lsof -i :9999 to find the pid to kill so that I could make changes and restart activator.

A quick search led me to this posting on askubuntu.com, which shows that you can kill the list of pids returned by lsof by using this form:

kill -9 $(lsof -ti :9999)

The "t" tells lsof to print terse info, and the info that is printed is just the pid.

Tuesday, August 11, 2015

Check for processes across multiple machines...

My group at work needed to know whether certain processes were running on a set of machines, and we didn't want to have to discover that manually. The processes run on Linux machines, and we use Splunk for capturing environment data, so I created a simple script that does the check and writes the results to a text file that Splunk consumes.

The minor issues I had were:

1. I needed to run the script as a specific user due to using ssh as that user to other machines.

Solution: Create a crontab entry for that user, i.e., crontab -u myuser -e

Then add the crontab entry:

*/15 * * * * /opt/myscripts/checkForProcesses.sh

2. I wanted to check for multiple processes, but not all in one command string, so I created a separate "check" string for each unique process. The issue I had was that the check string was interpreted as multiple variables: "ps x | grep stuff | grep -v grep" was treated as "ps", "x", etc.

Solution: I passed the check string in as the last argument to the function. Since the check string started at the 3rd value being passed in, I used the value like this: ${@:3}

The @ means get all of the arguments, and the 3 says to start at the third one.
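Here is a tiny demonstration of that slice syntax (demo is just a throwaway function name):

#!/bin/bash

# print all arguments, then only the arguments from the third one on
demo()
{
    echo "all args: $@"
    echo "args from the third on: ${@:3}"
}

demo one two three four
# all args: one two three four
# args from the third on: three four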


Here is a short version of the script with all details stripped out:

#!/bin/bash
# bash (not plain sh) is required for the ${@:3} slice syntax used below

SCRIPT_DIR=$(dirname $0)
LOG_DIR=/opt/logs
STATUS_FILE=/opt/logs/status/process_status.txt

PROC1_CHECK="ps x | grep myproc1 | grep -v grep | grep -v less | grep java"
PROC2_CHECK="ps x | grep myproc2 | grep -v grep | grep -v less | grep java"
WORKER_LIST=$(cat $SCRIPT_DIR/worker.list)

# get the date/time for Splunk
DATE_VAL=`date`

rm -f $STATUS_FILE

# checkStatus: $1 = host, $2 = display name for the process,
# $3 and beyond = the remote check command
checkStatus()
{
    echo "Checking $1 for $2."
    # ${@:3} expands to every argument from the third one on, so the whole
    # pipeline is passed to ssh instead of being split into separate words
    STATUS=`ssh myuser@$1 ${@:3}`
    if [ -z "$STATUS" ];
    then
        echo "$DATE_VAL : ERROR : $2 not running on $1." >> $STATUS_FILE
    else
        echo "$DATE_VAL : STATUS : $2 running on $1." >> $STATUS_FILE
    fi

}

#worker.list is a set of machine names to check for certain processes
for worker in $WORKER_LIST; do
    echo "Worker being checked is $worker"
    checkStatus $worker "MyProc1" $PROC1_CHECK
    checkStatus $worker "MyProc2" $PROC2_CHECK
done
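The status file ends up with lines like these (worker1 is a hypothetical host name; the timestamp comes from the plain date call above):

Mon Aug 10 09:15:01 CDT 2015 : STATUS : MyProc1 running on worker1.
Mon Aug 10 09:15:01 CDT 2015 : ERROR : MyProc2 not running on worker1.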

Wednesday, August 5, 2015

Shell script fun - pull, compile, and display errors...

I am working on a project at work that has multiple project dependencies, with people in other time zones constantly updating those projects, so it is helpful to stay in sync and avoid merge issues.

I wrote a shell script to pull and build the projects. We are using Maven, so a nice side effect is that the build outputs "SUCCESS" or "FAILURE" for each project. I just grep for "FAILURE" (or "error"), pipe the result to a text file, and then check whether each text file is empty. If a file has data, I print a build error message and cat the file in red; empty files just get deleted.

Here is a simple version of the script:

#!/bin/bash
# bash (not plain sh) is required for the == string comparisons used below
source ~/.bash_profile

export WORK_HOME=~/dev/source

red=`tput setaf 1`
green=`tput setaf 2`
reset=`tput sgr0`
#echo "${red}red text ${green}green text${reset}"

# grep_file: $1 = project name, $2 = file holding any build errors
grep_file()
{
  if [ -s "$WORK_HOME/$2" ]
  then
    echo "${red}******** $1 had errors!${reset}"
    echo ""
    echo "${red}"
    cat $WORK_HOME/$2
    echo "${reset}"
  else
    rm -f $WORK_HOME/$2
  fi
}

# do_build: $1 = build type ("mci" = mvn clean install, otherwise mvn clean package),
# $2 = project directory, $3 = file to hold any build errors
do_build()
{
  echo ""
  echo "${green}Updating $2.${reset}"
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
  cd $WORK_HOME/$2
  git pull
  if [ "$1" == "mci" ]
  then
    mvn clean install 2>&1 | grep FAILURE | grep -iv failures | grep -v grep > $WORK_HOME/$3
  else
    mvn clean package 2>&1 | grep FAILURE | grep -iv failures | grep -v grep > $WORK_HOME/$3
  fi
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
}

# do_activator_build: $1 = project directory, $2 = file to hold any build errors
do_activator_build()
{
  echo ""
  echo "${green}Running activator build for $1.${reset}"
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
  cd $WORK_HOME/$1
  git pull
  activator update
  activator compile | grep -i error | grep -v grep > $WORK_HOME/$2
  echo "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-="
}

# project 1
do_build "mci" "project1" "proj1_failed.txt"

# project 2
do_activator_build "project2" "proj2_failed.txt"

echo ""
echo "Checking for failures..."
echo ""

grep_file "project1" "proj1_failed.txt"
grep_file "project2" "proj2_failed.txt"
echo "Done."


Here is the output:


Updating project1.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Already up-to-date.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Running activator build for project2.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Already up-to-date.
[info] Loading project definition from /Users/user/dev/source/project2
[info] Updating {file:/Users/user/dev/source/project2/}project2...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[success] Total time: 16 s, completed Aug 5, 2015 11:06:25 PM
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Checking for failures...

Done.