Thursday, May 18, 2017

Refactoring Java code using lambdas - Stream.filter

I’ve been going through a book and absolutely loving it - Java 8 in Action: Lambdas, Streams, and functional-style programming.

The thing that I’ve enjoyed the most so far is the notes on when to use certain methods. In the chapter on streams, chapter 3, there is one particular method that really stood out for me - the filter method.

The filter method takes a predicate, and returns a stream of all the elements that match the predicate.

The book points out the following:

“Any time you’re looping over some data and checking each element, you might want to think about using the new filter method on Stream.”

Here is an example - the following code will print out even numbers:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6,7,8);

for (int num : numbers) {
    if (num % 2 == 0) {
        System.out.println(num);
    }
}

The code above is pretty straightforward, but it could be written like this instead:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6,7,8);

numbers.stream()
   .filter(n -> n % 2 == 0)
   .forEach(System.out::println);

This might not look like a huge advantage, because there isn’t much difference between the two and the amount of code is basically the same. However, the benefit of the lambda version is that it can be chained together with other Stream methods.

For example, imagine that you have some data containing user IDs, and some user IDs have a special prefix to indicate a special user type. You might want to select just the special users and return a list of their IDs without the prefix and in upper case.

List<String> userIds = Arrays.asList("*alice", "bob", "", "*", "charlie", "*dana", "evelyn", "*frank");

return userIds.stream()
   .filter(u -> u.startsWith("*") && u.length() > 1)
   .map(u -> u.substring(1).toUpperCase())
   .collect(Collectors.toList());

To do the same thing without using streams would look something like this:

List<String> userIds = Arrays.asList("*alice", "bob", "", "*", "charlie", "*dana", "evelyn", "*frank");

List<String> specialUsers = new ArrayList<>();

for (String user : userIds) {
   if (user.startsWith("*") && user.length() > 1) {
      specialUsers.add(user.substring(1).toUpperCase());
   }
}

return specialUsers;

You can see that the loop version requires declaring another variable just to hold the special users. Kind of a waste.


I’ll definitely be keeping my eyes open for code that is iterating over lists and inspecting each element!



Monday, March 7, 2016

Cleaning up after aws cli on Mac OS X...

I've installed the aws command line on my Mac. It's super handy. However, the aws s3 command creates $folder$ files for every "directory" when a recursive copy is performed. It's super annoying. 

For example, you could have a "directory" in S3 named "myfiles". When you download the objects with "myfiles" in the path you will end up with a file named "myfiles_$folder$".

Running aws --version returns this info:

    aws-cli/1.10.6 Python/2.7.10 Darwin/14.5.0 botocore/1.3.28


I haven't found anything that explains how I can prevent those files from being created, so I've been doing manual cleanup afterwards.  This is the command I run:

    > rm $(find . -name '*$folder$')
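
If your version of find supports the -delete action (both the BSD find that ships with OS X and GNU find do), the same cleanup works without the subshell:

    > find . -name '*$folder$' -delete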



Tuesday, January 12, 2016

Debugging a local Spark job using IntelliJ

A coworker was working on a local Spark job and shared how he set up his environment for debugging the job (which is basically the same as debugging any other remote process). These are the instructions I followed:

1. Create a remote debug configuration.


- Go to IntelliJ's "Run | Edit Configurations" screen
- Click on the "+" to "Add New Configuration"
- Select "Remote"


2. Copy the command line argument to use and modify it however you see fit.


I'm using Java 8, so I used the example command line arguments from the top edit box. The only change I made was to set "suspend=y" so the spark job would stop and wait for me to start my "Remote Debug" process.

This is what I used: 
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

3. Export the command line arg as SPARK_JAVA_OPTS (Spark uses this value when you submit a spark job).

I set the SPARK_JAVA_OPTS like this:

export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

4. Start the spark job.
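
How you start the job depends on your project, but with spark-submit (run from the same shell where SPARK_JAVA_OPTS was exported) it might look something like this - the class, master, and jar names below are just placeholders:

bin/spark-submit --class com.example.MySparkJob --master local[*] target/my-spark-job.jar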

You should see your spark job start up, and then pause with the following line printed on the console:

Listening for transport dt_socket at address: 5005

5. In IntelliJ, create whatever breakpoints you want to use and start the remote debug configuration. 


Thursday, August 27, 2015

Simple gremlin queries in Titan...

Titan is an open source graph database and, even though it isn't as easy to set up as Neo4j, it is easy enough to start using in just a few minutes. Here is an example of using Titan with HBase as the backend storage.

Setup

I'm using HBase 0.98.13. I downloaded HBase, untarred the files, changed to the HBase directory, and ran "bin/start-hbase.sh".

I cloned Titan from the github repository and built it using the maven package command.

git clone https://github.com/thinkaurelius/titan.git
cd titan
mvn clean package

I started gremlin using "bin/gremlin.sh".

I followed the Titan HBase instructions for initializing Titan for use with HBase from within the gremlin console. You can also create a properties file that contains the same settings and load it from the gremlin console.
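
For example, a properties file with the same settings might look like this (I'm assuming a file named conf/titan-hbase.properties here - any path works):

storage.backend=hbase
storage.hbase.table=test

It could then be loaded from the gremlin console with g = TitanFactory.open('conf/titan-hbase.properties').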

Gremlin

conf = new BaseConfiguration();
conf.setProperty("storage.backend","hbase");
conf.setProperty("storage.hbase.table", "test") 

g = TitanFactory.open(conf);

Now let's add some vertices.

alice = g.addVertexWithLabel('human')
alice.setProperty('name', 'alice')
alice.setProperty('age',25)
bob = g.addVertexWithLabel('human')
bob.setProperty('name', 'bob')
bob.setProperty('age',21)
clark = g.addVertexWithLabel('human')
clark.setProperty('name', 'clark')
clark.setProperty('age',93)
darwin = g.addVertexWithLabel('human')
darwin.setProperty('name', 'darwin')
darwin.setProperty('age',206)
ernie = g.addVertexWithLabel('android')
ernie.setProperty('name', 'ernie')


Let's list the vertices and their properties.

g.V().map()
==>{name=ernie}
==>{name=alice, age=25}
==>{name=darwin, age=206}
==>{name=clark, age=93}
==>{name=bob, age=21}

And now let's make these humans be friends with each other.

alice.addEdge('friend', bob)
alice.addEdge('friend', darwin)
bob.addEdge('friend', alice)
bob.addEdge('friend', darwin)
clark.addEdge('friend', darwin)
darwin.addEdge('friend',alice)
darwin.addEdge('friend', bob)
darwin.addEdge('friend', clark)



Now let's remove ernie from the graph.

g.V.has('name', 'ernie').remove()
==>null

Now we can see that ernie is gone

g.V.has('name', 'ernie').map()

(no results displayed, just the gremlin prompt)

Let's add ernie back, but this time he's a human.

ernie = g.addVertexWithLabel('human')
ernie.setProperty('name', 'ernie')



Let's try finding out who has friends

g.V().outE('friend').outV().name
==>darwin
==>darwin
==>darwin
==>alice
==>alice
==>bob
==>bob
==>clark


Wait - what happened? We see an entry for every friend edge, which is exactly what our gremlin query was asking for, but that doesn't look very nice.

Let's try the dedup method.

g.V().outE('friend').outV().dedup().name
==>darwin
==>alice
==>bob
==>clark


Ahh! That's more like it! But how else can we get that list?

g.V.filter{it.outE('friend').hasNext()}.toList()._().name
==>darwin
==>alice
==>bob
==>clark


Nice! We have two ways to get a distinct list.

Friday, August 14, 2015

Add and remove fields from Solr schema using Schema API...

Okay - I know what you're thinking. "How can I quickly update my Solr schema without having to go into the schema config file - perhaps using a REST API?" You can use the Solr Schema API, that's how!

I recently noticed that there is a Schema API in Solr 5.x that can be used to update the Solr schema. You need to have the schemaFactory set to "ManagedIndexSchemaFactory" and the mutable property set to true. If you want to stop allowing the schema to be updated via the API, you can change the mutable property to false.
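
The relevant piece of solrconfig.xml looks something like this ("managed-schema" is the default resource name):

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>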

Here are a few of the things that you can do with the schema API:

View the schema for a collection:

http://localhost:8983/solr/yourcollectionname/schema

View all of the fields in the schema:

http://localhost:8983/solr/yourcollectionname/schema/fields

Example output:

{
  "responseHeader":{
    "status":0,
    "QTime":101
  },
  "fields":[
  {
      "name":"_text_",
      "type":"text_general",
      "multiValued":true,
      "indexed":true,
      "stored":false},
    {
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"somefieldname",
      "type":"lowercase",
      "indexed":true,
      "stored":true},
    {
      "name":"title",
      "type":"strings"
    }
  ]
}


View a specific field in the schema:

http://localhost:8983/solr/yourcollectionname/schema/fields/somefieldname

Example output:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "field":{
    "name":"somefieldname",
    "type":"lowercase",
    "indexed":false,
    "stored":true
  }
}

Now add a new field called "anotherfield" that is of type "text_en", stored, and indexed:

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"anotherfield","type":"text_en","stored":true,"indexed":true }}' http://localhost:8983/solr/yourcollectionname/schema


Now let's see that the field exists:

http://localhost:8983/solr/yourcollectionname/schema/fields/anotherfield

{
  "responseHeader":
  {
    "status":0,
    "QTime":1
  },
  "field":
  {
    "name":"anotherfield",
    "type":"text_en",
    "indexed":true,
    "stored":true
    }
}

Now let's delete the field:

curl -X POST -H 'Content-type:application/json' --data-binary '{"delete-field" : { "name":"anotherfield" }}' http://localhost:8983/solr/yourcollectionname/schema

And check to see that it is deleted:

http://localhost:8983/solr/yourcollectionname/schema/fields/anotherfield

{
  "responseHeader":
  {
    "status":404,
    "QTime":2
  },
  "error":
  {
    "msg":"Field 'anotherfield' not found.",
    "code":404
  }
}

There are other actions that you can perform using the Schema API. Here are a few of them:
- replace a field (see the example below)
- add and remove dynamic field patterns
- view dynamic fields
- add and remove field types
- view field types
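
For example, replacing a field's definition uses the same endpoint with a "replace-field" command. This sketch would redefine a field named "anotherfield" as a stored, non-indexed string (the field name and type are just placeholders - replace-field expects the field to already exist):

curl -X POST -H 'Content-type:application/json' --data-binary '{"replace-field":{"name":"anotherfield","type":"string","stored":true,"indexed":false}}' http://localhost:8983/solr/yourcollectionname/schema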

Wednesday, August 12, 2015

Kill process using a specific port...

I've been debugging an app that will sometimes get into an unresponsive state. The app is using the Play framework, and I've been starting activator with the -jvm-debug option (using port 9999). Sometimes when I try to terminate activator in the terminal, CTRL+C is ignored. I use lsof -i :9999 to find the pid to kill so that I can make changes and restart activator.

A quick search led me to this posting on askubuntu.com, and it shows that you can kill a list of pids returned by lsof by using this form:

kill -9 $(lsof -ti :9999)

The "t" tells lsof to print terse info, and the info that is printed is just the pid.