Stand Back, I'm Attempting Science

Blog on software engineering and emerging technologies

Groovy on Storm

At work, we end up using Groovy. A lot. In fact, most of our infrastructure is built using some combination of Groovy, Java, and a hint of Clojure. When Nathan Marz's Storm came out we were understandably excited: forcing Hadoop to attempt to be a real time system was an absolute nightmare. Don't try that. It's not smart. Hadoop is a batch system.

[Image]

Get out of there cat. You are not a data-centric batch processing solution.

So we rearchitected. Hadoop is now being used the way its creator intended, for bulk processing, and we're playing with Storm as a way to move forward with other cool projects, such as moving our realtime backend processes onto it. That includes webhook processing and a lot more.

The first step is getting Groovy to run on Storm. In a fit of sleeplessness, I created a shell project and corresponding pom.xml to get everything going.

https://github.com/xorlev/groovy-storm

It's a very basic project with a slightly modified ExclamationTopology (no inner classes or semicolons) and a working pom.xml that uses antrun to compile the Groovy code into an executable jar. After all is said and done, one need only run: java -jar target/groovy-storm-0.1.0-jar-with-dependencies.jar

Pretty simple. Now I can get started on the hard work and use the storm-contrib packages to feed from SQS (our primary message queue).

 

Filed under  //   distributed computing   programming   storm  

Data Platforms & ActiveRecord

I'm working on a data platform for my 9th and final data mining project, which uses the KDD Cup dataset. The job is to transform four text files of data and relations into a real, queryable dataset. In practice, I'm creating a data warehouse, albeit an extraordinarily simplified one.

Data transformation is almost never fun. It's rote, dry, and time-consuming work. Today I decided to move away from my general MO for one-off data transformers and not embed SQL directly into the project.

Instead I decided to work with ActiveRecord. Active Record is a wonderful object-relational mapping pattern implemented by many libraries across many languages. Ruby, my language of choice for this course (and one of my favorite languages in general; I love Ruby), has a great ActiveRecord package, the one used by Ruby on Rails.

It was surprisingly easy to implement my own models and begin using them with ActiveRecord outside of Rails.

The KDD Cup dataset defines a related set of Tracks, Genres, Albums, and Artists, all very routine for a relational database. My relational DB of choice was SQLite, given the lack of setup required and the ease of sharing the database afterwards. (Update: I switched to a local MySQL instance because I couldn't run queries in one window while working on the dataset in another.) Cue inefficient code...

With just a few lines of ActiveRecord setup, I now have all of my models available for easy processing. (Avoid .count on large InnoDB tables; it's much faster to count returned objects.)

> Artist.includes(:tracks).limit(10000).map { |a| a.tracks.size }.reduce(:+)/10000.0

=> 15.8011 
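To make the arithmetic in that one-liner explicit, here is a minimal plain-Ruby sketch of the same calculation. It assumes nothing about the real schema: the Track and Artist structs, their fields, the sample data, and the mean_tracks_per_artist helper are all hypothetical stand-ins for the ActiveRecord models, so it runs without a database.

```ruby
# Plain-Ruby stand-ins for the ActiveRecord models. The field names are
# assumptions for illustration, not the KDD Cup schema.
Track  = Struct.new(:title)
Artist = Struct.new(:name, :tracks)

# Hypothetical sample data: three artists with 2, 0, and 4 tracks.
artists = [
  Artist.new("A", [Track.new("a1"), Track.new("a2")]),
  Artist.new("B", []),
  Artist.new("C", [Track.new("c1"), Track.new("c2"),
                   Track.new("c3"), Track.new("c4")])
]

# Mean tracks per artist: count each artist's tracks, sum the counts,
# then divide by a Float so the result is not truncated to an Integer.
def mean_tracks_per_artist(artists)
  return 0.0 if artists.empty?

  track_counts = artists.map { |artist| artist.tracks.size }
  track_counts.reduce(0, :+) / artists.size.to_f
end

puts mean_tracks_per_artist(artists) # prints 2.0
```

One design choice differs from the console line: the sketch divides by the number of artists actually returned rather than by the limit, so the mean stays correct if the query hands back fewer rows than requested.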

Suddenly, a whole world of easily accessible summary statistics opens up via a simple, familiar DSL. Exploring the sample dataset should be a snap. Now to create a classifier good enough to predict user ratings!

Next step: JRuby + AR + Weka. Weka is a data mining toolkit with good algorithms for exploring data; integration between AR and Weka should be a real treat!

Filed under  //   active record   data   data mining   ruby  

Initial Post

I've read that all programmers should blog. Doing so forces us to think critically about what we do and to explain ourselves -- certainly something I lack at times. This is my attempt to rectify my shortcomings in terms of being able to express myself in words.

This blog is also somewhat cathartic for me. I'm a Computer Science student, a scant semester away from graduating from the somewhat known Colorado School of Mines. On graduating, though, I'll turn around the same day and make sure I'm registered for my Fall 2012 classes: I've become entangled in a 5-year combined bachelor's/master's program. A frustrating endeavor. I love my field dearly but loathe the oftentimes frustrating coursework and exams, which do little to test knowledge or problem solving and test only recall.

As anyone who knows me will tell you, recall isn't always my strongest quality.

Whinging about my academic woes is not the focus of this blog. I'm a Software Engineer, Computer Scientist, and tinkerer. I've found myself wanting to write down some of my thoughts on software and design practices. My explorations into Scala and Clojure are topics I feel are worth writing about. Maybe one day someone will read my blog and gain some useful insight (or indeed, correct my incorrect insights).

Here's to hoping I succeed.

Filed under  //   obligatory   programming