Today I embarked on automated deployment of one of my basic Dropwizard/Angular.JS projects. This project’s purpose is to extract unstructured data from forum posts and turn it into data directly consumable by an API. It’s split into four major pieces:
Core – reusable representation objects to keep things tidy between phases
Extraction – Extracts data from forum posts by executing a large unwieldy JDBC query and iterating over the results, then pushing each into MongoDB
Query – Dropwizard/JAX-RS REST API for exposing the data found by the extraction phase
Frontend – Angular.JS frontend app that consumes the REST API
As with many projects, this one began with somewhat manual deployments, but today I wanted to automate them. Jenkins is a little heavyweight for a personal project, and I don’t really need a full-blown continuous integration framework. Capistrano is more along the lines of what I wanted, but is too Rails-flavored to be immediately useful for my needs.
Rake
Rake is simple. Stupid simple. It’s a Ruby DSL for running commands, a step up from a hand-rolled bash script.
I had three goals: build the project, deploy the frontend, deploy the API. All three turned out to be simple. Deploying the API builds the project, runs tests, and deploys the uberjar via rsync.
Deploying the frontend simply rsyncs the site remotely. Eventually, it’ll compile assets and version them, but that’s another project. :)
rakefile
require 'net/ssh'

desc "Create a distribution package"
task :package do
  sh "mvn package"
end

namespace "deploy" do
  host = 'my.ssh.host'
  user = 'myuser'
  options = { :keys => '~/.ssh/id_dsa' }

  desc "Deploy the frontend webapp"
  task :frontend do
    remote_dir = "/home/mysite/www/library"
    sh "rsync -avz library-frontend #{host}:#{remote_dir}"
  end

  desc "Deploy the Library REST API (non-rolling)"
  task :api => [:package] do
    remote_dir = "/home/mysite/api"
    sh "rsync library-api/target/library-api-1.0-SNAPSHOT.jar #{host}:#{remote_dir}/library-api.jar"
    Net::SSH.start(host, user, options) do |ssh|
      puts ssh.exec! 'supervisorctl restart library-api'
    end
  end
end
The only aspect of this deployment not managed by Rake is process management, which Supervisord handles easily and efficiently.
It’s not a rolling deploy, but Dropwizard boots so quickly I have no need for a complicated deployment, especially on a personal project.
Recommenders are in use all throughout the web. Chances are, you’ve interacted with dozens of recommender systems over the course of simply finding this article.
Amazon is built on recommender systems.
Using techniques such as item-based recommenders and itemset generation they can not only find what’s relevant to you, they can determine things related to what you’re browsing and further find items frequently bought with yours (and perhaps provide a combo bonus to shake cobwebs from your wallet).
As with any Machine Learning task, the first step is to define your problem. What are you trying to do? Who is your customer? Is it your Business Intelligence team? Is it the consumer browsing your webapp? Is it your fault detection system attempting to draw relations? Spike detection?
The problem I’m attempting to solve is a somewhat common one. On a site (which will remain nameless for now) members rate threads containing media. For the sake of simplicity, let’s say the threads are albums posted by members along with their reviews of said albums. The members of this site are extremely opinionated.
The problem with a 1-5 star rating system is that it really doesn’t tell you anything about the content other than the perceived quality of the resource. It unfortunately loses a lot of resolution which might be important to your use case. Some members may enjoy an album for its rich harmonics and energetic, uplifting lyrics. Some may hate it because it’s a social commentary on the sad state of US Foreign Policy. Such users generally fall into cohorts of like-minded users who all tend to rate things on similar scales. Nickelback’s newest album could easily garner a rotten 1-star from all the Sabaton and Hammerfall fans.
The theory is to exploit this similarity between members to cluster them. Once clustered, the sets of everything the users in a cluster have rated can be overlaid, and with very basic set magic we can find content a user hasn’t seen that, based on his cohort, could lead to a very interesting musical adventure for him. Absolutely fantastic! How about some details? This is known as collaborative filtering.
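The set magic is easier to see with a toy example. This is a minimal sketch in plain Ruby, not anything from Mahout: the member names, album names, and the `unseen_items` helper are all made up for illustration, and the cohort is simply handed in rather than computed.

```ruby
require 'set'

# Hypothetical data: the albums each member has rated, keyed by member name.
RATED = {
  'alice' => Set['album_a', 'album_b', 'album_c'],
  'bob'   => Set['album_b', 'album_c', 'album_d'],
  'carol' => Set['album_c', 'album_e']
}

# Union everything the cohort has rated, then subtract what the member has
# already seen. Whatever remains is a candidate recommendation.
def unseen_items(rated, member, cohort)
  cohort_items = cohort.reduce(Set.new) { |acc, other| acc | rated.fetch(other) }
  cohort_items - rated.fetch(member)
end

candidates = unseen_items(RATED, 'alice', %w[bob carol])
puts candidates.sort.join(', ')
```

A real recommender also has to rank those candidates, which is where the similarity scores discussed next come in.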
The first step is a neighborhood generation algorithm. A user’s neighborhood is an attempt to define that user’s cohort. Having a neighborhood allows you to sample from it and create a network. In Mahout, the UserNeighborhood interface has two main implementations: ThresholdUserNeighborhood and NearestNUserNeighborhood. A threshold neighborhood includes every user whose similarity to you meets a given threshold; a nearest-N neighborhood includes the N users most similar to you.
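To make the idea of a neighborhood concrete, here is a rough sketch in plain Ruby of two common ways to pick one: by similarity threshold, or by taking a fixed number of nearest neighbors. It is not Mahout’s implementation. The similarity scores are invented, and both helper names are my own.

```ruby
# Hypothetical similarity scores between one member and everyone else,
# on a -1.0..1.0 scale. In practice a similarity metric would produce these.
SIMILARITY_TO_ALICE = {
  'bob'   => 0.92,
  'carol' => 0.15,
  'dave'  => 0.71,
  'erin'  => -0.40
}

# Keep every member whose similarity meets or exceeds the threshold.
def threshold_neighborhood(similarities, threshold)
  similarities.select { |_member, score| score >= threshold }.keys
end

# Keep the n most similar members, however similar they happen to be.
def nearest_n_neighborhood(similarities, n)
  similarities.sort_by { |_member, score| -score }.first(n).map(&:first)
end

puts threshold_neighborhood(SIMILARITY_TO_ALICE, 0.7).inspect
puts nearest_n_neighborhood(SIMILARITY_TO_ALICE, 3).inspect
```

The trade-off: a threshold can return an empty neighborhood for a member with unusual taste, while nearest-N always returns someone, even if they aren’t very similar.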
Similarity is a mathematical operation that takes two users (or items) and compares them quantitatively. You can read more on similarity metrics.
Pearson Similarity is a good choice for rating-based data. It’s the same algorithm used in determining the correlation of data: look at a scatterplot, is it trending or is it unfocused bullshit?
Pearson correlation determines if your data generally trends. We’ll exploit this to find users similar to each other, even the ones that consistently rate items at different baselines.
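Here is a small sketch in plain Ruby of why Pearson copes with different baselines. The `pearson` helper is the textbook formula written out by hand, not Mahout’s PearsonCorrelationSimilarity, and the three rating lists are invented.

```ruby
# Pearson correlation between two equally sized lists of ratings.
# Returns nil when either list has no variance (the correlation is undefined).
def pearson(xs, ys)
  raise ArgumentError, 'lists must be the same size' unless xs.size == ys.size

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  spread_x = Math.sqrt(xs.sum { |x| (x - mean_x)**2 })
  spread_y = Math.sqrt(ys.sum { |y| (y - mean_y)**2 })
  return nil if spread_x.zero? || spread_y.zero?

  covariance / (spread_x * spread_y)
end

# Hypothetical ratings of the same four albums by three members.
GENEROUS = [5, 4, 5, 3] # rates everything high
HARSH    = [3, 2, 3, 1] # same taste, two stars lower across the board
CONTRARY = [1, 2, 1, 3] # likes what the other two dislike

puts pearson(GENEROUS, HARSH).round(3)
puts pearson(GENEROUS, CONTRARY).round(3)
```

The generous and harsh raters never give the same score to an album, yet they correlate almost perfectly because their ratings rise and fall together. The contrary rater lands at the opposite end of the scale.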
Cool. So we’ve defined our similarity metrics. We’ve explored neighborhoods and decided a threshold-based neighborhood might work well. What’s next? Let’s get some data.
Preparing your data
Preparing your data is easy. All you need is a CSV (comma-separated) or TSV (tab-separated). For ease of reading, I’ll be using CSV but TSV works just as well.
The format of a mahout datafile looks a little like this:
userID,itemID[,preference[,timestamp]]
userid and itemid are parsed as longs
preference value is a double, and is optional. If omitted, it’ll be a binary preference.
timestamp, useful but optional
Additional fields, and lines prefixed with #, are ignored
By the end of your extraction, any .csv will do. You can optionally compress your datafile with gzip or zip and the FileDataModel will have no problems reading it.
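If you want to sanity-check a datafile before handing it to Mahout, the format is simple enough to parse by hand. This is a sketch in plain Ruby under my own assumptions: the `Preference` struct and `parse_preference` helper are hypothetical, the sample rows are invented, and edge cases may not be handled the way FileDataModel handles them.

```ruby
Preference = Struct.new(:user_id, :item_id, :value, :timestamp)

# Parse one line of the userID,itemID[,preference[,timestamp]] format.
# Returns nil for blank lines and for comment lines starting with '#'.
# Fields beyond the fourth are ignored.
def parse_preference(line)
  line = line.strip
  return nil if line.empty? || line.start_with?('#')

  user_id, item_id, value, timestamp = line.split(',')
  Preference.new(
    Integer(user_id, 10),
    Integer(item_id, 10),
    # Assumption of this sketch: a missing or empty value means "no rating given".
    value.nil? || value.empty? ? nil : Float(value),
    timestamp.nil? || timestamp.empty? ? nil : Integer(timestamp, 10)
  )
end

# Hypothetical sample rows; the last one carries an extra fifth column.
SAMPLE = <<~CSV
  # userID,itemID,preference,timestamp
  101,5001,4.0,1325376000
  102,5001,1.0
  103,5002
  104,5003,5.0,1325376000,17
CSV

PREFERENCES = SAMPLE.each_line.map { |line| parse_preference(line) }.compact
PREFERENCES.each { |pref| puts pref.to_a.inspect }
```

`Integer` and `Float` raise on malformed input, which is what I want here: a bad row should fail loudly rather than slip into the model.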
My MySQL extraction script looks like this:
extract_ratings.sh
echo "SELECT tr.userid,tr.threadid,tr.vote,UNIX_TIMESTAMP(tr.date) as date,t.forumid FROM threadrate tr LEFT JOIN thread t USING(threadid) WHERE date IS NOT NULL" | \
  mysql vbulletin | \
  tr '\t' ',' | \
  tail -n +2 > ratings.csv
The Code
Such code can be written with a basic understanding of the algorithm and a head for details. Fortunately, all the hard work has been done for you in Apache Mahout. Mahout is a machine learning toolkit, a Swiss Army knife for getting stuff done, and doing it fast. Much of Mahout’s work can be farmed off to your favorite Hadoop cluster, which makes it ideal for a lot of data.
I don’t have a lot of data: a little over 50,000 rows, but it’s all real. There isn’t a right answer. Another good dataset to play with is the GroupLens dataset, as the parameters of that data have been well established.
The next step is to build a basic evaluator. You can use this to explore the effect of different similarity measures, user neighborhoods, preference inferrers, and other techniques to improve your algorithm for your use case.
We now have a decent recommender which appears to be returning results well in the validation set. By not training with 100% of the data we can avoid model overfitting and see how well our model and data really fit the problem we’re trying to solve.
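The hold-out idea can be sketched without Mahout at all. The plain Ruby below is an illustration under loud assumptions: the ratings are invented, the helper names are mine, and the "recommender" is only a stand-in that predicts each member’s mean training rating. The point is the shape of the evaluation (train on part of the data, score on the rest), not the predictor.

```ruby
# Split ratings into training and held-out sets. Each rating is [user, item, value].
def split_ratings(ratings, training_fraction, rng)
  shuffled = ratings.shuffle(random: rng)
  cut = (shuffled.size * training_fraction).floor
  [shuffled.first(cut), shuffled.drop(cut)]
end

# Stand-in predictor: a member's mean training rating, falling back to the
# global mean for members with no training data. NOT a real recommender.
def build_mean_predictor(training)
  global_mean = training.sum { |_user, _item, value| value } / training.size.to_f
  user_means = training.group_by(&:first).transform_values do |rows|
    rows.sum { |_user, _item, value| value } / rows.size.to_f
  end
  ->(user, _item) { user_means.fetch(user, global_mean) }
end

# Mean absolute error of the predictor over the held-out ratings.
def mean_absolute_error(predictor, held_out)
  total = held_out.sum { |user, item, actual| (predictor.call(user, item) - actual).abs }
  total / held_out.size.to_f
end

# Hypothetical ratings.
RATINGS = [
  ['alice', 1, 5.0], ['alice', 2, 4.0], ['alice', 3, 5.0], ['alice', 4, 4.0],
  ['bob',   1, 2.0], ['bob',   2, 1.0], ['bob',   3, 2.0], ['bob',   4, 1.0],
  ['carol', 1, 3.0], ['carol', 2, 3.0]
]

training, held_out = split_ratings(RATINGS, 0.8, Random.new(42))
predictor = build_mean_predictor(training)
puts format('MAE on held-out ratings: %.3f', mean_absolute_error(predictor, held_out))
```

Swapping in a different predictor and comparing the error on the same held-out set is how you would compare similarity measures or neighborhoods against each other.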
For my use case, Pearson worked better than the rest. Extend GenericUserBasedRecommender and we’re off to the races!
The recommender works, but then the question becomes: how do you make it useful? In a Java application, the answer is relatively simple: just embed it. But this is vBulletin, a PHP forum package, so we need to interface the Java and PHP components. With that in mind, I set out to wrap the recommender in an HTTP interface usable both cross-process and via XHR (AJAX) requests.
The first technology I tried was the mahout-integrations package. The mahout-integrations package contains a servlet useful for serving out recommendations as a plain-text response, XML, or JSON. The configuration is easy as well, and you can embed Jetty in a shaded jar with almost no issues. The code is pretty weak sauce, but if it works, it works, right?
Wrong. While the Jetty+Servlet approach is easy, I have the requirement of updating tastes on my recommender as well as periodically refreshing the datamodel without a full stop/start cycle of the application (during which time it’d be unavailable – unacceptable, this is life or death :) ). I set out to enhance the default servlet to support this behavior, only to be handily thwarted by final classes. My initial thought was to delegate to the recommender (hah).
Okay, I understand the Open/Closed Principle, but in practice it’s a headache and a half if the class isn’t written to support common operations. Normally I’d delegate to the servlet, but the recommender itself is marked private and I didn’t really want to deal with using the RecommenderSingleton to get the recommender again for an operation which should already be supported. In the end I grabbed the source code and modified it to my own needs, flying brazenly in the face of DRY code and OCP.
This worked in production, but I really didn’t like it. As my requirements shifted and I felt worse and worse about the code and the momentary downtimes (I needed to support faster reloading), I decided to rewrite the service in Dropwizard. Dropwizard is a high-productivity, lightweight collection of some of the best battle-tested libraries Java has to offer in terms of performance and productivity, wrapped up and given a great environment manager and configuration manager. Mix in Coda Hale’s best-of-breed Metrics library and you’re all set. Like many of the most brilliant ideas, in hindsight Coda’s mashup seems as obvious as it is powerful.
Writing a basic resource to serve out RecommendedItems is incredibly easy. RecommendedItem, already being a basic POJO, is effortlessly serialized by Jackson to JSON. XML is just as easy to produce.
By the end of it, I felt a little disappointed that it wasn’t more arcane, but pleased at how easy everything was.
On the other end, a basic json_decode() makes the data useful again to display to users.
In a future post, I’ll go over using the JDBC backend for recommendations to allow for seamless reloading of preferences, and building endpoints to establish new preferences in real time.
At work, we end up using Groovy. A lot. In fact, most of our infrastructure is built using some combination of Groovy, Java, and a hint of Clojure. When Nathan Marz’s Storm came out we were understandably excited: forcing Hadoop to attempt to be a real time system was an absolute nightmare. Don’t try that. It’s not smart. Hadoop is a batch system.
(Image caption: Get out of there, cat. You are not a data-centric batch processing solution.)
We rearchitected, clearly. Hadoop is being used in the way its creator intended, for bulk processing, and we’re playing with Storm as a way forward for other cool projects, such as moving our realtime backend processes over to it. This includes webhook processing and a lot more.
The first step is getting Groovy to run on Storm. In a fit of sleeplessness, I created a shell project and corresponding pom.xml to get everything going.
It’s a very basic project with a slightly modified ExclamationTopology (no inner classes or semicolons) and a working pom.xml which uses antrun to compile the Groovy code into an executable jar. After all is said and done, one need only run java -jar target/groovy-storm-0.1.0-jar-with-dependencies.jar
Pretty simple, now I can get started on the hard work and use the storm-contrib packages to feed from SQS (our primary message queue).
Working on a data platform for my 9th and final data mining project on the KDD Cup dataset. Transforming four text files of data and relations into a real, queryable dataset. In practice, I’m creating a data warehouse, albeit extraordinarily simplified.
Data transformation is almost never fun. It’s rote work, it’s dry, and time consuming. Today I decided to move away from my general MO for one-off data transformers and not embed SQL directly into the project.
Instead I decided to work with ActiveRecord. ActiveRecord is a wonderful Object-Relational Mapper pattern in use by many different libraries in tons of languages. Ruby, the general language of my choice in this course (and one of my favorite languages in general, I love Ruby) has a great ActiveRecord package in use by Ruby on Rails.
It was surprisingly easy to implement my own models and begin using them with ActiveRecord outside of Rails.
The KDD Cup dataset defines a related set of Tracks, Genres, Albums, and Artists. All very routine for a relational database. My relational DB of choice being SQLite, given the lack of setup required and ease of sharing the database afterwards. (Update: Switched to local MySQL due to inability to run queries in one window and work on the dataset in another) Cue inefficient code….
With just a few lines of ActiveRecord setup, I now have all of my models available to me for easy processing. (Avoid .count on large InnoDB tables, it’s much faster to count returned objects.)
Suddenly, a whole world of easily accessible summary statistics opens up via a simple, familiar DSL. Exploring the sample dataset should be a snap. Now to create a classifier good enough to predict user ratings!
Next step: JRuby + AR + Weka. Weka is a Data Mining toolkit with good algorithms for exploring data; integration between AR and Weka should be a real treat!
I’ve read that all programmers should blog. Doing so forces us to think critically about what we do and to explain ourselves – certainly something I lack at times. This is my attempt to rectify my shortcomings in terms of being able to express myself in words.
This blog, for myself, is also somewhat catharsis. I’m a Computer Science student, a scant semester away from graduating from the somewhat known Colorado School of Mines. Though in graduating, I’ll turn around the same day and ensure I’m registered for my classes of Fall 2012. I’ve become entangled in a 5-year combined bachelors/masters program. A frustrating endeavor. I love my field dearly but loathe the oftentimes frustrating coursework and exams which do little to test knowledge or problem solving, only testing recall.
As anyone who knows me will tell you, recall isn’t always my strongest quality.
To whinge about my academic woes is not the focus of this blog. I’m a Software Engineer, Computer Scientist, and tinkerer. I’ve found myself wanting to write down some of my thoughts on software and design practices. My explorations into Scala and Clojure are topics I feel are worth writing about. Maybe one day someone will read my blog and gain some useful insight (or indeed, correct my incorrect insights).