Michael Rose

Software generalist, backend engineer, Hadoop/Storm bit-herder, building distributed systems on the JVM. Working for FullContact.

Overengineering: Log4j2’s AsyncAppender

I spent a little time today thinking about Log4j2’s latest salvo in the war for Java Logging.

For those who aren’t aware, Log4j 1.x is a dying project on life support. It is locked into Java 4 support and can’t take advantage of the Java 5 concurrency primitives that make multi-threaded execution vastly easier. Log4j will stab you in the back at some point. Logback was born, founded by the same people who dreamed up Log4j, and until ‘recently’ it has been Log4j’s successor. Then Log4j was forked, repackaged, and branded Log4j2 (org.apache.logging.log4j). Log4j2 has been cleaning up, adopting Java 5 concurrency (in some cases going even further), and otherwise competing with Logback. Some of this is explained in my presentation, Ninth Circle of Hell, a Gentle Introduction to Java Logging.

Disruptor

Log4j2 has just released a brand-new AsyncAppender. Free of restrictions on library usage, it uses the LMAX Disruptor, a very cool JVM library for lock-free, blisteringly fast internal queueing with pre-allocated ring buffers. In theory, this looks really good for Log4j2, but in the real world it amounts to little more than benchmark fodder (and adds another transitive dependency, ouch).
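To make "pre-allocated ring buffer" a little more concrete, here is a minimal sketch of the idea in plain Java. It is not the Disruptor's API or implementation: the LogRingBuffer class, its Slot type, and the tryPublish/tryConsume methods are hypothetical names invented for illustration, and the sketch assumes exactly one producer thread and one consumer thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative single-producer, single-consumer ring buffer (not the Disruptor itself). */
final class LogRingBuffer {
    /** Reusable slot, allocated once up front so publishing creates no new slot objects. */
    static final class Slot {
        String message;
    }

    private final Slot[] slots;
    private final int mask;
    private final AtomicLong published = new AtomicLong(0); // next sequence the producer writes
    private final AtomicLong consumed = new AtomicLong(0);  // next sequence the consumer reads

    LogRingBuffer(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        slots = new Slot[capacity];
        for (int i = 0; i < slots.length; i++) {
            slots[i] = new Slot(); // pre-allocate every slot
        }
        mask = capacity - 1;
    }

    /** Producer side: returns false instead of blocking when the buffer is full. */
    boolean tryPublish(String message) {
        long seq = published.get();
        if (seq - consumed.get() == slots.length) {
            return false; // full: the consumer has not freed a slot yet
        }
        slots[(int) (seq & mask)].message = message;
        published.set(seq + 1); // volatile write makes the slot visible to the consumer
        return true;
    }

    /** Consumer side: returns null when nothing is available. */
    String tryConsume() {
        long seq = consumed.get();
        if (seq == published.get()) {
            return null; // empty
        }
        Slot slot = slots[(int) (seq & mask)];
        String message = slot.message;
        slot.message = null; // drop the reference; the slot object itself is reused
        consumed.set(seq + 1);
        return message;
    }

    public static void main(String[] args) {
        LogRingBuffer buffer = new LogRingBuffer(4);
        buffer.tryPublish("first log line");
        buffer.tryPublish("second log line");
        System.out.println(buffer.tryConsume());
        System.out.println(buffer.tryConsume());
    }
}
```

The point of the design is that the hot path takes no locks and allocates no queue nodes: the two threads coordinate only through the two sequence counters.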

Log4j2 posted some benchmarks on their Async Loggers page. The benchmarks pit Log4j 1, Log4j 2, and Logback against each other in various configurations. Shown below are the Solaris, JDK7 results without locations (no %class, %location or %line replacements).

[Image: Log4j2 Async Loggers benchmark chart (Solaris, JDK7, without locations)]

Napkin math

For the sake of argument, I’ve taken 2000 lines of output from one of my Java applications that logs with Logback. On average, a line of logging is about 83 characters. 83 characters can be conservatively estimated at 83 bytes. In their tests, they’re using 500-byte payloads.

On Solaris, Log4j2 pushes out roughly 18,000,000 messages per second (64 threads). Meanwhile, Logback pumps out an acceptable 1,360,000 logs/s (64 threads).

On Windows, Log4j2 puts up a mind-boggling 24,754,080 messages/s and Logback 4,269,920/s.

Using Solaris’ numbers, 288,997 logs/s per thread * 64 threads * 83 bytes = 1,535,152,064 bytes/s, or about 1,535 MB/s. Now I’m not sure about you, but I don’t dedicate a Fusion IO RAID to logging.
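For anyone who wants to plug in their own line length or thread count, the napkin math can be sketched as a few lines of Java. The NapkinMath class and its method names are made up for this example; the inputs are the per-thread Solaris rate, thread count, and average line size quoted above, and "MB" here means decimal megabytes (1,000,000 bytes).

```java
import java.util.Locale;

/** Back-of-the-envelope log volume calculator (illustrative helper, not part of any library). */
final class NapkinMath {
    /** Raw bytes of log output per second across all threads. */
    static long bytesPerSecond(long logsPerSecondPerThread, int threads, int bytesPerLine) {
        return logsPerSecondPerThread * threads * bytesPerLine;
    }

    /** Converts bytes to decimal megabytes (1 MB = 1,000,000 bytes). */
    static double toMegabytes(long bytes) {
        return bytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        // 288,997 logs/s per thread, 64 threads, 83 bytes per line
        long bytes = bytesPerSecond(288_997L, 64, 83);
        System.out.println(String.format(Locale.ROOT,
                "%,d bytes/s = %.2f MB/s", bytes, toMegabytes(bytes)));
    }
}
```

Swapping in the 500-byte payload used in the benchmark itself only makes the required disk throughput more absurd.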

Numbers including locations are much lower and pretty similar to each other — logging isn’t the bottleneck.

Ultimately not worth it

I’ve concluded that Log4j2’s AsyncAppender, while neat, is just a shiny toy, and there’s really no appreciable difference for any sane application. Greatest respect for the Log4j2 team, but I really wish they’d focused on building a single, cohesive, next-generation Java logging framework rather than adding yet another logging framework to the mix.

Logback, however, natively implements SLF4J (the de-facto standard for classpath-binding logging) and has more than acceptable performance (especially with locations turned on). It continues to be my logging framework of choice. It’s easy to use and I’ve never run into any issues with it.

If nothing else, avoid Log4j 1.x (or really anything but Log4j2 or Logback) like the plague. In any system with load, you’ll eventually find it causing contention issues.

Dropwizard Deployment for the Lazy Using Rake and Net::SSH

Today I embarked on automated deployment of one of my basic Dropwizard/Angular.JS projects. This project’s purpose is to extract unstructured data from forum posts and turn it into data directly consumable by an API. It’s split into four major pieces:

  • Core — reusable representation objects to keep things tidy between phases
  • Extraction — Extracts data from forum posts by executing a large unwieldy JDBC query and iterating over the results, then pushing each into MongoDB
  • Query — Dropwizard/JAX-RS REST API for exposing the data found by the extraction phase
  • Frontend — Angular.JS frontend app that consumes the REST API

Like many projects, this one began somewhat manually, but today I wanted to automate its deployment. Jenkins is a little heavyweight for a personal project, and I don’t really need a full-blown continuous integration framework. Capistrano is more along the lines of what I wanted, but it is too Rails-flavored to be immediately useful for my needs.

Rake

Rake is simple. Stupid simple. It’s a Ruby DSL for running commands, a step up from a hand-rolled bash script.

Mahout & Dropwizard: Collaborative Filtering & Recommenders

Recommenders are in use all throughout the web. Chances are, you’ve interacted with dozens of recommender systems throughout the course of simply finding this article.

Amazon is built on recommender systems.

Amazon Recommendations

Using techniques such as item-based recommenders and itemset generation, they can not only find what’s relevant to you but also determine things related to what you’re browsing and find items frequently bought with yours (and perhaps provide a combo bonus to shake cobwebs from your wallet).

As with any Machine Learning task, the first step is to define your problem. What are you trying to do? Who is your customer? Is it your Business Intelligence team? Is it the consumer browsing your webapp? Is it your fault detection system attempting to draw relations? Spike detection?

The problem I’m attempting to solve is a somewhat common one. On a site (which will remain nameless for now) members rate threads containing media. For the sake of simplicity, let’s say the threads are albums posted by members along with their reviews of said albums. The members of this site are extremely opinionated, and want good recommendations.
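As a rough illustration of what "item-based" means for this kind of data, the sketch below computes the cosine similarity between two threads from the ratings members gave them. This is plain Java, not Mahout's API; the ItemSimilarity class, the member names, and the ratings are all invented for the example, and cosine similarity is just one of several measures one might choose.

```java
import java.util.Map;

/** Illustrative item-to-item similarity over member ratings (not Mahout code). */
final class ItemSimilarity {
    /**
     * Cosine similarity between two items, each given as a map of member id to rating.
     * Only members who rated both items contribute to the dot product.
     */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            double rating = entry.getValue();
            normA += rating * rating;
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += rating * other;
            }
        }
        for (double rating : b.values()) {
            normB += rating * rating;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an item with no ratings is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical ratings: member id -> rating for each album thread
        Map<String, Double> albumA = Map.of("alice", 5.0, "bob", 3.0, "carol", 4.0);
        Map<String, Double> albumB = Map.of("alice", 4.0, "bob", 2.0, "dave", 5.0);
        System.out.println("similarity = " + cosine(albumA, albumB));
    }
}
```

An item-based recommender would compute scores like this between pairs of threads, then suggest the threads most similar to ones a member already rated highly.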

Groovy on Storm

At work, we end up using Groovy. A lot. In fact, most of our infrastructure is built using some combination of Groovy, Java, and a hint of Clojure. When Nathan Marz’s Storm came out we were understandably excited: forcing Hadoop to attempt to be a real-time system was an absolute nightmare. Don’t try that. It’s not smart. Hadoop is a batch system.

Get out of there cat. You are not a data-centric batch processing solution.

We rearchitected, clearly. Hadoop is being used in the way its creator intended: bulk processing, and we’re playing with Storm as a means of going forward with other cool projects such as moving our realtime backend processes. This includes webhook processing and a lot more.

The first step is getting Groovy to run on Storm. In a fit of sleeplessness, I created a shell project and corresponding pom.xml to get everything going.

https://github.com/xorlev/groovy-storm

A very basic project with a slightly modified ExclamationTopology (no inner classes or semicolons) and a working pom.xml which uses antrun to compile the Groovy code into an executable jar. After all is said and done, one need only run java -jar target/groovy-storm-0.1.0-jar-with-dependencies.jar

Pretty simple. Now I can get started on the hard work and use the storm-contrib packages to feed from SQS (our primary message queue).

 

Data Platforms & ActiveRecord

Working on a data platform for my 9th and final data mining project on the KDD Cup dataset. Transforming four text files of data and relations into a real, queryable dataset. In practice, I’m creating a data warehouse, albeit extraordinarily simplified.

Data transformation is almost never fun. It’s rote work, it’s dry, and it’s time-consuming. Today I decided to move away from my general MO for one-off data transformers and not embed SQL directly into the project.

Instead I decided to work with ActiveRecord. ActiveRecord is a wonderful Object-Relational Mapper pattern in use by many different libraries in tons of languages. Ruby, the general language of my choice in this course (and one of my favorite languages in general, I love Ruby) has a great ActiveRecord package in use by Ruby on Rails.

Initial Post

I’ve read that all programmers should blog. Doing so forces us to think critically about what we do and to explain ourselves – certainly something I lack at times. This is my attempt to rectify my shortcomings in terms of being able to express myself in words.

This blog, for myself, is also somewhat cathartic. I’m a Computer Science student, a scant semester away from graduating from the somewhat known Colorado School of Mines. Though in graduating, I’ll turn around the same day and ensure I’m registered for my classes of Fall 2012. I’ve become entangled in a 5-year combined bachelor’s/master’s program. A frustrating endeavor. I love my field dearly but loathe the oftentimes frustrating coursework and exams which do little to test knowledge or problem solving, only testing recall.

As anyone who knows me will tell you, recall isn’t always my strongest quality.

To whinge about my academic woes is not the focus of this blog. I’m a Software Engineer, Computer Scientist, and tinkerer. I’ve found myself wanting to write down some of my thoughts on software and design practices. My explorations into Scala and Clojure are topics I feel are worth writing about. Maybe one day someone will read my blog and gain some useful insight (or indeed, correct my incorrect insights).

Here’s to hoping I succeed.