Showing posts from November, 2011

Seven years of blogging in less than 500 words

In a couple of days, this blog will be 7 years old. Hard to believe so much time has passed since my first Hello World post.
I keep reading that blogging is on the wane, and it's true, mainly because of the popularity of Twitter. But I strongly believe that blogging is still important, and that more people should do it. For me, it's a way to give back to the community. I can't even remember how many times I've found solutions to my technical problems by reading a blog post. I personally try to post something on my blog every single time I solve an issue that I've struggled with. If you post documentation to a company wiki, I urge you to also blog about it publicly (assuming it's not confidential) – think of it as public documentation that can help both you and others.
Blogging is also a great way to further your career. Back in September 2008 I blogged about my experiences with EC2. I had launched an m1.small instance and I had started to play with it. L…

Troubleshooting memory allocation errors in Elastic MapReduce

Yesterday we ran into an issue with some Hive scripts running within an Amazon Elastic MapReduce cluster. Here's the error we got:

Caused by: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(
    ... 11 more
Caused by: Cannot run program "bash": error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(
    at org.apache.hadoop.util.Shell.runCommand(
    at org.apache.hadoop.fs.DF.getAvailable(
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(
    …

Experiences with Amazon Elastic MapReduce

We started to use AWS Elastic MapReduce (EMR) in earnest a short time ago, with the help of Bradford Stephens from Drawn to Scale. We needed somebody to jumpstart our data analytics processes and workflow, and Bradford's help was invaluable. At some point we'll probably build our own Hadoop cluster either in EC2 or in-house, but for now EMR is doing the job just fine.

We started with an EMR cluster containing a master + 5 slave nodes, all m1.xlarge. We still have that cluster up and running, but in the meantime I've also experimented with launching clusters, running our data analytics processes on them, then shutting them down -- which is the 'elastic' type of workflow that takes full advantage of the pay-per-hour model of EMR.

Before I go into the details of launching and managing EMR clusters, here's the general workflow that we follow for our data analytics processes on a nightly basis:

We gather data from various sources such as production databases, ad i…