The Boston Predictive Analytics meetup group is a mix of statisticians, software engineers, computer scientists, and analytics professional who occasionally get together to share their knowledge. If you’re in the Boston, MA area and care about analytics, you have to check it out. Big, big kudos to the organizer, John Verostek, for pulling the group together.
Last Saturday, March 10, I attended the Big Data Workshop, my first event as part of the group. I went into it with a vague idea of what Hadoop was and some awareness of the ecosystem of services that exists today and came out much better educated. What follows is a selection of the material presented, along with some of my notes.
What is Hadoop, by Vipin Sachdeva
- Hadoop provides the software infrastructure that enables users to tackle massive data jobs using parallel processing. Hadoop manages data replication, server failures, jobs, and the like automatically.
- The Hadoop Distributed File System (HDFS) is the core of the system. It is a distributed file system that allows for high throughput processing of data by replicating data across multiple nodes (machines).
- HDFS is made up of 1) a “master” namenode, which oversees all other nodes, manages data replication and jobs, etc., and 2) datanodes, which actually store data.
- MapReduce is the programming framework used by Hadoop to process jobs. The framework can be implemented in any number of languages, including Python and Java.
- Jobs implemented in MapReduce must be reducible to key-value pairs.
- MapReduce has several steps:
- Map: data blocks of predetermined size are sent to datanodes known as “mappers.”
- Sort (i.e., Shuffle): datanodes pass all of keys of a certain type to “reducer” nodes.
- Reduce: reducer nodes perform some operation on the key-value pairs to summarize the data and each output a data set.
- The results of a MapReduce operation can (and often are) fed into multiple other MapReduce operations for further summary.
- MapReduce– at least as was demonstrated– is implemented as separate scripts: one for mapping, and one for reducing.
Cloud Infrastructure and Market, by Jim O’Neil
Jim provided an overview of the market, and then went into live demos of Google App Engine, Amazon Elastic MapReduce, and Microsoft’s Windows Azure platform. The latter two were especially impressive from the standpoint of ease of use (once the mapper and reducer scripts are written). He borrowed this diagram from the National Institute of Standards and Technology
Demo of MortarDB, by K Young
K is MortarDB’s CEO, and he gave us a live demo of their platform. They use a stack of Amazon Web Services, Hadoop/Pig, and Python to make Hadoop easier to use. The platform was certainly impressive, but there was not much to share in terms of general knowledge.
Demo of Statistical Analysis using Cloud Numerics on Azure, by Roope Astala
Roope shared a demo of the types of statistical analyses made possible using SQL on top of Microsoft Azure cloud platform. The complete example can also be found here. While the functionality is no doubt really useful, the setup is complicated– very complicated. Have a look at that previous link to see what I mean.
Web Analytics on Hadoop, by Michael Sun
Michael works at CBS Interactive and helped deploy Hadoop/Python as a substitute to their previous custom ETL solution for populating a database used for web analytics. By the middle of 2012, CBS will be sitting on about 1 petabyte (1,000 terabytes) of data on a cluster of 80 nodes. Michael especially praised Hadoop for its painless scalability and zero-dollar licensing fees (it’s open source, after all). Below is a figure that summarizes what CBS does with its web data.
Big Data Step-by-Step, by Jeffrey Breen
Jeffrey covered a handful of topics, giving away a lot of code in the process. His talk was less about huge data sets and more about tackling data of a smaller scale (roughly, tens of gigabytes).
- Hadoop can be used in R via the rmr package. He walked through an example during the workshop.
- The infrastructure for deploying powerful, instantly scalable clusters is available through Amazon.
- In cases where we need more RAM than we have on hand, we can launch an Elastic Cloud Compute (EC2) instance with a Linux image and install RStudio. Our data can be stored on Amazon S3.
- In cases where we need a lot more computing power, we can install whirr and use it to launch an Amazon cluster without setting up (and taking down) each machine.





