Boston Predictive Analytics Big Data Workshop Highlights

Mar 15 2012

The Boston Predictive Analytics meetup group is a mix of statisticians, software engineers, computer scientists, and analytics professionals who occasionally get together to share their knowledge. If you’re in the Boston, MA area and care about analytics, you have to check it out. Big, big kudos to the organizer, John Verostek, for pulling the group together.

Last Saturday, March 10, I attended the Big Data Workshop, my first event as part of the group. I went into it with a vague idea of what Hadoop was and some awareness of the ecosystem of services that exists today and came out much better educated. What follows is a selection of the material presented, along with some of my notes.

What is Hadoop, by Vipin Sachdeva

  • Hadoop provides the software infrastructure that enables users to tackle massive data jobs using parallel processing. Hadoop manages data replication, server failures, jobs, and the like automatically.
  • The Hadoop Distributed File System (HDFS) is the core of the system. It is a distributed file system that allows for high throughput processing of data by replicating data across multiple nodes (machines).
  • HDFS is made up of 1) a “master” namenode, which oversees all other nodes and manages data replication, and 2) datanodes, which actually store data. Job scheduling is handled by a separate JobTracker service rather than by the namenode itself.
  • MapReduce is the programming framework used by Hadoop to process jobs. MapReduce jobs can be written in any number of languages, including Python and Java.
  • Jobs implemented in MapReduce must be reducible to key-value pairs.
  • MapReduce has several steps:
    • Map: data blocks of predetermined size are sent to datanodes known as “mappers.”
    • Sort (i.e., Shuffle): datanodes pass all keys of a certain type to “reducer” nodes.
    • Reduce: reducer nodes perform some operation on the key-value pairs to summarize the data and each output a data set.
  • The results of a MapReduce operation can be (and often are) fed into multiple other MapReduce operations for further summary.
  • MapReduce, at least as it was demonstrated, is implemented as separate scripts: one for mapping and one for reducing.
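
To make the three steps above concrete, here is a minimal single-machine sketch in Python. It is a word count of my own, not code from the workshop, and it only imitates the map, shuffle, and reduce flow in memory rather than running on Hadoop. The function names (mapper, shuffle, reducer, word_count) are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) key-value pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Sort/shuffle step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sorting by key mirrors the ordering that reducers would see.
    return dict(sorted(grouped.items()))


def reducer(key, values):
    """Reduce step: summarize all values for one key (here, by summing)."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big jobs", "data jobs data"]
    print(word_count(sample))  # {'big': 2, 'data': 3, 'jobs': 2}
```

On a real cluster the mapper and reducer would run as separate scripts on different nodes, and Hadoop would handle the shuffle between them. The sketch just shows why the job has to be expressible as key-value pairs.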

 

Cloud Infrastructure and Market, by Jim O’Neil

Jim provided an overview of the market, and then went into live demos of Google App Engine, Amazon Elastic MapReduce, and Microsoft’s Windows Azure platform. The latter two were especially impressive from the standpoint of ease of use (once the mapper and reducer scripts are written). He borrowed this diagram from the National Institute of Standards and Technology.

 

Demo of MortarDB, by K Young

K is MortarDB’s CEO, and he gave us a live demo of their platform. They use a stack of Amazon Web Services, Hadoop/Pig, and Python to make Hadoop easier to use. The platform was certainly impressive, but there was not much to share in terms of general knowledge.

 

Demo of Statistical Analysis using Cloud Numerics on Azure, by Roope Astala

Roope shared a demo of the types of statistical analyses made possible using SQL on top of the Microsoft Azure cloud platform. The complete example can also be found here. While the functionality is no doubt really useful, the setup is complicated, very complicated. Have a look at that previous link to see what I mean.

 

Web Analytics on Hadoop, by Michael Sun

Michael works at CBS Interactive and helped deploy Hadoop/Python as a substitute for their previous custom ETL solution for populating a database used for web analytics. By the middle of 2012, CBS will be sitting on about 1 petabyte (1,000 terabytes) of data on a cluster of 80 nodes. Michael especially praised Hadoop for its painless scalability and zero-dollar licensing fees (it’s open source, after all). Below is a figure that summarizes what CBS does with its web data.

 

Big Data Step-by-Step, by Jeffrey Breen

Jeffrey covered a handful of topics, giving away a lot of code in the process. His talk was less about huge data sets and more about tackling data of a smaller scale (roughly, tens of gigabytes).

  • Hadoop can be used in R via the rmr package. He walked through an example during the workshop.
  • The infrastructure for deploying powerful, instantly scalable clusters is available through Amazon.
    • In cases where we need more RAM than we have on hand, we can launch an Elastic Compute Cloud (EC2) instance with a Linux image and install RStudio. Our data can be stored on Amazon S3.
    • In cases where we need a lot more computing power, we can install whirr and use it to launch an Amazon cluster without setting up (and taking down) each machine.


In which Brandy makes wedding invitations

Feb 04 2012

image


Calling R from within SAS: a macro solution

Jan 26 2012

This is one easy way to execute R scripts from within SAS, brought to us by Xin Wei:

For SAS users, the macro is a huge productivity booster, allowing one to easily complete data management and/or partial data analysis in SAS, skip out quickly to R for analyses that are awkward or impossible in SAS, then return to SAS for completion. For people in industry, this may also ease integrating R into documentation systems built for SAS code.

Original post here, code here, full write-up here.


Massachusetts firms ahead of the game in tackling big data challenges

Jan 23 2012

A friend points me to a Boston Globe article that discusses the industry in Massachusetts, surveys the market, and assesses its growth prospects. Besides the usual predictions of explosive growth, I found this interesting:

Ninety percent of the data stored on hard drives, on Internet servers, or in big databases has been collected in just the past two years, according to IBM.

Whole article here.


The first real snow of the season in Cambridge

Jan 20 2012

image


Adweek – Whose life is it, anyway?

Jan 18 2012

One of the best lines I read recently was something like, “if you’re not paying for it, you’re not the customer– you’re the product.” As we continue to enjoy increasingly sophisticated and customized web services, we have to be conscious of the fact that we enter into an implicit bargain: you give me a neat service, and I give you my personal information and (maybe more importantly) the permission to watch and record what I do.

Ki Mae Heussner at Adweek provides a good survey of where we are in terms of web privacy and where different camps are putting their money for the future. It’s clear why:

… Each year, companies in the U.S. spend more than $2 billion on third-party consumer data, according to Forrester Research. Add in the money spent on credit data, market research and other kinds of derived information, the research firm says, and you’re looking at a multibillion dollar industry. In fact, the volume of digital data created by consumers is growing at such a fast clip that the World Economic Forum and other futurists have called personal data the “new oil.”

I recommend the whole article. The degree of predictive specificity these data will enable will be unparalleled.


A new resource for teaching yourself SAS outside of school or work

Jan 18 2012

Via Chris Hemedinger, we learn that SAS is now offering relatively cheap access to a SAS-hosted, learning-only version of Base SAS for as little as $200 over 6 months. This is invaluable, since at the time of this writing, a desktop Base SAS license is going for $8,500 per year. If you’re looking to learn SAS, this is your ticket to ride:

This new SAS OnDemand offering complements the SAS OnDemand for Academics offering that has evolved over the past few years.  It’s SAS, running on “the cloud”, and you use a supplied version of SAS Enterprise Guide to access it (along with a good collection of sample data).  With SAS Enterprise Guide you can exercise most of the SAS features that you would need to practice for any career objective: learn SAS programming, hone your skills in business analytics, or use high-end statistical methods to analyze data.

 

 


An afternoon at Great Brook Farm State Park, MA

Jan 17 2012

image


Yahoo! Developer Network’s introduction to Hadoop

Jan 16 2012

One of my goals for the year is to be conversant in Hadoop. What is Hadoop?

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

Yahoo’s introduction to Hadoop is very, very good– to date, the best I’ve found on the web. The first module is not only a good introduction to Hadoop, but more generally to the direction in which seriously scalable data management and processing (peta/exabyte level) is moving.


What is Big Data?

Jan 16 2012

Over at O’Reilly, Edd Dumbill throws down an excellent introduction to the topic of big data:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. Big data processing is eminently feasible for even the small garage startups, who can cheaply rent server time in the cloud.

 
