Data analysis with awk

awk is a programming language available with most *nix systems and made available for Windows via cygwin. Some features of awk:

  • Operates directly on text files, line by line, without needing to load them into memory. Because of this, it’s also…
  • Very fast and works very well on large data sets (multi-gigabyte)
  • Can perform a lot of SQL-like operations without the overhead of a database– think counts, sums, conditional subsetting of data (including with regular expressions), etc.

On the down side, awk interprets delimiters literally. If you have a CSV that has commas in the middle of its fields, awk will interpret those commas as field separators. The only way around it that I have seen it to use some clever regex.

Here are is a sample workflow, combining awk with command-line sorting and deduplication tools. The goal is obtain all records that match a set of IDs that meet a particular condition:

  • Print the IDs (column 6) where a certain field is >=1. To print the whole record, swap $6 for $0:
cat feed_20130401.csv | awk -F "," '$17 >= 1 {print $6}' > out.csv
  • Sort and deduplicate a list, piping the output from sort directly into uniq for deduping:
sort out.csv | uniq -u > out_sorted_deduped.csv
  • Get all matches to the deduplicated set of IDs:
awk -F "," 'FNR==NR{a[$0];next}($6 in a)' out_sorted_deduped.csv feed_20130401.csv > matches.csv

Launch an AWS EC2 Ubuntu instance, SSH into it (from a Mac), and install Octave

Launch the Linux instance

1. Create a new key pair to obtain the .pem file with your key (if you’ve misplaced the old one).

2. Ensure only your user can access that .pem file by issuing the following from a terminal window:

chmod 600 <path-to-key-file>

3. Launch the instance from EC2, and note which key pair you used.

4. From a terminal window, issue:

ssh -i <path-to-key-file> root@<instance-name>

If running Ubuntu, swap ‘root’ for ‘ubuntu’.

Install Octave

Detailed instructions are available at the Octave wiki, but the short version is:

sudo apt-add-repository ppa:octave/stable
sudo apt-add-repository ppa:octave/unstable
sudo apt-get update
sudo apt-get install octave

Voila. Issue ‘octave’ to launch Octave.

Multiple correspondence analysis: principal components analysis for categorical data

Principal components analysis is designed to deal only with continuous variables. Unless the values of a categorical variable carry ordinal significance (think Likert scale), PCA may not yield useful results. In particular, variables may load very differently depending on arbitrary choices regarding how to handle the dummies (e.g., which, if any, to exclude). Multiple correspondence analysis is an alternative method for dealing with the dimensionality reduction problem in the context of categorical variables, and it can be easily performed in SAS.

Crisp, high-quality mapping with R and ggplot2

This is a bit of code I worked on to generate several state-level maps for a client deck. Perhaps the coolest part is that, if you spend a little time to consolidate all of the data into a single input data set, iterating these maps for different variables (in my case, different products) is a snap. Here is what a sample map might look like, using code borrowed from Data in Colour. I mentioned ggplot2 in an earlier post, in which I pondered what an affordable enterprise data viz “stack” might look like.

The code below differs a bit from the one used to generate the above map, but the end product is the same. This works for me as copy/paste after changing the source for the CSV file and installing the libraries.

 # Install these libs before using this code
 # assumes your data set is comma-delimited and structured as follows:
 # state_name, overall
 # michigan, 0.1
 # alabama, 0.2
 # connecticut, 0.15
 # etc

My data viz technology stack proposal

With data viz taking on a new and unexpected urgency at work, I did a bit of research to figure out what is available in terms of tools. This excellent post at pulls together a bunch of the more technical and open source tools, while this post focuses on some of the more commercial alternatives.

All of this got me thinking about the optimal set of tools to have on hand. Of course everyone’s needs are different, but I think the data viz “stack” below provides a solid, affordable progression of tools that many data-driven organizations will find useful.

  • Excel. This covers most bases, and it’s basically on everyone’s work machine. Excel charts can look very sharp if you make a handful of adjustments to them.
  • Tableau (plus Google’s API-based charting tools). Fancy Excel plus mapping, for when you need a nice dashboard and graphs that aren’t wonky by default. Tableau can load up Google’s tools via HTML referencing, and Google has some slick charting tools available. As a matter of fact, Tableau can show anything on the web as part of a dashboard. Tableau licenses are about $1,000/person last I checked.
  • R. For my use cases, R will be most useful for more unusual viz or graphs that require a lot of customization; for instance, heat maps (I can make a dandy heat map in Excel, but it’s a pain to revise). Here are some examples at Flowing Data. These can achieve a degree of polish far above and beyond Tableau, as customization gets very fine-grained. I’ve had a very good experience with ggplot2, for instance. R is very, extremely free.
  • Processing. For projects requiring really bespoke implementations. The world is your visualization oyster if you’re working in Processing, but I hope your oyster has a lot of time in it because it’s probably going to take you a while to develop something. But it is free!

Note that I listed these in order of visualization complexity: simpler needs can be covered by Excel, while the most challenging of projects will likely need a highly-customized implementation in Processing. If I had to hazard a guess, I would say maybe 80% of my visualization needs could be served adequately by Excel, with the last 20% being by and large solvable with Tableau.

Training data set balancing and rare event detection

A few useful perspectives and experiments related to detecting rare events. I reproduce the papers’ abstracts (or at least part of the abstracts) below.

Data Set Balancing, David L. Olson

This paper conducts experiments with three skewed data sets, seeking to demonstrate problems when skewed data is used, and identifying counter problems when data is balanced. The basic data mining algorithms of decision tree, regression-based, and neural network models are considered, using both categorical and continuous data. Two of the data sets have binary outcomes, while the third has a set of four possible outcomes. Key findings are that when the data is highly unbalanced, algorithms tend to degenerate by assigning all cases to the most common outcome. When data is balanced, accuracy rates tend to decline. If data is balanced, that reduces the training set size, and can lead to the degeneracy of model failure through omission of cases encountered in the test set. Decision tree algorithms were found to be the most robust with respect to the degree of balancing applied.

Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression, Oomen et al.

We hypothesize that the predictive capability of the model is related to the sampling bias associated with the data so that the MLLR [LF: maximum likelihood logistic regression] model has perfect predictability when the data have no sampling bias. We test our hypotheses using two simulated datasets with class distributions that are 50:50 and 80:20, respectively. We construct a suite of controlled experiments by extracting multiple samples with varying class imbalance and sampling bias from the two simulated datasets and fitting MLLR models to each of these samples. The experiments suggest that it is important to develop a sample that has the same class distribution as the original population rather than ensuring that the classes are balanced. Furthermore, when sampling bias is reduced either by using over-sampling or under-sampling, both sampling techniques can improve the predictive capability of an MLLR model.

Logistic Regression in Rare Events Dat, by Gary Kind and Lanche Zeng

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction on nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.