Data analysis with awk

awk is a programming language that ships with most *nix systems and is available for Windows via Cygwin. Some features of awk:

  • Operates directly on text files, line by line, without needing to load them into memory. Because of this, it’s also…
  • Very fast and works very well on large data sets (multi-gigabyte)
  • Can perform a lot of SQL-like operations without the overhead of a database: think counts, sums, conditional subsetting of data (including with regular expressions), etc.

On the downside, awk interprets delimiters literally. If you have a CSV with commas inside its fields, awk will interpret those commas as field separators. The only way around this that I have seen is to use some clever regex.
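If GNU awk (gawk) is available, one such workaround is its FPAT variable, which defines fields by what they contain rather than by the separator. The pattern below is a sketch (gawk-specific, and it assumes quoted fields are non-empty and contain no escaped quotes):

```shell
# gawk-only: a field is either a run of non-commas or a quoted string,
# so commas inside the quotes no longer split the record
echo 'id1,"Smith, John",42' | gawk -v FPAT='([^,]+)|("[^"]+")' '{print $2}'
# prints "Smith, John" (quotes included)
```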

Here is a sample workflow, combining awk with command-line sorting and deduplication tools. The goal is to obtain all records that match a set of IDs that meet a particular condition:

  • Print the IDs (column 6) of every row where column 17 is >= 1. To print the whole record instead, swap $6 for $0:
awk -F "," '$17 >= 1 {print $6}' feed_20130401.csv > out.csv
  • Sort and deduplicate the list, piping the output from sort directly into uniq. (Note: plain uniq collapses adjacent duplicates into one copy; uniq -u would instead discard every line that appears more than once.)
sort out.csv | uniq > out_sorted_deduped.csv
  • Get all matches to the deduplicated set of IDs:
awk -F "," 'FNR==NR{a[$0];next}($6 in a)' out_sorted_deduped.csv feed_20130401.csv > matches.csv
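For reference, the three steps above can be condensed into two commands (sort -u replaces the sort-into-uniq pair; the column numbers and feed filename follow the example above):

```shell
# steps 1 + 2: extract matching IDs and dedupe in one pipeline
awk -F "," '$17 >= 1 {print $6}' feed_20130401.csv | sort -u > ids_deduped.csv
# step 3: pull every record whose column 6 is in the deduped ID list
awk -F "," 'FNR==NR{a[$0];next}($6 in a)' ids_deduped.csv feed_20130401.csv > matches.csv
```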

Launch an AWS EC2 Ubuntu instance, SSH into it (from a Mac), and install Octave

Launch the Linux instance

1. Create a new key pair to obtain the .pem file with your key (if you’ve misplaced the old one).

2. Ensure only your user can access that .pem file by issuing the following from a terminal window:

chmod 600 <path-to-key-file>

3. Launch the instance from EC2, and note which key pair you used.

4. From a terminal window, issue:

ssh -i <path-to-key-file> root@<instance-name>

If running Ubuntu, swap ‘root’ for ‘ubuntu’.

Install Octave

Detailed instructions are available at the Octave wiki, but the short version is:

sudo apt-add-repository ppa:octave/stable
# or, for the latest packages: sudo apt-add-repository ppa:octave/unstable
sudo apt-get update
sudo apt-get install octave

Voila. Issue ‘octave’ to launch Octave.

Multiple correspondence analysis: principal components analysis for categorical data

Principal components analysis is designed to deal only with continuous variables. Unless the values of a categorical variable carry ordinal significance (think Likert scale), PCA may not yield useful results. In particular, variables may load very differently depending on arbitrary choices regarding how to handle the dummies (e.g., which, if any, to exclude). Multiple correspondence analysis is an alternative method for dealing with the dimensionality reduction problem in the context of categorical variables, and it can be easily performed in SAS.

Crisp, high-quality mapping with R and ggplot2

This is a bit of code I worked on to generate several state-level maps for a client deck. Perhaps the coolest part is that, if you spend a little time consolidating all of the data into a single input data set, iterating these maps for different variables (in my case, different products) is a snap. Here is what a sample map might look like, using code borrowed from Data in Colour. I mentioned ggplot2 in an earlier post, in which I pondered what an affordable enterprise data viz “stack” might look like.

The code below differs a bit from the one used to generate the above map, but the end product is the same. This works for me as copy/paste after changing the source for the CSV file and installing the libraries.

 # Install these libs before using this code
 library(ggplot2)
 library(maps)
 # assumes your data set is comma-delimited and structured as follows:
 # state_name, overall
 # michigan, 0.1
 # alabama, 0.2
 # connecticut, 0.15
 # etc
 dat <- read.csv("state_data.csv")  # change this path to your CSV
 states <- map_data("state")        # lowercase full state names in 'region'
 choro <- merge(states, dat, by.x = "region", by.y = "state_name")
 choro <- choro[order(choro$order), ]  # restore polygon drawing order
 ggplot(choro, aes(long, lat, group = group, fill = overall)) + geom_polygon()