Andrew Ng on deep learning at Baidu

Baidu is an incredibly nimble company. Stuff just moves, decisions get made incredibly quickly. There’s a willingness to try things out to see if they work. I think that’s why Baidu, as far as I can tell, has shipped more deep-learning products than any other company, including things at the heart of our business model. Our advertising today is powered by deep learning.

Read his full interview with the WSJ. This is, of course, the same Andrew Ng of Stanford machine learning fame.

Contributor by Google: pay not to see ads

For all of the complaints about ads being a bad way of supporting the “free” web, it’s the ad technology companies that are innovating.

The Mountain View, Calif.-based tech giant is testing a program called Contributor by Google with 10 publishers, including The Onion, Imgur and Mashable, in which users can pay $1, $2 or $3 a month to not see ads. A “thank you” message will appear in place of the promos, and the user will pitch in pennies from their $1 to $3 allotment every time they visit one of the sites.

More at Adweek. And here is the official site.

FnordMetric ChartSQL: Charting using SQL

FnordMetric ChartSQL allows you to write SQL queries that return charts instead of tables. The charts are rendered as SVG vector graphics and can easily be embedded into any website and customized with css in order to build beautiful dashboards

Lots of examples here, all of them with some degree of interactivity. The framework comes bundled with a standalone HTTP server app, and the charts are simple and effective:

[Image: ChartSQL example chart]

More here.

Facebook’s top open data problems

It’s a huge post discussing data problems big and small, as well as their existing data stores. The big data challenges include:

  • Sampling at logging time in a distributed environment
  • Trading storage space and CPU
  • Reliability of pipelines
  • Globally distributed warehouse
  • Time series correlation and anomaly detection

Much, much more here.

Data analysis with awk

awk is a programming language that ships with most *nix systems and is available for Windows via Cygwin. Some features of awk:

  • Operates directly on text files, line by line, without needing to load them into memory. Because of this, it’s also…
  • Very fast and works very well on large data sets (multi-gigabyte)
  • Can perform a lot of SQL-like operations without the overhead of a database: think counts, sums, conditional subsetting of data (including with regular expressions), etc.
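As a rough illustration of those SQL-like operations, here is a minimal sketch on a made-up file. The file name sales.csv and its three columns (id, region, amount) are assumptions invented for this example, not part of any real feed:

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Hypothetical sample data: id,region,amount
printf '%s\n' \
  '1,east,10' \
  '2,west,25' \
  '3,east,5' \
  '4,north,40' > sales.csv

# Roughly: SELECT COUNT(*), SUM(amount) WHERE amount >= 10
awk -F "," '$3 >= 10 {n++; total += $3} END {print n, total}' sales.csv > summary.txt

# Roughly: SELECT region, SUM(amount) GROUP BY region
# (piped through sort because awk does not guarantee array order)
awk -F "," '{sum[$2] += $3} END {for (r in sum) print r "," sum[r]}' sales.csv | sort > by_region.txt

# Roughly: WHERE region matches a regular expression
awk -F "," '$2 ~ /^(east|west)$/' sales.csv > subset.csv
```

Each command reads the file once, line by line, so the same pattern should scale to much larger inputs.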

On the downside, awk interprets delimiters literally. If you have a CSV with commas inside its fields, awk will treat those commas as field separators. The only way around this that I have seen is to use some clever regex.
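One alternative to a regex is to scan each line character by character and track whether you are inside double quotes. The sketch below does that with a helper function, split_csv, which is a name made up for this example; the sample file quoted.csv is also hypothetical. It is a simplification: it drops the quote characters and does not handle escaped quotes ("") or fields that span several lines.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Hypothetical sample: the second field of line 1 has a comma inside quotes
printf '%s\n' '1,"Smith, John",42' '2,Jane,17' > quoted.csv

awk '
# Split a CSV line into out[1..n], ignoring commas inside double quotes.
function split_csv(line, out,    i, c, n, inq) {
    n = 1; out[1] = ""; inq = 0
    for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") {
            inq = !inq              # toggle quoted state, drop the quote itself
        } else if (c == "," && !inq) {
            out[++n] = ""           # a real separator: start the next field
        } else {
            out[n] = out[n] c       # ordinary character: append to current field
        }
    }
    return n
}
{
    split_csv($0, f)
    print f[2] "|" f[3]             # print fields 2 and 3, pipe-separated
}' quoted.csv > parsed.txt
```

With a plain -F ",", the first line would split into four fields instead of three.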

Here is a sample workflow, combining awk with command-line sorting and deduplication tools. The goal is to obtain all records that match a set of IDs that meet a particular condition:

  • Print the IDs (column 6) where a certain field is >=1. To print the whole record, swap $6 for $0:
awk -F "," '$17 >= 1 {print $6}' feed_20130401.csv > out.csv
  • Sort and deduplicate a list, piping the output from sort directly into uniq for deduping:
sort out.csv | uniq > out_sorted_deduped.csv
  • Get all matches to the deduplicated set of IDs:
awk -F "," 'FNR==NR{a[$0];next}($6 in a)' out_sorted_deduped.csv feed_20130401.csv > matches.csv
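To see how the pieces fit together, here is the same workflow as a minimal sketch on toy data. The file feed.csv is invented for this example, with the ID in column 1 (standing in for column 6) and the condition field in column 3 (standing in for column 17):

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Hypothetical feed: id,event,score
printf '%s\n' \
  'a1,click,0' \
  'b2,view,2' \
  'a1,view,1' \
  'c3,click,0' \
  'b2,click,0' \
  'b2,view,3' > feed.csv

# Step 1: IDs of rows where the score is >= 1 (b2 appears twice here)
awk -F "," '$3 >= 1 {print $1}' feed.csv > ids.csv

# Step 2: sort, then keep one copy of each ID
sort ids.csv | uniq > ids_deduped.csv

# Step 3: FNR==NR is only true while reading the first file, so each ID
# is stored as an array key; for the second file, print rows whose ID is a key
awk -F "," 'FNR==NR {a[$0]; next} ($1 in a)' ids_deduped.csv feed.csv > matches.csv
```

Here matches.csv ends up with every a1 and b2 row, including the ones whose score is 0, while the c3 row is left out. Note that plain uniq keeps one copy of each ID, whereas uniq -u would drop any ID that appears more than once.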