The siren song of “Big Data” lured me to Santa Clara last week for the first O’Reilly Strata conference. I’m finishing a physics PhD, and I was curious what possibilities might await someone with my background. The atmosphere was exciting: there was a feeling of great potential. Here are some of my impressions. (Other perspectives can be found here and here, with a meta-list here.)
I spent the first day in the Data Bootcamp. Since I’m coming from academic science, my conference goals were to familiarize myself with some of the tools and techniques and to survey the general landscape. Topics like k-means and k-nearest neighbors are pretty simple to grasp, and the bootcamp’s quick introduction made basic network analysis seem more approachable as well. The presenters emphasized that quick-and-dirty, 80% solutions are often highly effective. Joseph Adler’s presentation on Big Data focused on methods for shrinking the problem to something tractable, since the overhead and expertise needed for Hadoop et al. are often overkill. Logistically, the coding exercises were difficult to get running on the fly (the materials were hard to download over conference wifi), but I’m excited to play with the extensive example code from the session.
For the second and third days, I was in the morning keynotes and then mainly the “practitioner” sessions. I wish I could have gone to a few of the visualization talks! Some of the best sessions, from my perspective:
- Mark Madsen (Third Nature) pointed out that for data to be useful, its insights must be applied within the sociology of an organization: data is political insofar as it guides choices and actions. [video]
- DJ Patil of LinkedIn has built what is by consensus one of the strongest data science teams around. They’ve got a big network full of high-quality data, and they have organized their data science group as a top-level product team so they can ship products. They launched LinkedIn Skills during the talk; it offers a great way to uncover trends, geographic clusters, key people, and related expertise for all kinds of skills (e.g., Hadoop). [video]
- In contrast, Flip Kromer of Infochimps talked about the realities of “Data Science on a Shoestring.” Given the high demand for data scientists, bootstrapped startups can “a) recruit experienced people at founder equity or b) hire undervalued talent and grow their own.” Lacking an ability to hire those with traditional coding chops, they look for people who “1) have the ‘get shit done’ gene 2) are passionate learners and 3) are fun to work with.” New hires can fail big, but in parallel: the organization is programmer-fault tolerant. I was interested in their hackerly solution to a human resources problem and their willingness to disregard conventional wisdom about software engineering best practices in order to implement it.
- Joseph Turian (MetaOptimize) described some exciting new algorithmic developments not yet being applied in practice. The four techniques he described (“Deep Learning,” semantic hashing, graph parallelism, and unsupervised semantic parsing) all seem to have huge potential. His MetaOptimize Q&A site is a key resource for those interested in machine learning and natural language processing. [slides]
- Hadoop is the elephant in the room for big data. The ecosystem surrounding it seems much larger than that of any other Map/Reduce implementation.
- Python (plus the NumPy/SciPy/Matplotlib stack) is used surprisingly frequently: as a general lingua franca, for glue code, and for end-to-end analysis of moderately sized data.
- R is a favorite as well, particularly for fancy math/stats.
- DataWrangler is an impressive tool debuted at the conference by Joe Hellerstein of U.C. Berkeley. It simplifies the often-painful process of munging data into a usable form by providing interactive manipulation of the source file and live previews of the transformations. Pointing to a particularly malformed file, Hellerstein said, “You have PhDs spending time turning this into a matrix.” DataWrangler should speed that process.
- “clean data > more data > fancier math… which is sad, because fancy math is awesome.” —Hilary Mason
- “Asking to move or correct your data is like wishing to be invisible: it would be cool, but we haven’t learned how.” —Tim O’Reilly
I came away from the conference with great enthusiasm for the potential of the data business. For those of us who are both technical and quantitative, there are lots of opportunities. There’s lots to learn but plenty of resources with which to do so. Watch this space for my own explorations!
According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background and computing skills, and they come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.