{"id":14,"date":"2011-02-15T13:10:30","date_gmt":"2011-02-15T21:10:30","guid":{"rendered":"http:\/\/bellm.org\/blog\/?p=14"},"modified":"2011-02-15T13:10:30","modified_gmt":"2011-02-15T21:10:30","slug":"strata-2011","status":"publish","type":"post","link":"http:\/\/bellm.org\/blog\/2011\/02\/15\/strata-2011\/","title":{"rendered":"Strata 2011"},"content":{"rendered":"<p>The siren song of &#8220;Big Data&#8221; lured me to Santa Clara last week for the first <a href=\"http:\/\/strataconf.com\/strata2011\">O&#8217;Reilly Strata<\/a> conference. I&#8217;m finishing a physics PhD, and I was curious what possibilities might await someone with my background<a id=\"refX\" href=\"#X\"><sup>[1]<\/sup><\/a>. The atmosphere was exciting: there was a feeling of great potential. Here are some of my impressions. (Other perspectives can be found <a href=\"http:\/\/www.sauria.com\/blog\/2011\/02\/07\/strata-2011\/\">here<\/a> and <a href=\"http:\/\/mkaz.com\/archives\/1550\/strata-data-conference-recap\/\">here<\/a>, with a meta-list <a href=\"http:\/\/lanyrd.com\/2011\/strata\/writeups\/\">here<\/a>.)<\/p>\n<h3>Sessions<\/h3>\n<p>I spent the first day in the Data Bootcamp. Since I&#8217;m coming from academic science, my conference goals were to familiarize myself with some of the tools and techniques and to survey the general landscape. Topics like <em>k<\/em>-means and <em>k<\/em>-nearest neighbors are pretty simple to grasp, but the bootcamp&#8217;s quick introduction made basic network analysis seem more approachable. The presenters emphasized that quick-and-dirty, 80% solutions are often highly effective. <a href=\"http:\/\/twitter.com\/jadler\">Joseph Adler&#8217;s<\/a> presentation on Big Data focused on methods for shrinking the problem to something tractable: the overhead and expertise needed for Hadoop <em>et al.<\/em> is often overkill. 
Logistically, the coding exercises were difficult to get running on the fly (hard to download on conference wifi), but I&#8217;m excited to play with the extensive <a href=\"https:\/\/github.com\/drewconway\/strata_bootcamp\">example code<\/a> from the session.<\/p>\n<p>For the second and third days, I was in the morning keynotes and then mainly the &#8220;practitioner&#8221; sessions. I wish I could have gone to a few of the visualization talks! Some of the best sessions, from my perspective:<\/p>\n<ul>\n<li><a href=\"http:\/\/twitter.com\/markmadsen\">Mark Madsen<\/a> (<a href=\"http:\/\/thirdnature.net\/\">Third Nature<\/a>) pointed out that for data to be useful, its insights must be applied within the sociology of an organization: data is political insofar as it guides choices and actions. [<a href=\"http:\/\/www.youtube.com\/watch?v=HwVPxYWDO4w\">video<\/a>]<\/li>\n<li><a href=\"http:\/\/twitter.com\/dpatil\">DJ Patil<\/a> of LinkedIn has built what is by consensus one of the strongest data science teams around. They&#8217;ve got a big network full of high-quality data, and they have organized their data science group as a top-level product team so they can ship products. They launched <a href=\"http:\/\/www.linkedin.com\/skills\/\">LinkedIn Skills<\/a> during the talk, which gives a great way to uncover trends, geographic clusters, key people, and related expertise for all kinds of skills. (e.g., <a href=\"http:\/\/www.linkedin.com\/skills\/skill\/Hadoop\">Hadoop<\/a>.) [<a href=\"http:\/\/www.youtube.com\/watch?v=NXgS1ZQ-Uw0\">video<\/a>]<\/li>\n<li>In contrast, <a href=\"http:\/\/twitter.com\/mrflip\">Flip Kromer<\/a> of Infochimps talked about the realities of &#8220;Data Science on a Shoestring.&#8221; Given the high demand for data scientists, bootstrapped startups can &#8220;a) recruit experienced people at founder equity or b) hire undervalued talent and grow their own.&#8221; 
Lacking an ability to hire those with traditional coding chops, they look for people who &#8220;1) have the &#8216;get shit done&#8217; gene 2) are passionate learners and 3) are fun to work with.&#8221; New hires can fail big, but in parallel: the organization is programmer-fault tolerant. I was interested in their hackerly solution to a human resources problem and their willingness to disregard conventional wisdom about software engineering best practices in order to implement it.<\/li>\n<li><a href=\"http:\/\/twitter.com\/turian\">Joseph Turian<\/a> (<a href=\"http:\/\/metaoptimize.com\/\">MetaOptimize<\/a>) described some exciting new algorithmic developments not yet being applied in practice. The four techniques he described (&#8220;Deep Learning,&#8221; semantic hashing, graph parallelism, and unsupervised semantic parsing) all seem to have huge potential. His <a href=\"http:\/\/metaoptimize.com\/qa\/\">MetaOptimize Q&amp;A<\/a> site is a key resource for those interested in machine learning and natural language processing. [<a href=\"http:\/\/assets.en.oreilly.com\/1\/event\/55\/New%20Developments%20in%20Large%20Data%20Techniques%20Presentation.pdf\">slides<\/a>]<\/li>\n<\/ul>\n<h3>Tools<\/h3>\n<ul>\n<li>Hadoop is the elephant in the room for big data. The ecosystem surrounding it seems much larger than that of any other Map\/Reduce implementation.<\/li>\n<li>Python (plus the NumPy\/SciPy\/Matplotlib stack) is used surprisingly frequently: as a general <em>lingua franca<\/em>, for glue code, and for end-to-end analysis of moderately sized data.<\/li>\n<li>R is a favorite as well, particularly for fancy math\/stats.<\/li>\n<li><a href=\"http:\/\/vis.stanford.edu\/wrangler\/\">DataWrangler<\/a> is an impressive tool debuted at the conference by <a href=\"http:\/\/db.cs.berkeley.edu\/jmh\/\">Joe Hellerstein<\/a> of U.C. Berkeley. 
It simplifies the often-painful process of munging data into a usable form by providing interactive manipulation of the source file and live previews of the transformations. Pointing to a particularly malformed file, Hellerstein said, &#8220;You have PhDs spending time turning this into a matrix.&#8221; DataWrangler should speed that process.<\/li>\n<\/ul>\n<h3>One-liners<\/h3>\n<ul>\n<li>&#8220;clean data &gt; more data &gt; fancier math&#8230; which is sad, because fancy math is awesome.&#8221; &#8212;<a href=\"http:\/\/twitter.com\/hmason\">Hilary Mason<\/a><\/li>\n<li>&#8220;Asking to move or correct your data is like wishing to be invisible: it would be cool, but we haven&#8217;t learned how.&#8221; &#8212;<a href=\"http:\/\/twitter.com\/timoreilly\">Tim O&#8217;Reilly<\/a><\/li>\n<\/ul>\n<h3>Summary<\/h3>\n<p>I came away from the conference with great enthusiasm for the potential of the data business<a id=\"refY\" href=\"#Y\"><sup>[2]<\/sup><\/a>. For those of us who are both technical and quantitative, there are lots of opportunities.  There&#8217;s lots to learn but plenty of resources with which to do so.  Watch this space for my own explorations!<\/p>\n<p><a id=\"X\" href=\"#refX\"><!--more-->Back<\/a> <sup>[1]<\/sup> The excellent overview &#8220;<a href=\"http:\/\/radar.oreilly.com\/2010\/06\/what-is-data-science.html\">What is Data Science?<\/a>&#8221; contains the following, which corresponds to my experience:<\/p>\n<blockquote><p>According to DJ Patil, chief scientist at <a href=\"http:\/\/www.linkedin.com\/\">LinkedIn<\/a> (<a href=\"http:\/\/twitter.com\/dpatil\">@dpatil<\/a>),  the best data scientists tend to be &#8220;hard scientists,&#8221; particularly  physicists, rather than computer science majors. Physicists have a  strong mathematical background, computing skills, and come from a  discipline in which survival depends on getting the most from the data.   They have to think about the big picture, the big problem. 
When you&#8217;ve  just spent a lot of grant money generating data, you can&#8217;t just throw  the data out if it isn&#8217;t as clean as you&#8217;d like.  You have to make it  tell its story.  You need some creativity for when the story the data is  telling isn&#8217;t what you think it&#8217;s telling.<\/p><\/blockquote>\n<p><a id=\"Y\" href=\"#refY\">Back<\/a> <sup>[2]<\/sup> as well as the business of data, and data in business, and business data&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The siren song of &#8220;Big Data&#8221; lured me to Santa Clara last week for the first O&#8217;Reilly Strata conference. I&#8217;m finishing a physics PhD, and I was curious what possibilities might await someone with my background[1]. The atmosphere was exciting: there was a feeling of great potential. Here are some of my impressions. (Other perspectives [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[3],"tags":[],"class_list":["post-14","post","type-post","status-publish","format-standard","hentry","category-data"],"_links":{"self":[{"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/posts\/14","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/comments?post=14"}],"version-history":[{"count":0,"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/posts\/14\/revisions"}],"wp:attachment":[{"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/media?parent=1
4"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/categories?post=14"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bellm.org\/blog\/wp-json\/wp\/v2\/tags?post=14"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}