/data/universe/

Category: Data

Bitten by Sample Selection Bias

At this week’s meeting of the High Energy Astrophysics Division of the American Astronomical Society, I learned that one of the teams analyzing data from a NASA observatory had run into trouble with their machine learning classifier.  Their problem illustrates one of the particular challenges of machine learning in astronomy: sample selection bias. At first […]

The Data Science Core Curriculum

Jake Klamka spoke at Caltech a few months back about his Insight Data Science Fellowship, a program designed to help science PhDs transition into jobs in data science.  The program guides scientists in packaging skills they already have so that employers can easily see the relevance and value to their business.  Jake’s own initial difficulty getting […]

Tracing the Changing State of the Union with Text Analysis

U.S. Presidents since George Washington have delivered State of the Union addresses each year to describe the nation’s condition and prioritize future action.  Can we glean historical patterns from the texts?  Do presidents speak similarly in times of war or depression?  Do Republicans and Democrats emphasize different words?  How does the evolution of American English […]

How Big is the Market for Big Data?

I am bullish on the potential of increasingly pervasive data storage and analysis (one sense of “Big Data”) to improve outcomes in business, government, education, and our personal lives.  The cost of storage is plummeting (though providing useful access to that data has a nontrivial cost).  Faster computers, better algorithms, and increasingly experienced data scientists […]

Arguing to the Algorithm: Machine-Learned Scoring of Student Essays

This week marks the end of the Kaggle Automated Essay Scoring Competition.  I participated as a way to build my machine learning skills after learning the basics in Andrew Ng’s online class. The goal of the competition was to develop algorithms that could automatically score student essays for standardized achievement tests.  Kaggle (and the sponsoring […]

A Role for Public Data Competitions in Scientific Research?

The public data challenge has emerged as one response to the need for sophisticated data analysis in many sectors.  The prototypical example of these competitions is the Netflix Prize, which awarded $1M for improved predictions of user movie ratings[1]. Kaggle provides a platform for organizations to sponsor their own challenges.  The most high-profile is currently […]

Visualizing Social Networks III: Twitter

Part 3 of 3.  Return to Part 2. The Twitter network differs from Facebook and LinkedIn because it does not require relationships to be reciprocal.  Accordingly, I can follow users whose updates I find interesting or valuable without any expectation that they will do the same.  (In network parlance, this creates a “directed” graph, in […]

Visualizing Social Networks II: Facebook

Part 2 of 3.  Return to Part 1. As with LinkedIn, the graph of friendships in Facebook generally corresponds to relationships established in the real world.  Due in part to its more broad-based entertainment appeal, Facebook presently has about six times more registered and active users than LinkedIn. I joined Facebook in early 2004 while I was […]

Visualizing Social Networks I: LinkedIn

Part 1 of 3.  Humans are ubiquitously social animals.  Even our identities are socially informed: when asked to describe ourselves, many of us would mention familial relationships (“a husband,” “a mother,” “a sister”), the culture we are from, or the professional community implied by our work.  John Donne’s famous quote “No man is an island” […]

Strata 2011

The siren song of “Big Data” lured me to Santa Clara last week for the first O’Reilly Strata conference.  I’m finishing a physics PhD, and I was curious what possibilities might await someone with my background[1].  The atmosphere was exciting: there was a feeling of great potential.  Here are some of my impressions.  (Other perspectives […]