/data/universe/

June 3, 2011

Visualizing Social Networks I: LinkedIn

Part 1 of 3

Humans are ubiquitously social animals. Even our identities are socially informed: when asked to describe ourselves, many of us would mention familial relationships (“a husband,” “a mother,” “a sister”), the culture we are from, or the professional community implied by our work. John Donne’s famous quote “No man is an island” (from his Meditation XVII) evokes this rich interconnection. Networks form from friendships, work relationships, academic citations, web links, email and phone communications, and many other relationships. Today, many social networks—with and without “real-world” analogues—are present online.

Social network analysis provides insight into such communities. Using a detailed view of the relationships (“edges”) between individuals (“nodes”), mathematical tools enable one to assess how tightly knit the network is, identify influential or anomalous individuals, and find smaller communities. Network analysis is widely used in the social web to increase the relevance of the information displayed to us (e.g., Facebook’s EdgeRank). However, today both the tools and the network data themselves are also available to non-professionals.
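To make the vocabulary of nodes and edges concrete, here is a minimal sketch in plain Python. The five-person network is invented for illustration, and density and degree centrality are only two of the simplest measures of how tightly knit a network is and who is well connected in it.

```python
# Hypothetical toy network: each pair is an undirected edge between two nodes.
edges = [
    ("ana", "ben"), ("ana", "cai"), ("ben", "cai"),  # a tightly knit trio
    ("cai", "dev"),                                   # a bridge out of the trio
    ("dev", "eli"),
]

# Build an adjacency map: node -> set of neighbors.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Density: the fraction of all possible ties that actually exist.
n = len(neighbors)
possible_edges = n * (n - 1) // 2
density = len(edges) / possible_edges

# Degree centrality: the fraction of the other nodes each node is tied to.
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}
most_connected = max(degree_centrality, key=degree_centrality.get)

print(f"density = {density}, most connected node = {most_connected}")
```

For anything beyond a toy graph, a dedicated network analysis library would be the more sensible choice than hand-rolled dictionaries.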

In this three-part series, I generate and compare network plots for three social networks in which I participate: LinkedIn, Facebook, and Twitter. As we will see, even the basic step of making network maps allows us to make inferences about the potential utility of the networks.

I joined LinkedIn earlier this year as I completed my PhD, and my network on it is not large. In this case, though, the small size is not a function of how long I’ve used the service: LinkedIn’s data science team is outstanding, and their People You May Know feature rapidly identified my contacts. Rather, not very many people I know presently use the service.

The plot above is known as an “ego network,” potentially because it makes me look like the most important person in it. It consists of me, all my contacts, and the connections between those contacts. I generated this map with LinkedIn’s InMaps feature. (It’s not possible to extract this data, as the LinkedIn API does not give access to one’s contacts’ connections.)
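The definition of an ego network translates almost directly into code. The sketch below assumes the full network is already available as an adjacency map (which, as noted, LinkedIn does not provide); the names and the `ego_network` helper are hypothetical.

```python
def ego_network(neighbors, ego):
    """Return the members and edges of the ego network around `ego`.

    Members are the ego plus its direct contacts; edges are the ties
    among those members only.
    """
    members = {ego} | neighbors[ego]
    edges = set()
    for node in members:
        # Keep only ties whose other end is also inside the ego network.
        for other in neighbors[node] & members:
            edges.add(frozenset((node, other)))
    return members, edges


# Hypothetical data: "me" knows ana, ben, and cai; zoe is a friend of a friend.
neighbors = {
    "me": {"ana", "ben", "cai"},
    "ana": {"me", "ben"},
    "ben": {"me", "ana"},
    "cai": {"me", "zoe"},
    "zoe": {"cai"},
}

members, edges = ego_network(neighbors, "me")
print(sorted(members))
print(len(edges), "edges")
```

Here zoe is left out because she is not a direct contact of the ego, while the ana-ben tie is kept because both ends are.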

In contrast with the Facebook graph to come, the LinkedIn graph is sparse, with one large and several smaller, weakly-connected cliques. InMaps has attempted to identify communities within the data, coding them by color. Here, the names of my contacts are removed for privacy, but the large blue clique is friends of mine from a graduate school club. Except for a handful of family and hometown friends, the remainder are my college friends. However, there are not enough intermediate ties between them to join them in a single clique. This LinkedIn graph is essentially a subset of my Facebook graph, where we will see more distinct clustering.
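One crude way to see such groups in data is to remove the ego and check which contacts are still connected to one another. The sketch below does this with connected components. That is a much simpler technique than whatever InMaps uses to color communities, and the contact names and the `contact_groups` helper are made up for illustration.

```python
def contact_groups(neighbors, ego):
    """Group the ego's contacts by connectivity once the ego is removed."""
    contacts = set(neighbors[ego])
    unvisited = set(contacts)
    groups = []
    while unvisited:
        start = unvisited.pop()
        group, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            # Only follow ties that stay among the ego's contacts.
            for other in (neighbors[node] & contacts) - group:
                group.add(other)
                frontier.append(other)
        unvisited -= group
        groups.append(group)
    # Largest group first.
    return sorted(groups, key=len, reverse=True)


# Hypothetical contacts: three grad school friends, two college friends,
# and one family member who knows none of the others.
neighbors = {
    "me": {"g1", "g2", "g3", "c1", "c2", "fam"},
    "g1": {"me", "g2", "g3"},
    "g2": {"me", "g1", "g3"},
    "g3": {"me", "g1", "g2"},
    "c1": {"me", "c2"},
    "c2": {"me", "c1"},
    "fam": {"me"},
}

groups = contact_groups(neighbors, "me")
print([sorted(g) for g in groups])
```

With no intermediate ties between the grad school and college friends, they come out as separate groups; a single mutual acquaintance would merge them.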

Networking for job-seeking famously depends on weak ties, and LinkedIn is set up to provide introductions to people through your connections. The representation of my social circles in LinkedIn is small, however, limiting the reach of the network.

(This basic “look! there’s a clique!” analysis is less data science than force-directed pastafarianism, but it’s the first step for a beginner on the journey to more quantitative analyses.)

Continue to Part 2.

June 1, 2011

Python for Astronomy Follow-Up

My post outlining the reasons why I think Python should be the language of choice in astronomy today got a lot more attention than I expected thanks to discussion on Hacker News and a link from Guido himself. Psychological factors encourage particular attachment to one’s tool choices, but the discussion has been constructive.

Some commenters noted that Python adoption by astronomers is widespread and accelerating. Certainly the community is larger and the available libraries more mature today than when I transitioned to Python in 2008. Having recently seen new students being encouraged to develop in IDL, however, I think there’s a need for advocacy to capitalize on the momentum of astronomical Python.

Other researchers, mostly in other fields, made the case for their own preferred languages. In my view, none are as suitable today as Python for adoption in astronomy.

As scripting languages, Perl and Ruby are similar to Python. Perl’s syntax is not particularly beginner-friendly, though, and it can be a “write-only” language, which makes it poor for scientific collaboration. For historical reasons, there are currently fewer astronomical libraries in Ruby.
MATLAB shares the major disadvantages of IDL: it’s a proprietary language built around an array data type. While widely used in engineering, it has little installed base in astronomy.
For those in high-energy particle physics, C++ and the ROOT libraries are a necessity. Having worked extensively with Monte Carlo software built on this stack, I can attest that their power comes with a steep learning curve. I suspect even students planning to work with accelerator-scale data would be well-served if they started learning programming with basic Python scripting.

A comment on Hacker News summarized nicely the strength of Python as an all-purpose language gluing together scientific analysis:

SciPy/NumPy, PyROOT, PyFITS are all unbeatable tools for anything in Physics or Astronomy as far as I’m concerned. Throw some knowledge of C in with that, and you can scale anything up to supercomputing clusters or back down to your laptop, and that’s a very important, powerful thing for a scientist.

May 27, 2011

Why Astronomers Should Program in Python

The training and career outcomes of astronomy students make Python the current best-choice language for new development and analysis scripting. Two realities about academic astronomy allow us to evaluate the success of language choices, and Python is a clear winner.

1. Astronomers do not receive any formal training in programming, computer science, or “software carpentry.” While practicing astronomers spend significant time writing code for analysis, few undergraduate or graduate programs require even a semester of instruction in basic programming. A 2008 survey of scientists found that while they spent 30% of their time developing software, 97% were self-taught programmers [Hannay et al. 2009]. (The situation is unfortunately quite similar with statistics.) Instead, programming knowledge is passed down informally within research groups, limiting the development of true expertise.
   Astronomers need a language which is beginner-friendly, yet powerful.
2. Most astronomy students do not continue into long-term careers as astronomers. While most astronomy PhDs can obtain postdoctoral positions, the number of permanent academic positions available is far lower.
   Students’ career prospects outside of astronomy will be improved if they have experience with a language used in other fields.

Of course, the language of choice should also enable the best science possible under time and cost constraints. (These arguments apply for other fields of science to the degree that #1 and 2 are valid.)

Many astronomers currently program in IDL, a proprietary array-based language used mainly in astronomy, geophysics, and medical imaging. Python is clearly better than IDL in terms of power and widespread adoption, and I believe its beginner-friendliness also makes it a better choice than other potential “primary” languages.

Some of Python’s advantages:

It’s beginner-friendly. Python code is usually straightforward to read, and the language goals focus on clarity. Because Python is interpreted, students can learn the language syntax interactively without waiting for compilation, greatly speeding the learning process. Basic tutorials are widely available for free on the web, and many questions have answers a quick Google or Stack Overflow search away. Python is free of the memory allocation problems and tricky pointer arithmetic one encounters in C or C++ that can confound the beginner. Finally, because Python is free software, it’s straightforward to install on one’s personal computer without the challenges of license files and authentication.
Python is a language to grow into. Despite being beginner-friendly, Python is not a lightweight language. Experience with Python exposes a student to techniques in object-oriented and functional programming styles and introduces a variety of data structures. Debugging tools and unit testing frameworks are readily available.
With “batteries included,” Python enables powerful analyses. The Python ecosystem is enormous, including both native libraries and interface layers to other packages which allow scientists to leverage others’ work. With libraries for array manipulation, scientific programming, 3D and 2D plotting, numerical methods, web programming, database integration, interfaces to C and FORTRAN code, GUI programming, symbolic math, machine learning, MCMC, network analysis, and much more, the possibilities are limited by one’s imagination.
Python is widely used in science and industry. The main Python site provides many examples.
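As a small illustration of the “batteries included” point, the sketch below reads a tiny table and summarizes it using only the standard library. The star names and magnitudes are invented, and real analyses would more likely reach for NumPy, SciPy, and PyFITS than for the `csv` and `statistics` modules used here.

```python
import csv
import io
import statistics

# Hypothetical catalog: star name and apparent magnitude, invented for illustration.
catalog = io.StringIO(
    "name,magnitude\n"
    "star_a,4.2\n"
    "star_b,5.1\n"
    "star_c,3.9\n"
    "star_d,6.0\n"
)

# Parse the table into one dictionary per row, keyed by the header line.
rows = list(csv.DictReader(catalog))
magnitudes = [float(row["magnitude"]) for row in rows]

# Lower magnitude means brighter, so the brightest star has the minimum value.
brightest = min(rows, key=lambda row: float(row["magnitude"]))["name"]
median_magnitude = statistics.median(magnitudes)

print(f"{len(rows)} stars, median magnitude {median_magnitude:.2f}, brightest: {brightest}")
```

The same few lines can be typed one at a time at the interactive prompt, which is much of what makes the language approachable for a student.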

Obviously, Python is not the right choice in all circumstances. If an analysis needs to interface with a large body of existing code, it generally makes sense to work in that language. Similarly, performance constraints may require a compiled language in some cases. Senior astronomers driven by deadlines will want to stick with languages they already know rather than fight with a new one. In general, though, astronomy would be well-served if most new code were written in Python.

While it is easier to train new students using a language one knows, for the reasons above new students should be encouraged to learn Python, even when that creates an “impedance mismatch” between the advisor and the student. Astronomical use of Python is growing rapidly, and groups fluent in it will have an advantage.

Astronomy-specific Python code and guides are proliferating. The CfA has a tutorial, and there are other resources here, here, here, and here. Python interfaces to key tools like IRAF, ds9, and FITS exist, as do translation guides for the IDL Astronomy Library.

Better training and career guidance for students of astronomy are important long-range goals for astronomy departments. Here and now, astronomy students can individually improve their prospects and their science by programming in Python.

Update, 6/1: following up some comments.

February 15, 2011

Strata 2011

The siren song of “Big Data” lured me to Santa Clara last week for the first O’Reilly Strata conference. I’m finishing a physics PhD, and I was curious what possibilities might await someone with my background^[1]. The atmosphere was exciting: there was a feeling of great potential. Here are some of my impressions. (Other perspectives can be found here and here, with a meta-list here.)

Sessions

I spent the first day in the Data Bootcamp. Since I’m coming from academic science, my conference goals were to familiarize myself with some of the tools and techniques and to survey the general landscape. Topics like k-means and k-nearest neighbors are pretty simple to grasp, and the bootcamp’s quick introduction made basic network analysis seem more approachable as well. The presenters emphasized that quick-and-dirty, 80% solutions are often highly effective. Joseph Adler’s presentation on Big Data focused on methods for shrinking the problem to something tractable: the overhead and expertise needed for Hadoop et al. are often overkill. Logistically, the coding exercises were difficult to get running on the fly (hard to download on conference wifi), but I’m excited to play with the extensive example code from the session.

For the second and third days, I was in the morning keynotes and then mainly the “practitioner” sessions. I wish I could have gone to a few of the visualization talks! Some of the best sessions, from my perspective:

Mark Madsen (Third Nature) pointed out that for data to be useful, its insights must be applied within the sociology of an organization: data is political insofar as it guides choices and actions. [video]
DJ Patil of LinkedIn has built what is by consensus one of the strongest data science teams around. They’ve got a big network full of high-quality data, and they have organized their data science group as a top-level product team so they can ship products. They launched LinkedIn Skills during the talk, which gives a great way to uncover trends, geographic clusters, key people, and related expertise for all kinds of skills (e.g., Hadoop). [video]
In contrast, Flip Kromer of Infochimps talked about the realities of “Data Science on a Shoestring.” Given the high demand for data scientists, bootstrapped startups can “a) recruit experienced people at founder equity or b) hire undervalued talent and grow their own.” Lacking an ability to hire those with traditional coding chops, they look for people who “1) have the ‘get shit done’ gene 2) are passionate learners and 3) are fun to work with.” New hires can fail big, but in parallel: the organization is programmer-fault tolerant. I was interested in their hackerly solution to a human resources problem and their willingness to disregard conventional wisdom about software engineering best practices in order to implement it.
Joseph Turian (MetaOptimize) described some exciting new algorithmic developments not yet being applied in practice. The four techniques he described (“Deep Learning,” semantic hashing, graph parallelism, and unsupervised semantic parsing) all seem to have huge potential. His MetaOptimize Q&A site is a key resource for those interested in machine learning and natural language processing. [slides]

Tools

Hadoop is the elephant in the room for big data. The ecosystem surrounding it seems much larger than that of any other Map/Reduce implementation.
Python (plus the NumPy/SciPy/Matplotlib stack) is used surprisingly frequently: as a general lingua franca, for glue code, and for end-to-end analysis of moderately sized data.
R is a favorite as well, particularly for fancy math/stats.
DataWrangler is an impressive tool debuted at the conference by Joe Hellerstein of U.C. Berkeley. It simplifies the often-painful process of munging data into a usable form by providing interactive manipulation of the source file and live previews of the transformations. Pointing to a particularly malformed file, Hellerstein said, “You have PhDs spending time turning this into a matrix.” DataWrangler should speed that process.

One-liners

“clean data > more data > fancier math… which is sad, because fancy math is awesome.” —Hilary Mason
“Asking to move or correct your data is like wishing to be invisible: it would be cool, but we haven’t learned how.” —Tim O’Reilly

Summary

I came away from the conference with great enthusiasm for the potential of the data business^[2]. For those of us who are both technical and quantitative, there are lots of opportunities. There’s lots to learn but plenty of resources with which to do so. Watch this space for my own explorations!

February 11, 2011

I am a part of all that I have met

No man is an Iland, intire of it selfe; every man is a peece of the Continent, a part of the maine; if a Clod bee washed away by the Sea, Europe is the lesse, as well as if a Promontorie were, as well as if a Mannor of thy friends or of thine owne were; any mans death diminishes me, because I am involved in Mankinde; And therefore never send to know for whom the bell tolls; It tolls for thee.
–John Donne, Meditation XVII
