Exploring the Data Universe

IDL Magics for the IPython Notebook

The IPython Notebook combines code, documentation, and computational results in one package that’s easy to share.  It’s proving a great way to teach, as notebooks are easy for the instructor to write and for students to modify.  Notebooks also provide seamless integration with other programming environments through extensions providing “magic functions”: short code prefixes beginning with % or %% that call other interpreters, like R or Octave.

Since many astronomers who might want to transition to Python already know IDL, I wanted to provide an %idl magic function for IPython so they could easily call existing code.  The difficult interfacing between IDL and Python was already done by Anthony Smith’s pIDLy package.  (As a bonus, it supports the free GDL interpreter as well.)  I adapted the Octave magic code to provide %idl magic functions with consistent syntax.  You can find the code on GitHub as ipython-idlmagic.

The demonstration notebook below is available on github as well, or you can view it with nbviewer.

Installation

To begin, we install pIDLy:

pip install pidly

Then we install idlmagic:

In [ ]:
%install_ext https://raw.github.com/ebellm/ipython-idlmagic/master/idlmagic.py

Usage

When starting a new notebook, we load the magic:

In [1]:
%load_ext idlmagic
IDL not found, using GDL

(I am using GDL rather than IDL on this computer. idlmagic will first look for the idl interpreter on the search path and fall back to gdl if needed.)
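The fallback described above can be sketched in a few lines of Python. This is only an illustration built on the standard library's shutil.which, not the code idlmagic itself runs; the function name find_interpreter is hypothetical.

```python
import shutil


def find_interpreter(candidates=("idl", "gdl")):
    """Return (name, path) for the first candidate found on the search path.

    Hypothetical helper: tries each executable name in order and returns
    None if none of them can be found.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None
```

Calling find_interpreter() on a machine with only GDL installed would return the gdl entry, mirroring the "IDL not found, using GDL" message above.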

Line magics

The %idl magic enables one-line execution of IDL commands in the IPython interpreter or notebook:

In [2]:
%idl print, findgen(5)
      0.00000      1.00000      2.00000      3.00000      4.00000

(Note that the %idl line magic fails with TypeError: coercing to Unicode: need string or buffer, dict found in current release versions of IPython (0.13.2 and below) due to a known bug; the github development version of IPython works as expected.)

Cell magics

Multi-line input can be entered with the %%idl cell magic:

In [3]:
%%idl
x = findgen(5)
y = x^2.
; comments are supported
print, $ ; as are line continuations
mean(y)
% Compiled module: MEAN.
      6.00000

Passing variables between Python and IDL

The mechanisms for passing variables to and from IDL are based on those in the built-in %R and %octave magics.

Variables may be pushed from Python into IDL with %idl_push:

In [4]:
msg = '  padded   string   '
import numpy as np
arr = np.arange(5)
In [5]:
%idl_push msg arr
In [6]:
%%idl
print, strcompress(msg,/REMOVE_ALL)
print, reverse(arr)
paddedstring
              4                     3                     2
              1                     0

Similarly, variables can be pulled from IDL back to Python with %idl_pull:

In [7]:
%idl arr += 1
In [8]:
%idl_pull arr
In [9]:
arr
Out[9]:
array([1, 2, 3, 4, 5])

Variables can also be pushed and pulled from IDL inline using the -i (or --input) and -o (or --output) flags:

In [10]:
Z = np.array([1, 4, 5, 10])
In [11]:
%idl -i Z -o W W = sqrt(Z)
In [12]:
W
Out[12]:
array([ 1.        ,  2.        ,  2.23606801,  3.1622777 ])
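To make the flag syntax concrete, here is a rough sketch of how a line such as -i Z -o W W = sqrt(Z) could be split into variables to push, variables to pull, and the IDL code to run. It is an assumption-laden illustration, not the parser idlmagic uses: the function name split_magic_line and the comma-separated variable lists are hypothetical.

```python
def split_magic_line(line):
    """Split leading -i/-o/-s flags from the IDL code that follows them.

    Hypothetical sketch: flags are only recognized at the start of the
    line, and runs of whitespace in the code are collapsed to one space.
    """
    names = {
        "-i": "inputs", "--input": "inputs",
        "-o": "outputs", "--output": "outputs",
        "-s": "size", "--size": "size",
    }
    parsed = {"inputs": [], "outputs": [], "size": "600,375"}
    tokens = line.split()
    pos = 0
    # Consume (flag, value) pairs until the first token that is not a flag.
    while pos + 1 < len(tokens) and tokens[pos] in names:
        key, value = names[tokens[pos]], tokens[pos + 1]
        if key == "size":
            parsed["size"] = value
        else:
            parsed[key].extend(value.split(","))
        pos += 2
    parsed["code"] = " ".join(tokens[pos:])
    return parsed
```

With this split in hand, a magic would push the inputs, run the code, and pull the outputs, which is the same sequence as the explicit %idl_push and %idl_pull calls shown earlier.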

Plotting

Inline plots are displayed automatically by the IPython notebook. IDL Direct Graphics are used. The optional -s width,height argument (or --size; default: 600,375) specifies the size of the resulting png image.

In [13]:
%%idl -s 400,400
plot,findgen(10),xtitle='X',ytitle='Y'
% Compiled module: WRITE_PNG.

Known issues and limitations

  • The %idl line magic fails with TypeError: coercing to Unicode: need string or buffer, dict found in current release versions of IPython (0.13.2 and below) due to a known bug; the github development version of IPython works as expected.
  • Only one plot can be rendered per cell
  • Processing for possibly unused plot output slows execution
  • Scalar variables from IDL may be returned as single-element Numpy arrays

The plotting capabilities are rather kludgy due to IDL’s old-school graphics routines.  I opted to implement Direct Graphics, which are the lowest common denominator supported by IDL and GDL.  Since I have to initialize the device before the plot call and close it afterwards, the %idl magic can only produce one plot output per notebook cell.  The plotting code produces overhead on non-plotting lines as well, unfortunately; I chose syntactic simplicity over execution speed for the time being.

Stay tuned to see it in action!

Bitten by Sample Selection Bias

At this week’s meeting of the High Energy Astrophysics Division of the American Astronomical Society, I learned that one of the teams analyzing data from a NASA observatory had run into trouble with their machine learning classifier.  Their problem illustrates one of the particular challenges of machine learning in astronomy: sample selection bias.

All-sky map of Fermi sources (Nolan et al. 2012).


At first glance, it looks like a classic machine learning problem.  The Fermi Gamma-Ray Space Telescope has performed a highly sensitive all-sky survey in high-energy gamma-rays and detected nearly two thousand sources, most of which are unidentified.  It’s expected that most of these will either be pulsars–rapidly-rotating neutron stars–or active galactic nuclei (AGN), accreting supermassive black holes in distant galaxies.  Many teams (including one I work with) are interested in detecting the Fermi pulsars in radio or optical bands.  We’re interested because many of these pulsars are proving to be rare “black widow” systems, where the pulsar is in the process of destroying its low-mass companion and it is possible to determine the mass of the neutron star.  However, telescope time is precious, so there needed to be a way to prioritize which systems to follow up first.

The pulsars and AGN are distinguished by their variability in the gamma-ray band and by the presence or absence of curvature in their spectra.  It seems obvious, then, to train a classifier on the data for the known systems and then use it to predict the class of the unknown systems, and several groups did just that.


Spectral curvature vs variability for Fermi sources by source class (Nolan et al. 2012).

However, when the Pulsar Search Consortium searched the newly-prioritized list, they found fewer new pulsars, not more.  The problem turned out to be a bias in the training set: the known sources used for training are brighter than the unknown ones being predicted.  In this case, the major effect was that the fit of the spectral curvature was inconclusive for dim sources–there weren’t enough counts to tell a curved line from a straight one, so the features fed into the classifier weren’t reliable.  As usual, garbage in, garbage out.
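A toy simulation makes the failure mode easier to see. The sketch below is not based on Fermi data: the curvature values, the flux ranges, and the assumption that measurement noise scales as 1/flux are all invented for illustration. It "trains" a simple cut on bright sources and then applies it to dim ones.

```python
import random


def simulate_sources(n, flux_range, rng):
    """Return (measured_curvature, is_pulsar) pairs for n toy sources."""
    sources = []
    for _ in range(n):
        is_pulsar = rng.random() < 0.5
        true_curvature = 1.0 if is_pulsar else 0.0  # toy class values
        flux = rng.uniform(*flux_range)
        # Assumption: the curvature measurement gets noisier as flux drops.
        measured = true_curvature + rng.gauss(0.0, 1.0 / flux)
        sources.append((measured, is_pulsar))
    return sources


def fit_threshold(sources):
    """'Train' by placing a cut midway between the two class means."""
    pulsars = [m for m, is_pulsar in sources if is_pulsar]
    agn = [m for m, is_pulsar in sources if not is_pulsar]
    return 0.5 * (sum(pulsars) / len(pulsars) + sum(agn) / len(agn))


def accuracy(sources, threshold):
    """Fraction of sources whose side of the cut matches their label."""
    correct = sum((m > threshold) == is_pulsar for m, is_pulsar in sources)
    return correct / len(sources)


rng = random.Random(42)
bright = simulate_sources(2000, (5.0, 10.0), rng)  # stands in for known sources
dim = simulate_sources(2000, (0.3, 1.0), rng)      # stands in for unknown sources

threshold = fit_threshold(bright)
acc_bright = accuracy(bright, threshold)
acc_dim = accuracy(dim, threshold)
```

In this toy setup the cut separates the bright sample almost perfectly but does far worse on the dim sample, even though the underlying classes are identical; only the reliability of the feature has changed.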

In astronomy, this situation is quite frequent: the “labelled data” you’d like to train a classifier on are often systematically different from the data you’re trying to classify.  They may be brighter, nearer, or observed with a different instrument, but in any case blindly extrapolating to new regimes is unlikely to yield reliable results.  Joey Richards and his collaborators describe this problem of sample selection bias in great detail in their excellent paper, where they focus on the challenge of classifying different types of variable stars from their light curves.  They find that iterative Active Learning approaches are most effective at developing reliable classifiers when the unlabelled and labelled data may be drawn from different populations.  In Active Learning, the classifier identifies the unlabelled instances whose classification would most improve the performance of the classifier as a whole.  These targets can then be followed up in detail, classified, and the process repeated.

This approach worked well for the variable star problem, where the features used in the classifier were valid for all the sources.  For the Fermi problem, the challenge is that one of the most informative parameters is unreliable for a subset of the sources.  In this case it might be more useful to develop additional features that might identify spectral characteristics even in the low-flux regime.
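The selection step in Active Learning described above can be illustrated with a small sketch. It uses uncertainty sampling, a common and simple heuristic that picks the sources the classifier is least sure about; this is a stand-in for illustration, not necessarily the selection criterion used in the variable star work, and the function name most_uncertain is hypothetical.

```python
def most_uncertain(probabilities, k):
    """Return indices of the k sources whose predicted class probability
    is closest to 0.5, i.e. where the classifier is least certain.

    probabilities: predicted probability of one class (say, pulsar)
    for each unlabelled source.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# One round of the loop, in outline:
#   1. train the classifier on the labelled sources
#   2. predict probabilities for the unlabelled sources
#   3. follow up the sources returned by most_uncertain(...)
#   4. add their new labels to the training set and repeat
```

For predicted probabilities of [0.95, 0.52, 0.10, 0.45, 0.70], the two sources selected for follow-up would be those at indices 1 and 3.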

Making Space in LaTeX Documents

A recent major proposal deadline gave me a chance to brush up on my LaTeX skills. As a rule, it’s better to make your proposal more concise than to play formatting tricks to squeeze more text in. For this proposal, though, I needed the big guns–for some sections the instructions alone were a significant fraction of the allotted space! Below are some tested methods for cramming more material into your page limit:

  1. Choose your text size. As a first step, be sure your text is set to the minimum allowed point size.
    \documentclass[11pt]{article}
  2. Get the margins right. The geometry package provides the easiest means to specify margins.
    \usepackage[paper=letterpaper, margin=1in, 
           nohead, pdftex]{geometry}

    If there aren’t firm margin requirements, the fullpage package is an alternative:

    \usepackage[cm]{fullpage}
  3. Compress your lists. Standard LaTeX list environments leave lots of whitespace between the items. The paralist package provides compactitem and compactenum, drop-in replacements for itemize and enumerate.
    \usepackage{paralist}
    
    \begin{compactitem}
    \item Item text.
    \end{compactitem}
  4. Use run-in headers. If your document is of any length, it’s helpful to organize it into parts, sections, subsections, and possibly even subsubsections. Standard LaTeX classes give each of these large headings on their own lines. The titlesec package provides an alternative: run-in headers. These appear in-line in the text, saving space. You can adjust the format and numbering with the \titleformat command. The commands below set up small-caps part headers on their own lines (“hang”), and variously sized run-in bold headings for sections.
    \usepackage[compact,medium]{titlesec}
    \titleformat{\part}[hang]
    {\Large\scshape}{}{0pt}{}
    \titleformat{\section}[runin]
    {\large\fontseries{b}\selectfont\filright}{}{0pt}{}
    \titleformat{\subsection}[runin]
    {\normalfont\fontseries{b}\selectfont\filright}{}{0pt}{}
    \titleformat{\subsubsection}[runin]
    {\fontshape{it}\selectfont\filright}{}{0pt}{}
  5. Use superscript citations. If your field allows it, no citation notation uses less space than superscripted numbers. The natbib package makes it easy.
    \usepackage[super,sort&compress]{natbib}
    \bibpunct{}{}{,}{s}{}{,}
  6. Put captions beside floats. Figures and tables can end up with unused whitespace on the sides. The excellent LaTeX Wikibook provides several suggestions, including using the wrapfig package to wrap the text or using a minipage or the sidecap package to move the caption beside the text. I have used floatrow to accomplish the same task:
    \usepackage{floatrow}
    
    \begin{figure}
    \floatbox[{\capbeside\thisfloatsetup{capbesideposition={right,center},capbesidewidth=0.46\textwidth}}]{figure}[\FBwidth]
    {\caption{
    {\small
    Caption text here.
    \label{fig:myfig}
    }}}
    {\includegraphics[width=0.52\textwidth]{myfig.pdf}}
    \end{figure}
  7. Single-space the bibliography. The code below removes the bibliography section label and single-spaces the entries: put it in the document header.
    % no title on bibliography header: this duplicates the section of
    % article.cls, removing the refname section
    \makeatletter
    \renewenvironment{thebibliography}[1]{%
    \list{\@biblabel{\@arabic\c@enumiv}}%
    {\settowidth\labelwidth{\@biblabel{#1}}%
    \leftmargin\labelwidth
    \advance\leftmargin\labelsep
    \@openbib@code
    \usecounter{enumiv}%
    \let\p@enumiv\@empty
    \renewcommand\theenumiv{\@arabic\c@enumiv}}%
    \sloppy
    \clubpenalty4000
    \@clubpenalty \clubpenalty
    \widowpenalty4000%
    \sfcode`\.\@m}
    {\def\@noitemerr
    {\@latex@warning{Empty `thebibliography' environment}}%
    \endlist}
    \makeatother
    
    % make bibliography single-spaced
    \let\oldthebibliography=\thebibliography
    \let\endoldthebibliography=\endthebibliography
    \renewenvironment{thebibliography}[1]{%
    \begin{oldthebibliography}{#1}%
    \setlength{\parskip}{0ex}%
    \setlength{\itemsep}{0ex}%
    }%
    {%
    \end{oldthebibliography}%
    }
  8. Use or make a compact bibliography style. Exclude anything you can from the reference list. I made a BibTeX style file which defaults to “et al.” anytime there is more than one author.
  9. Discourage floats from getting their own pages. LaTeX uses a number of numeric weights to calculate where to position floats. Fiddling with these parameters in the document header will encourage LaTeX to place them closer to each other and the text.
    % discourage floats from getting their own page
    \renewcommand\floatpagefraction{.9}
    \renewcommand\topfraction{.9}
    \renewcommand\bottomfraction{.9}
    \renewcommand\textfraction{.1}   
    \setcounter{totalnumber}{50}
    \setcounter{topnumber}{50}
    \setcounter{bottomnumber}{50}
    
    % shrink space between/after figures:
    \setlength{\textfloatsep}{10pt plus 1.0pt minus 2.0pt}
    \setlength{\floatsep}{10pt plus 1.0pt minus 2.0pt}
    \setlength{\intextsep}{10pt plus 1.0pt minus 2.0pt}
  10. Vacuum up the extra whitespace. Several more header parameters adjust white space between document elements.
    % Reduce space between section titles
    % Arguments are left margin, vertical space before, and space after
    \titlespacing{\part}{0pt}{*0}{2ex}
    \titlespacing{\section}{0pt}{2pt}{1ex}
    \titlespacing{\subsection}{0pt}{2pt}{1ex}
    \titlespacing{\subsubsection}{0pt}{*0}{1ex}
    
    % suck up extra white space
    \setlength{\parskip}{0pt}
    \setlength{\parsep}{0pt}
    \setlength{\headsep}{0pt}
    \setlength{\topskip}{0pt}
    \setlength{\topmargin}{0pt}
    \setlength{\topsep}{0pt}
    \setlength{\partopsep}{0pt}
  11. (Adjust or remove indentation.) Changing the paragraph indentation level can recover a few characters, but it makes the text harder to scan.
    \setlength{\parindent}{0in}
  12. (Black Hat: Shrink the inter-line spacing.) While I’m not comfortable with this measure and it makes the text hard to read, it is possible to make the line spacing less than one.
    \linespread{0.8}

As usual with LaTeX, there are multiple ways to accomplish the same goals–these are methods I personally have found convenient. The savetrees package provides an all-in-one solution which may be sufficient if you don’t want to tune the document style yourself.

The Data Science Core Curriculum

Jake Klamka spoke at Caltech a few months back about his Insight Data Science Fellowship–a program designed to help science PhDs transition into jobs in data science.  The program guides scientists in packaging skills they already have so that employers can easily see the relevance and value to their business.  Jake’s own initial difficulty getting a tech job inspired the program–after working as a particle physicist, he didn’t have any resources to help him determine which skills were important to learn and to help him get unstuck when he had problems.

There is no standard data science job, so there isn’t a standard set of skills for data scientists.  Still, Jake identified a common set of basic skills for scientists to build first as a foundation:

  • Python.  In many ways the lingua franca of programming today, Python is an excellent all-in-one tool for everything from scripting to web programming to statistical analysis.  R is a useful second language.
  • Databases.  SQL is a must.  The Hadoop ecosystem is important, but it’s probably more practical to learn on the job.
  • Visualization, particularly with d3.js. Scientists understand the importance of visualization in making sense of data; d3 is a modern, web-first framework for building interactive visualizations.  (Bonus: it helps you learn some JavaScript.)
  • Computer Science fundamentals–Most scientists have no formal CS training, but knowledge of basic algorithms and data structures is vital for working with large datasets.
  • Machine Learning–A high-level understanding of what’s possible will let you get started.

To these, I might add familiarity with the basic tools of software development (“software carpentry”), particularly version control and unit testing.  Scientists presumably have ample experience with statistics and quantitative methods.

Data science in practice encompasses a huge and growing range of tools and techniques, but this core curriculum provides a manageable start.  We live in a fortunate time–there are many excellent free online courses so you can learn these skills now, on your own time–and they’ll even be valuable in your current academic job!

Tracing the Changing State of the Union with Text Analysis

U.S. Presidents since George Washington have delivered State of the Union addresses each year to describe the nation’s condition and prioritize future action.  Can we glean historical patterns from the texts?  Do presidents speak similarly in times of war or depression?  Do Republicans and Democrats emphasize different words?  How does the evolution of American English affect the speeches?

To explore these questions, I performed textual analysis of all of the State of the Union Addresses given through 2012.  (For those interested, technical details are at the end of the post.)  I broke the addresses into words, removing words like “the” and “and” and weighting uncommon words more highly.  Then I measured the similarity of each pair of presidents by counting the words they used in common.
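The procedure just described can be sketched with the standard library alone. This is a rough illustration rather than the analysis behind the figures: the tiny stop-word list, the smoothed TF-IDF weighting, and the use of cosine similarity to compare weighted word counts are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumption: a tiny stop-word list; a real analysis would use a longer one.
STOP_WORDS = {"the", "and", "of", "to", "a", "in"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(documents):
    """Weight each word count so that words found in few documents count more."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter(word for c in counts for word in c)
    n_docs = len(documents)
    vectors = []
    for c in counts:
        vectors.append({
            word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, tf in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse word-weight vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Applying cosine to every pair of vectors returned by tfidf_vectors gives a symmetric similarity matrix whose diagonal is 1, which is the kind of table a pairwise similarity figure can be drawn from.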

This figure shows the pairwise similarities for presidential State of the Union addresses–click for an interactive version.  Blue squares indicate that the presidents used few words in common, while white and red imply more overlap in vocabulary.  The red diagonal line is the similarity of each president to himself, which is obviously 100%.  I’ve color-coded the presidents by their political party.

Several interesting trends appear.  First, the dominant effect is that presidents nearer in time use more similar words.  Second, there seems to be a general separation before and after the early part of the 20th century (Wilson-Hoover): there is a great deal of overlap among the early presidents and among the post-WWII presidents, but less similarity between those groups.

We can look for bulk similarity and difference by aggregating the similarities for each president and looking at the range of values (excluding the self-comparison):

Few presidents seem to generate unusual influence on later addresses.  However, some presidents use noticeably different language than their peers, including John Adams, James Polk, Warren Harding, and George W. Bush.

Finally, each president’s words define a direction in vocabulary space.  We can project the presidents’ addresses into two dimensions, spacing the points by their relative distances in vocabulary space:

Closer presidents in this projection are more similar, so we can easily see that temporal order rather than political party seems to be the dominant effect influencing the language that presidents use in their State of the Union addresses.


Python for IDL Users I: Ecosystem

Python is often the language of choice for today’s cutting-edge astronomical software.  Scientists wishing to take advantage of this powerful and growing ecosystem face the hurdle of learning a new programming language.  Thankfully, with the rapid growth of scientific Python, a number of excellent comprehensive tutorials have been developed, many particularly for astronomers:

All of these are great resources for getting up to speed in Python.  Here, I’d like to cover similar ground (installation, packaging, language features, etc.), but focusing explicitly on how Python differs from the language most astronomers know best: IDL.  My hope is that direct comparison and contrast with a familiar language will make the learning curve easier.

Installation

The installation process for IDL is more straightforward than that of Python, and that fact likely deters busy scientists from even giving Python a try [1].  IDL programs run on a proprietary interpreter sold by Exelis (formerly RSI, then ITT).  Accordingly, your system administrator has to manage software licenses, so they have likely already installed IDL on public servers and maybe even set your IDL_PATH to point to the IDL Astronomy User’s Library.  In this case, you don’t install IDL at all!  Likewise, installing IDL on your laptop generally just requires a simple point-and-click installer, configuring a license file, and setting your IDL_PATH to point to appropriate libraries, which are just directories of IDL programs.

Python, in contrast, is free software that is under active development.  Moreover, the packages useful for science are not part of the core language runtime and are developed independently.  This situation is an advantage for Python insofar as it enables a broad software ecosystem, but it makes installation more complex.  (Also, Python packages often glue in existing C, C++, and FORTRAN libraries which need to compile during installation.)

One additional complication–hopefully to resolve itself naturally soon–is the Python version split.  Python 3 introduced backwards-incompatible changes to the language.  Since most of the power of Python is in the external packages, scientific migration to Python 3 has been slow.  Accordingly, for now it’s best to continue working in Python 2.7 (the current and final pre-Python 3 release) rather than Python 3.3 (the newest version).

There are several methods of installing Python and its external packages on Mac OS X and Linux; I have ordered these roughly by popularity.  (I don’t have enough experience with Python on Windows to make recommendations.)

  • Use a package manager.  In my opinion, this is the best choice for long-term Python use.  Mac OS X programs like MacPorts (or familiar Linux equivalents like apt-get or yum) make it easy to keep installed packages up-to-date.  These work primarily on the command line (and install non-Python software as well).  You choose which packages to install (e.g., sudo port install py27-numpy), and at later times running sudo port selfupdate followed by sudo port upgrade outdated will bring all of your libraries up to date.  The package manager handles all the tricky details of library dependencies, compilers, and paths for you.  An excellent step-by-step guide for installing astronomical python via MacPorts is here.
  • Use an all-in-one installer.  Many astronomers appreciate the one-click convenience provided by the Enthought Python Distribution (EPD) [2], which bundles together many (but not all) of the major scientific python libraries.  I have personally found it convenient for getting a new version of python on shared servers where I didn’t have root access.  However, EPD costs $199 for a single-user license.  While this fee supports further Python development, it means that the EPD distribution requires license files, even for the free version offered to academics.  My experience has been that this licensing complicates upgrading packages or EPD itself and negates Python’s advantage of being free to install on all of your systems.
  • Maintain your installations manually.  It’s possible–but far more complicated–to maintain your packages manually.  Manual installation isn’t beginner-friendly, though–not all programs are packaged in the same way, and you have to manage dependencies, upgrades, and paths yourself.  If you go this route, pip is currently the installation method of choice, and virtualenv is recommended to create isolated Python environments.  See the installation section of this guide for more tips.
  • Use the built-in system Python.  Not recommended!  While both Mac OS X and Linux are likely to have system versions of Python installed, they may change in system upgrades and break your code.  Moreover, the packages you install may cause conflicts with routines your computer depends on.  If you just want to play with the language, though, simply typing python in a terminal will give you a way to start.

Next time: IDL and Python, head to head!

Getting a new OS X Finder window in the current Space

Mac OS X provides multiple desktops via the Spaces feature.  I use this feature all the time to keep my mental workspace uncluttered: typically I have one Space with a web browser, one or two more with terminals (iTerm2 is great), and another with documents or presentations.

In Apple’s Spaces model, if you click on the icon of an already-running program in the Dock, it takes you to the Space where that application’s windows are.  For example, if I’m in Space 3 and have iTunes open in Space 1, clicking the iTunes icon will move me to Space 1.

This automatic space-switching behavior is not what I want with Finder.  When I click on the Finder in the Dock I want a new Finder window in the current space.  What happens instead is unpredictable.  If I have no Finder windows open in any space, I get what I want:  a new Finder window where I am now.  However, if I have forgotten Finder windows in other Spaces, OS X will zip me away from what I’m doing to those old Finder windows.  That unexpected switch destroys whatever sense of flow I have as I drag back to my current Space.

Two methods for getting a new Finder window in the current space don’t work well for me personally.  The first is to right-click on the Finder icon in the Dock and select “New Finder Window.”  The second is to switch to Finder (either by Cmd-Tab or by clicking on the desktop), then hit Cmd-N.  Both methods are too cumbersome for me, though they might satisfy you.

Instead, I found an Applescript method to produce a new Finder window.  I modified the code slightly to give focus to the new window and to start in my chosen directory:

on run
    tell application "Finder"
        set NewWindow to make new Finder window
        set target of NewWindow to "Macintosh HD:Users:username:path"
        activate
    end tell
end run

Open AppleScript Editor (via Spotlight is easiest) and paste in the code.  (Replace username with your username and the path as well.)  Save the file as something like new_finder_applescript in your Applications folder.  Set the format to Application and make sure that “Stay open after run handler” is unchecked.

Next, let’s replace the default AppleScript icon with something better.  I like this CC-licensed Finder icon by Gordon Irving.  Download the .icns version.  Open your Applications folder, Ctrl-click (or right-click) on new_finder_applescript, and select “Get Info.”  In the top left will be the default “rolled paper” AppleScript icon.  Drag and drop the new icon file onto the default icon…

and the new icon should appear.

Finally, drag the new_finder_applescript icon into the Dock.  I like to place it just below the standard Finder (which can’t be removed from the Dock without terminal hacking).

Now, a quick click on your new icon will give you a Finder window in the current Space, no matter what!  It’s possible to bind your script to a hotkey with third-party programs like Quicksilver or FastScripts as well.

Four Short Links: An Appreciation

Twitter, Prismatic, Google Reader, and Hacker News keep me up to the minute on the tech news I care about.  While I tune my feeds to give me more signal and less noise, popular links frequently repeat: five people I follow on Twitter link to a story, then I find it on top of my Prismatic feed, and later it’s on the front page of Hacker News.

I therefore treasure orthogonal sources that break out of that echo chamber.  The “Four Short Links” series on the O’Reilly Radar blog is one of my favorites.  Every day Nat Torkington offers four links to under-the-radar news items joined with description and commentary of masterful brevity.  It holds pride of place among my RSS subscriptions; maybe you’ll like it, too.

Scientists: Career Change Starts Now

Jessica Kirkpatrick wrote a great post describing her move from astronomical research to a data science job at Yammer.  (We were classmates in grad school.)  She discusses the technical skills she needed to learn (IDL alone won’t get you a tech job) as well as the differences between business and academic culture.  (Peter Fiske’s book Put Your Science to Work can help guide scientists through that cultural translation.)

In the comments, a finance recruiter added a key point:

The only thing I would add is come prepared to explain your motivations for wanting to move to industry. It’s important you can convince your future employer that you are moving for the right reasons.

I’ve heard that refrain a lot lately:  from panelists discussing aerospace careers for astronomers at the AAS meeting, from Jake Klamka of the Insight Data Science Fellowship, from professors describing jobs at teaching-focused colleges.  Apparently it’s surprisingly common for scientists to come into interviews with an attitude of “My research funding ran out, but I’m super smart and ready to make some real money.  What do you do here again?”

Obviously that approach insults your potential coworkers.  It’s far more productive to demonstrate your enthusiasm for the job and the goals of the company.  The more your background differs from the job you’re applying for, the more effort you need to make to show those hiring that you can do the job and that you’ll fit in the company culture [1].

One of the staples of interview advice is to substantiate your claims with stories.  Accordingly, the best way to show your interest in your new industry is to point to previous work you’ve done in the field!  That could be an internship or volunteer work, a side project you did on your own time, a subject-focused blog you maintain, or attendance at a conference or meetup group.  (These explorations will also help you confirm your interest in the industry.)  Thus, smooth career changes out of academia begin with groundwork laid well before you actually change fields.

For data science, it’s particularly easy: as Hilary Mason points out, you can start doing data science right now!  There are many publicly-available datasets and APIs, and many of the standard tools are open-source.  Brush up on your machine learning and join a Kaggle competition, or try making an engaging visualization.  Creative projects are fun, teach you new skills, and make you an easier hire all at once!

Ideas for Improving Your Scientific Visualizations

Scientific graphics are one of the most important means we have of communicating complicated quantitative information.  Here are a few ideas for improving the effectiveness of your figures:

  • Learn a new feature of your graphics package.  Most of us only use a fraction of the capabilities of our graphics programs.  There’s much to be gained from digging around in the documentation.  Users of python’s matplotlib might want to play around with the AxesGrid toolkit, learn about animation or widget capabilities, or look through these recipes.  IDL users could define more sensible plotting defaults.
  • Read a book.  Edward Tufte’s Visual Display of Quantitative Information is rightly a classic.  Its principles for evaluating the information density of figures will change how you think about graphical communication.  Nathan Yau’s Visualize This (and his website, FlowingData) provide a look at today’s cutting edge of information graphics.
  • Learn about color perception.  Scientists love rainbow colormaps.  However, rainbow colormaps create perceptual artifacts that obscure the very trends you’re trying to display.  Perceptual colormaps are a better alternative.  The ColorBrewer colormaps are the most well-known.  Python’s matplotlib has them built-in (try help(plt.colormaps)), and versions are available for IDL as well.
  • Use alpha transparency.  Most modern graphics packages support semi-transparent elements via an alpha keyword, typically specified between 0.0 (completely transparent) and 1.0 (completely opaque).  Making lines or scatterplot points semi-transparent can improve the clarity of the plot if there is overlap or variation in point density.
  • Learn about web visualization.  Static figures aren’t going anywhere–they’re easy to print, share, and display on slides.  However, both the true day-to-day practice of science and the needs of large datasets require an exploratory approach to visualization, and interactive analysis needs sophisticated software.  The most nearly universal means to share interactive visualizations is via the browser, with its widely available JavaScript runtime.
    Today, the d3.js library is the most popular JavaScript visualization library.  It is hugely capable, though with its power comes complexity.  Check out an online tutorial or a book to get started.
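To see why the alpha transparency suggestion above helps with overlapping points, it is worth working through the arithmetic. The sketch below assumes standard "over" compositing, in which each new layer covers a fraction alpha of whatever still shows through; the function name stacked_opacity is hypothetical and no plotting library is involved.

```python
def stacked_opacity(alpha, n_overlapping):
    """Combined opacity of n identical semi-transparent markers drawn
    on top of one another, assuming standard 'over' compositing.

    Each layer lets a fraction (1 - alpha) of the background through,
    so n layers let (1 - alpha) ** n through.
    """
    return 1.0 - (1.0 - alpha) ** n_overlapping
```

With alpha = 0.1, a single point is 10% opaque while ten coincident points are about 65% opaque, so the darkness of a region tracks the local point density instead of saturating after the first marker.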