Exploring the Data Universe

How is the US astronomy career pipeline changing?

Recently, the American Astronomical Society’s Committee on the Status of Women in Astronomy (CSWA) released a report on the demographics of US astronomers throughout the academic career cycle: graduate students, postdocs, and the various ranks of professors.  The major goal of the report (written by my friend, Prof. Meredith Hughes) was to assess the progress of women through the “pipeline” as a function of time: are women moving into the professor level in proportion to their increasing representation at the graduate student and postdoc levels, or are they “leaking out” of the pipeline?  The full report, summarized in this blog post, addresses this important question.

I was interested in a more basic question: how has the size of the pipeline itself changed over time?  That is, how many more (or fewer) grad students and postdocs are working in US astronomy compared to the number of professors over time?  The report provides proportions of the total number of men and women at each career stage by year (in 1992, 1999, 2003, and 2013), but I was curious about the totals.  Since the survey only covers a limited sample of institutions, it doesn’t represent the totality of the US astronomy job market, but it should provide a useful look.

In [10]:
%matplotlib inline
import pandas
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8, 6)

Below is the data from Figures 2 and 3 of the CSWA report. Because the 2013 survey adds 8 universities and 3 research institutes to the 32 institutions surveyed in previous years, I scale the 2013 values down by that fraction. Without the raw data, it’s hard to know whether the additional institutions (such as Goddard) bias the career stage proportions relative to previous years. The direct scaling below should provide at least a basic level of year-to-year consistency.

In [7]:
scale_2013 = 32./(32+8+3)
df_women = pandas.DataFrame.from_dict({"year":[1992,1999,2003,2013],"grad":[176,217,269,325],
                                       "postdoc":[63,90,137,145],"assistant":[29,45,34,34],
                                       "associate":[18,37,40,44],"full":[23,37,60,71]})
df_men = pandas.DataFrame.from_dict({"year":[1992,1999,2003,2013],"grad":[602,616,549,625],
                                     "postdoc":[301,359,473,377],"assistant":[140,212,182,96],
                                     "associate":[162,220,157,187],"full":[421,511,544,426]})
df = df_men + df_women
# adding the frames doubles the 'year' column; halve it to recover the survey year
df.index = df['year']/2
del df['year']
# scale the 2013 row down to the institution sample used in earlier surveys
df.loc[df.index==2013] *= scale_2013
df
Out[7]:
assistant associate full grad postdoc
year
1992 169.000000 180.000000 444.000000 778.000000 364.000000
1999 257.000000 257.000000 548.000000 833.000000 449.000000
2003 216.000000 197.000000 604.000000 818.000000 610.000000
2013 96.744186 171.906977 369.860465 706.976744 388.465116

4 rows × 5 columns

In [8]:
def outside_legend():
    # Shrink the current axis width by 20% to make room for the legend
    ax = plt.gca()
    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])

    # Put a legend to the right of the current axis
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

First, I plot total numbers of astronomers (men + women) in each career stage by survey year. I’ve lumped the “associate” and “full” professor categories together as “post-tenure,” although this may not be appropriate at all institutions. Note also that for research institutions, “professor” may indicate a staff position.

In [13]:
plt.plot(df.index,df["grad"],"+-",label="Grads")
plt.plot(df.index,df["postdoc"],"+-",label="Postdocs")
plt.plot(df.index,df["assistant"],"+-",label="Pre-Tenure Profs")
plt.plot(df.index,(df["associate"]+df["full"]),"+-",label="Post-Tenure Profs")
plt.ylim(0,1100)
plt.xlabel("Year")
plt.ylabel("Number in survey year")
outside_legend()

To my surprise, the number of astronomers at all career stages appears to have declined from a peak in the early 2000s, presumably reflecting flat funding profiles in the US and the lingering effects of the financial crisis.

Comparing raw numbers is somewhat misleading, however, as professors remain in that career stage far longer than postdocs and grad students. We can get a sense of the size of a “cohort” by dividing by a rough average number of years that astronomers remain in a career stage before moving on (or out of the academic pipeline). I have used 6 years for grad students, 3 for postdocs (although taking a second postdoc has become increasingly common in the last two decades), 7 for pre-tenure professors, and 28 for tenured professors.  Dividing by these values gives the rough number of individuals entering (and leaving, in steady state) a given career stage each year.

In [14]:
plt.plot(df.index,df["grad"]/6.,"+-",label="Grads")
plt.plot(df.index,df["postdoc"]/3.,"+-",label="Postdocs")
plt.plot(df.index,df["assistant"]/7.,"+-",label="Pre-Tenure Profs")
plt.plot(df.index,(df["associate"]+df["full"])/28.,"+-",label="Post-Tenure Profs")
plt.xlabel("Year")
plt.ylabel("Individuals per cohort")
outside_legend()

This plot highlights a large excess of postdocs in the early 2000s, presumably due to the influx of funds from NASA’s Great Observatories (Chandra launched in 1999 and Spitzer in 2003).

Finally, we come to the sticky question: is the pipeline to professorship wider or narrower than it used to be?

Ideally, one would track a defined group through time–surveying the career outcomes of an unbiased sample of PhDs every five years, for example. The CSWA report performs survival analysis using the 1992/2003 and 2003/2013 survey pairs, but the format of the survey can’t guarantee that those counted as grads or postdocs in one survey year are the same individuals counted at the next career stage a decade later. (A precocious senior grad student in 2003 could well be tenured by 2013, but would not be counted as “surviving” into the assistant professor stage.)

A more direct means of assessing the width of the pipeline is to compare the relative proportions of grads, postdocs, and assistant professors at each survey interval and assume a steady state. That is, if we keep graduating PhDs, hiring postdocs, and hiring assistant professors/research staff at the rate we are today, what is the oversupply ratio? This is the value of most interest to students already in the pipeline, as it indicates the amount of competition for permanent jobs. (This implicitly assumes that the net flux of international astronomers into the US is zero over all career stages.)

In [15]:
plt.plot(df.index,(df["grad"]/df["assistant"]),"+-",label="Grads per new prof")
plt.plot(df.index,(df["postdoc"]/df["assistant"]),"+-",label="Postdocs per new prof")
plt.xlabel("Years")
plt.ylabel("Steady-state oversupply ratio")
outside_legend()

If we take these numbers at face value, they suggest that current postdocs face a one in four chance of getting a permanent position in astronomy, while there are more than seven current grad students for every new professor. That’s somewhat worse than implied by older data gleaned from the AAS job register (which has its own biases, particularly in undercounting postdocs). Whether this trend is real, a sampling issue, or a lingering artifact of the financial crisis is not clear.

Current and prospective students need to know these numbers and trends to make well-informed career choices.  We should also improve graduate training to prepare students for a wide variety of careers, since the majority of astronomy PhDs won’t get permanent jobs in the field.  I’ll outline some possibilities in a future post.

The Best Books I Read in 2013

Following the 2011 and 2012 editions, here are the most interesting books I read this year:

When God Talks Back: Understanding the American Evangelical Relationship with God
T. M. Luhrmann
How do evangelical Christians in modern America come to believe that God speaks to them directly, as individuals? A tour de force of scholarship, synthesizing history, anthropology, and psychology, with much to offer skeptic and believer alike.

A Grand and Bold Thing
Ann Finkbeiner
A perceptive history of the Sloan Digital Sky Survey, perhaps the most successful astronomical project of all time.  Finkbeiner’s unsentimental rendering of its difficult birth is a clear reminder that science is above all a human endeavor.

The Signal and the Noise: Why So Many Predictions Fail–but Some Don’t
Nate Silver
Why are forecasts better in some fields than in others?  Silver draws examples from a wide range of disciplines to highlight the importance of rich data, regular feedback, and underlying causal mechanisms, and the dangers of out-of-sample predictions and overfitting.

Homeward Bound: Why Women Are Embracing the New Domesticity
Emily Matchar
A balanced examination of why affluent, well-educated young women today are dropping out of the workforce and canning, raising chickens, making meals from scratch, and parenting intensively.  Aptly captures the appeal of the back-to-basics lifestyle (and the role of the Internet in promoting it) as well as potential risks for individual women and communities.

What Money Can’t Buy: The Moral Limits of Markets
Michael Sandel
Argues that the proliferation of market solutions and economic thinking is leading us astray in addressing some moral questions better determined by community deliberation.  In an age of inequality, markets do not neutrally allocate goods, and in some cases applying market logic undermines other values we would like to encourage.  Attempts to persuade gently using many real examples; some are more compelling than others.

Seven Days in the Art World
Sarah Thornton
A deftly composed series of vignettes exploring the sometimes rarefied spheres of contemporary art: auctions, art school “crits,” art fairs and biennales, studio visits, and museum prize exhibitions.  Surprisingly candid interviews with minor and very major players give perspective on the politics but also the pleasures of life in the art world.

IDL Magics for the IPython Notebook

The IPython Notebook combines code, documentation, and computational results in one package that’s easy to share.  It’s proving a great way to teach, as notebooks are easy for the instructor to write and for students to modify.  Notebooks also provide seamless integration with other programming environments through extensions providing “magic functions”: short code prefixes beginning with % or %% that call other interpreters, like R or Octave.

Since many astronomers who might want to transition to Python already know IDL, I wanted to provide an %idl magic function for IPython so they could easily call existing code.  The difficult interfacing between IDL and Python was already done by Anthony Smith‘s pIDLy package.  (As a bonus, it supports the free GDL interpreter as well.)  I adapted the Octave magic code to provide %idl magic functions with consistent syntax.  You can find the code on github as ipython-idlmagic.
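
For the curious, the core of such an extension is compact. Below is a condensed sketch of how a combined line/cell magic wrapping pIDLy can be registered with IPython; this is illustrative only, not the actual idlmagic.py source, which also handles plotting, GDL fallback, and variable passing:

import pidly
from IPython.core.magic import Magics, magics_class, line_cell_magic

@magics_class
class IDLMagics(Magics):
    def __init__(self, shell):
        super(IDLMagics, self).__init__(shell)
        # pidly spawns the interpreter; pidly.IDL('gdl') would use GDL instead
        self._idl = pidly.IDL()

    @line_cell_magic
    def idl(self, line, cell=None):
        # %idl executes a single line; %%idl executes the whole cell
        self._idl(line if cell is None else cell)

def load_ipython_extension(ipython):
    ipython.register_magics(IDLMagics)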

The demonstration notebook below is available on github as well, or you can view it with nbviewer.

Installation

To begin, we install pIDLy:

pip install pidly

Then we install idlmagic:

In [ ]:
%install_ext https://raw.github.com/ebellm/ipython-idlmagic/master/idlmagic.py

Usage

When starting a new notebook, we load the magic:

In [1]:
%load_ext idlmagic
IDL not found, using GDL

(I am using GDL rather than IDL on this computer. idlmagic will first look for the idl interpreter on the search path and fall back to gdl if needed.)

Line magics

The %idl magic enables one-line execution of IDL commands in the IPython interpreter or notebook:

In [2]:
%idl print, findgen(5)
      0.00000      1.00000      2.00000      3.00000      4.00000

(Note that the %idl line magic fails with TypeError: coercing to Unicode: need string or buffer, dict found in current release versions of IPython (0.13.2 and below) due to a known bug; the github development version of IPython works as expected.)

Cell magics

Multi-line input can be entered with the %%idl cell magic:

In [3]:
%%idl
x = findgen(5)
y = x^2.
; comments are supported
print, $ ; as are line continuations
mean(y)
% Compiled module: MEAN.
      6.00000

Passing variables between Python and IDL

The mechanisms for passing variables to and from IDL are based on those in the built-in %R and %octave magics.

Variables may be pushed from Python into IDL with %idl_push:

In [4]:
msg = '  padded   string   '
import numpy as np
arr = np.arange(5)
In [5]:
%idl_push msg arr
In [6]:
%%idl
print, strcompress(msg,/REMOVE_ALL)
print, reverse(arr)
paddedstring
              4                     3                     2
              1                     0

Similarly, variables can be pulled from IDL back to Python with %idl_pull:

In [7]:
%idl arr += 1
In [8]:
%idl_pull arr
In [9]:
arr
Out[9]:
array([1, 2, 3, 4, 5])

Variables can also be pushed and pulled from IDL inline using the -i (or --input) and -o (or --output) flags:

In [10]:
Z = np.array([1, 4, 5, 10])
In [11]:
%idl -i Z -o W W = sqrt(Z)
In [12]:
W
Out[12]:
array([ 1.        ,  2.        ,  2.23606801,  3.1622777 ])

Plotting

Inline plots are displayed automatically by the IPython notebook. IDL Direct graphics are used. The optional -s width,height argument (or --size; default: 600,375) specifies the size of the resulting png image.

In [13]:
%%idl -s 400,400
plot,findgen(10),xtitle='X',ytitle='Y'
% Compiled module: WRITE_PNG.

Known issues and limitations

  • The %idl line magic fails with TypeError: coercing to Unicode: need string or buffer, dict found in current release versions of IPython (0.13.2 and below) due to a known bug; the github development version of IPython works as expected.
  • Only one plot can be rendered per cell
  • Processing for possibly unused plot output slows execution
  • Scalar variables from IDL may be returned as single-element Numpy arrays

The plotting capabilities are rather kludgy due to IDL’s old-school graphics routines.  I opted to implement Direct Graphics, which are the lowest common denominator supported by IDL and GDL.  Since I have to initialize the device before the plot call and close it afterwards, the %idl magic can only produce one plot output per notebook cell.  The plotting code produces overhead on non-plotting lines as well, unfortunately; I chose syntactic simplicity over execution speed for the time being.
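
To illustrate the approach in simplified form (this is not the actual idlmagic source; the function name and arguments here are invented for the sketch), the wrapper brackets the user’s code with Z-buffer device setup and a PNG dump:

def wrap_plot_commands(user_code, png_path, width=600, height=375):
    """Bracket IDL code so Direct Graphics render to the Z buffer,
    then dump the result to a PNG the notebook can display.
    (A full implementation would also restore the previous device.)"""
    pre = ["set_plot, 'z'",
           "device, set_resolution=[%d, %d]" % (width, height)]
    post = ["write_png, '%s', tvrd()" % png_path]
    return '\n'.join(pre + user_code.splitlines() + post)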

Stay tuned to see it in action!

Bitten by Sample Selection Bias

At this week’s meeting of the High Energy Astrophysics Division of the American Astronomical Society, I learned that one of the teams analyzing data from a NASA observatory had run into trouble with their machine learning classifier.  Their problem illustrates one of the particular challenges of machine learning in astronomy: sample selection bias.

All-sky map of Fermi sources (Nolan et al. 2012).

At first glance, it looks like a classic machine learning problem.  The Fermi Gamma-Ray Space Telescope has performed a highly sensitive all-sky survey in high-energy gamma-rays and detected nearly two thousand sources, most of which are unidentified.  It’s expected that most of these will be either pulsars–rapidly-rotating neutron stars–or active galactic nuclei (AGN), accreting supermassive black holes in distant galaxies.  Many teams (including one I work with) are interested in detecting the Fermi pulsars in radio or optical bands.  We’re interested because many of these pulsars are proving to be rare “black widow” systems, in which the pulsar is in the process of destroying its low-mass companion and it is possible to determine the mass of the neutron star.  However, telescope time is precious, so the teams needed a way to prioritize which systems to follow up first.

The pulsars and AGN are distinguished by their variability in the gamma-ray band and by the presence or absence of curvature in their spectra.  It seems obvious, then, to train a classifier on the data for the known systems and then use it to predict the class of the unknown systems, and several groups did just that.

Spectral curvature vs. variability for Fermi sources by source class (Nolan et al. 2012).

However, when the Pulsar Search Consortium searched the newly-prioritized list, they found fewer new pulsars, not more.  The problem turned out to be a bias in the training set: the known sources used for training are brighter than the unknown ones being predicted.  In this case, the major effect was that the fit of the spectral curvature was inconclusive for dim sources–there weren’t enough counts to tell a curved line from a straight one, so the features fed into the classifier weren’t reliable.  As usual, garbage in, garbage out.

In astronomy, this situation is quite common: the labelled data you’d like to train a classifier on is often systematically different from the data you’re trying to classify.  It may be brighter, nearer, or observed with a different instrument, but in any case blindly extrapolating to new regimes is unlikely to yield reliable results.  Joey Richards and his collaborators describe this problem of sample selection bias in great detail in their excellent paper, where they focus on the challenge of classifying different types of variable stars from their light curves.  They find that iterative Active Learning approaches are most effective at developing reliable classifiers when the unlabelled and labelled data may be drawn from different populations.  In Active Learning, the classifier identifies the unlabelled instances whose classification would most improve the performance of the classifier as a whole.  These targets can then be followed up in detail, classified, and the process repeated.
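
As a toy illustration of the core idea, here is one common active-learning heuristic, uncertainty sampling, sketched with scikit-learn (the Richards et al. paper evaluates more sophisticated strategies; the function and variable names below are mine):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def most_informative(X_labelled, y_labelled, X_unlabelled, n_queries=10):
    # Train on the (possibly biased) labelled sample
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_labelled, y_labelled)
    # Rank unlabelled sources by the margin between their two most probable
    # classes; a small margin means the classifier is least certain there
    proba = np.sort(clf.predict_proba(X_unlabelled), axis=1)
    margin = proba[:, -1] - proba[:, -2]
    return np.argsort(margin)[:n_queries]

The returned sources are the ones to observe and label next before retraining.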

This approach worked well for the variable star problem, where the features used in the classifier were valid for all the sources.  For the Fermi problem, the challenge is that one of the most informative parameters is unreliable for a subset of the sources.  In this case it might be more useful to develop additional features that could identify spectral characteristics even in the low-flux regime.

Making Space in LaTeX Documents

A recent major proposal deadline gave me a chance to brush up on my LaTeX skills. As a rule, it’s better to make your proposal more concise than to play formatting tricks to squeeze more text in. For this proposal, though, I needed the big guns–for some sections the instructions alone were a significant fraction of the allotted space! Below are some tested methods for cramming more material into your page limit:

  1. Choose your text size. As a first step, be sure your text is set to the minimum allowed point size.
    \documentclass[11pt]{article}
  2. Get the margins right. The geometry package provides the easiest means to specify margins.
    \usepackage[paper=letterpaper, margin=1in, 
           nohead, pdftex]{geometry}

    If there aren’t firm margin requirements, the fullpage package is an alternative:

    \usepackage[cm]{fullpage}
  3. Compress your lists. Standard LaTeX list environments leave lots of whitespace between the items. The paralist package provides compactitem and compactenum, drop-in replacements for itemize and enumerate.
    \usepackage{paralist}
    
    \begin{compactitem}
    \item Item text.
    \end{compactitem}
  4. Use run-in headers. If your document is of any length, it’s helpful to organize it into parts, sections, subsections, and possibly even subsubsections. Standard LaTeX classes set each of these as a large heading on its own line. The titlesec package provides an alternative: run-in headers, which appear in-line with the text, saving space. You can adjust the format and numbering with the \titleformat command. The commands below set up small-caps part headers on their own lines (“hang”), and variously sized run-in bold headings for sections.
    \usepackage[compact,medium]{titlesec}
    \titleformat{\part}[hang]
    {\Large\scshape}{}{0pt}{}
    \titleformat{\section}[runin]
    {\large\fontseries{b}\selectfont\filright}{}{0pt}{}
    \titleformat{\subsection}[runin]
    {\normalfont\fontseries{b}\selectfont\filright}{}{0pt}{}
    \titleformat{\subsubsection}[runin]
    {\fontshape{it}\selectfont\filright}{}{0pt}{}
  5. Use superscript citations. If your field allows it, no citation notation uses less space than superscripted numbers. The natbib package makes it easy.
    \usepackage[super,sort&compress]{natbib}
    \bibpunct{}{}{,}{s}{}{,}
  6. Put captions beside floats. Figures and tables can end up with unused whitespace on the sides. The excellent LaTeX Wikibook provides several suggestions, including using the wrapfig package to wrap the text around the float or using a minipage or the sidecap package to move the caption beside the figure. I have used floatrow to accomplish the same task:
    \usepackage{floatrow}
    
    \begin{figure}
    \floatbox[{\capbeside\thisfloatsetup{capbesideposition={right,center},capbesidewidth=0.46\textwidth}}]{figure}[\FBwidth]
    {\caption{
    {\small
    Caption text here.
    \label{fig:myfig}
    }}}
    {\includegraphics[width=0.52\textwidth]{myfig.pdf}}
    \end{figure}
  7. Single-space the bibliography. The code below removes the bibliography section label and single-spaces the entries: put it in the document header.
    % no title on bibliography header: this duplicates the section of
    % article.cls, removing the refname section
    \makeatletter
    \renewenvironment{thebibliography}[1]{%
    \list{\@biblabel{\@arabic\c@enumiv}}%
    {\settowidth\labelwidth{\@biblabel{#1}}%
    \leftmargin\labelwidth
    \advance\leftmargin\labelsep
    \@openbib@code
    \usecounter{enumiv}%
    \let\p@enumiv\@empty
    \renewcommand\theenumiv{\@arabic\c@enumiv}}%
    \sloppy
    \clubpenalty4000
    \@clubpenalty \clubpenalty
    \widowpenalty4000%
    \sfcode`\.\@m}
    {\def\@noitemerr
    {\@latex@warning{Empty `thebibliography' environment}}%
    \endlist}
    \makeatother
    
    % make bibliography single-spaced
    \let\oldthebibliography=\thebibliography
    \let\endoldthebibliography=\endthebibliography
    \renewenvironment{thebibliography}[1]{%
    \begin{oldthebibliography}{#1}%
    \setlength{\parskip}{0ex}%
    \setlength{\itemsep}{0ex}%
    }%
    {%
    \end{oldthebibliography}%
    }
  8. Use or make a compact bibliography style. Exclude anything you can from the reference list. I made a BibTeX style file which defaults to “et al.” anytime there is more than one author.
  9. Discourage floats from getting their own pages. LaTeX uses a number of numeric weights to calculate where to position floats. Fiddling these parameters in the document header will encourage LaTeX to place them closer to each other and the text.
    % discourage floats from getting their own page
    \renewcommand\floatpagefraction{.9}
    \renewcommand\topfraction{.9}
    \renewcommand\bottomfraction{.9}
    \renewcommand\textfraction{.1}   
    \setcounter{totalnumber}{50}
    \setcounter{topnumber}{50}
    \setcounter{bottomnumber}{50}
    
    % shrink space between/after figures:
    \setlength{\textfloatsep}{10pt plus 1.0pt minus 2.0pt}
    \setlength{\floatsep}{10pt plus 1.0pt minus 2.0pt}
    \setlength{\intextsep}{10pt plus 1.0pt minus 2.0pt}
  10. Vacuum up the extra whitespace. Several more header parameters adjust white space between document elements.
    % Reduce space between section titles
    % Arguments are space before, vertical space, and space after
    \titlespacing{\part}{0pt}{*0}{2ex}
    \titlespacing{\section}{0pt}{2pt}{1ex}
    \titlespacing{\subsection}{0pt}{2pt}{1ex}
    \titlespacing{\subsubsection}{0pt}{*0}{1ex}
    
    % suck up extra white space
    \setlength{\parskip}{0pt}
    \setlength{\parsep}{0pt}
    \setlength{\headsep}{0pt}
    \setlength{\topskip}{0pt}
    \setlength{\topmargin}{0pt}
    \setlength{\topsep}{0pt}
    \setlength{\partopsep}{0pt}
  11. (Adjust or remove indentation.) Changing the paragraph indentation level can recover a few characters, but it makes the text harder to scan.
    \setlength{\parindent}{0in}
  12. (Black Hat: Shrink the inter-line spacing.) While I’m not comfortable with this measure and it makes the text hard to read, it is possible to make the line spacing less than one.
    \linespread{0.8}

As usual with LaTeX, there are multiple ways to accomplish the same goals–these are methods I personally have found convenient. The savetrees package provides an all-in-one solution which may be sufficient if you don’t want to tune the document style yourself.

The Data Science Core Curriculum

Jake Klamka spoke at Caltech a few months back about his Insight Data Science Fellowship–a program designed to help science PhDs transition into jobs in data science.  The program guides scientists in packaging skills they already have so that employers can easily see the relevance and value to their business.  Jake’s own initial difficulty getting a tech job inspired the program–after working as a particle physicist, he didn’t have any resources to help him determine which skills were important to learn and to help him get unstuck when he had problems.

There is no standard data science job, so there isn’t a standard set of skills for data scientists.  Still, Jake identified a common set of basic skills for scientists to build first as a foundation:

  • Python.  In many ways the lingua franca of programming today, Python is an excellent all-in-one tool for everything from scripting to web programming to statistical analysis.  R is a useful second language.
  • Databases.  SQL is a must.  The Hadoop ecosystem is important, but it’s probably more practical to learn on the job.
  • Visualization, particularly with d3.js. Scientists understand the importance of visualization in making sense of data; d3 is a modern, web-first framework for building interactive visualizations.  (Bonus: it helps you learn some JavaScript.)
  • Computer Science fundamentals.  Most scientists have no formal CS training, but knowledge of basic algorithms and data structures is vital for working with large datasets.
  • Machine Learning.  A high-level understanding of what’s possible will let you get started.

To these, I might add familiarity with the basic tools of software development (“software carpentry”), particularly version control and unit testing.  Scientists presumably have ample experience with statistics and quantitative methods.

Data science in practice encompasses a huge and growing range of tools and techniques, but this core curriculum provides a manageable start.  We live in a fortunate time–there are many excellent free online courses so you can learn these skills now, on your own time–and they’ll even be valuable in your current academic job!

Tracing the Changing State of the Union with Text Analysis

U.S. Presidents since George Washington have delivered State of the Union addresses each year to describe the nation’s condition and prioritize future action.  Can we glean historical patterns from the texts?  Do presidents speak similarly in times of war or depression?  Do Republicans and Democrats emphasize different words?  How does the evolution of American English affect the speeches?

To explore these questions, I performed textual analysis of all of the State of the Union Addresses given through 2012.  (For those interested, technical details are at the end of the post.)  I broke the addresses into words, removing words like “the” and “and” and weighting uncommon words more highly.  Then I measured the similarity of each pair of presidents by counting the words they used in common.
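
In outline, this kind of analysis takes only a few lines with scikit-learn. The sketch below is illustrative rather than my exact pipeline; it assumes presidents is a chronological list of names and addresses is a dict mapping each name to the concatenated text of his speeches:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF drops stop words ("the", "and", ...) and up-weights rare words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform([addresses[p] for p in presidents])

# similarity[i, j] is large when presidents i and j share vocabulary
similarity = cosine_similarity(tfidf)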

This figure shows the pairwise similarities for presidential State of the Union addresses–click for an interactive version.  Blue squares indicate that the presidents used few words in common, while white and red imply more overlap in vocabulary.  The red diagonal line is the similarity of each president to himself, which is obviously 100%.  I’ve color-coded the presidents by their political party.

Several interesting trends appear.  First, the dominant effect is that presidents nearer in time use more similar words.  Second, there seems to be a general separation before and after the early part of the 20th century (Wilson-Hoover): there is a great deal of overlap among the early presidents and among the post-WWII presidents, but less similarity between those groups.

We can look for bulk similarity and difference by aggregating the similarities for each president and looking at the range of values (excluding the self-comparison):

Few presidents seem to have exerted unusual influence on later addresses.  However, some presidents use noticeably different language than their peers, including John Adams, James Polk, Warren Harding, and George W. Bush.

Finally, each president’s words define a direction in vocabulary space.  We can project the presidents’ addresses into two dimensions, spacing the points by their relative distances in vocabulary space:

Closer presidents in this projection are more similar, so we can easily see that temporal order rather than political party seems to be the dominant effect influencing the language that presidents use in their State of the Union addresses.
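
A projection like this can be computed with multidimensional scaling (MDS); a brief sketch, reusing the similarity matrix from the snippet above (again, this names the general technique rather than my exact code):

from sklearn.manifold import MDS

# Convert similarities into dissimilarities and embed them in the plane
distance = 1.0 - similarity
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(distance)  # one (x, y) point per president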


Python for IDL Users I: Ecosystem

Python is often the language of choice for today’s cutting-edge astronomical software.  Scientists wishing to take advantage of this powerful and growing ecosystem face the hurdle of learning a new programming language.  Thankfully, with the rapid growth of scientific Python, a number of excellent comprehensive tutorials have been developed, many aimed particularly at astronomers.

These tutorials are all great resources for getting up to speed in Python.  Here, I’d like to cover similar ground (installation, packaging, language features, etc.), but focusing explicitly on how Python differs from the language most astronomers know best: IDL.  My hope is that direct comparison and contrast with a familiar language will make the learning curve easier.

Installation

The installation process for IDL is more straightforward than that of Python, and that fact likely deters busy scientists from even giving Python a try [1].  IDL programs run on a proprietary interpreter sold by Exelis (formerly ITT, formerly RSI).  Accordingly, your system administrator has to manage software licenses, so they have likely already installed IDL on public servers and maybe even set your IDL_PATH to point to the IDL Astronomy User’s Library.  In this case, you don’t install IDL at all!  Likewise, installing IDL on your laptop generally just requires a simple point-and-click installer, configuring a license file, and setting your IDL_PATH to point to appropriate libraries, which are just directories of IDL programs.

Python, in contrast, is free software that is under active development.  Moreover, the packages useful for science are not part of the core language runtime and are developed independently.  This situation is an advantage for Python insofar as it enables a broad software ecosystem, but it makes installation more complex.  (Also, Python packages often glue in existing C, C++, and FORTRAN libraries which need to compile during installation.)

One additional complication–hopefully to resolve itself naturally soon–is the Python version split.  Python 3 introduced backwards-incompatible changes to the language.  Since most of the power of Python is in the external packages, scientific migration to Python 3 has been slow.  Accordingly, for now it’s best to continue working in Python 2.7 (the current and final pre-Python 3 release) rather than Python 3.3 (the newest version).

There are several methods of installing Python and its external packages on Mac OS X and Linux; I have ordered these roughly by popularity.  (I don’t have enough experience with Python on Windows to make recommendations.)

  • Use a package manager.  In my opinion, this is the best choice for long-term Python use.  Mac OS X programs like Macports (or familiar Linux equivalents like apt-get or yum) make it easy to keep installed packages up-to-date.  These work primarily on the command line (and install non-Python software as well).  You choose which packages to install (e.g., sudo port install py27-numpy), and later a simple sudo port selfupdate && sudo port upgrade outdated will bring all of your libraries up to date.  The package manager handles all the tricky details of library dependencies, compilers, and paths for you.  An excellent step-by-step guide for installing astronomical Python via Macports is here.
  • Use an all-in-one installer.  Many astronomers appreciate the one-click convenience provided by the Enthought Python Distribution (EPD) [2], which bundles together many (but not all) of the major scientific python libraries.  I have personally found it convenient for getting a new version of python on shared servers where I didn’t have root access.  However, EPD costs $199 for a single-user license.  While this fee supports further Python development, it means that the EPD distribution requires license files, even for the free version offered to academics.  My experience has been that this licensing complicates upgrading packages or EPD itself and negates Python’s advantage of being free to install on all of your systems.
  • Maintain your installations manually.  It’s possible–but far more complicated–to maintain your packages manually: not all programs are packaged in the same way, and you have to manage dependencies, upgrades, and paths yourself.  If you go this route, pip is currently the installation method of choice, and virtualenv is recommended to create isolated Python environments.  See the installation section of this guide for more tips.
  • Use the built-in system Python.  Not recommended!  While both Mac OS X and Linux are likely to have system versions of Python installed, they may change in system upgrades and break your code.  Moreover, the packages you install may cause conflicts with routines your computer depends on.  If you just want to play with the language, though, simply typing python in a terminal will give you a way to start.

Next time: IDL and Python, head to head!

Getting a new OS X Finder window in the current Space

Mac OS X provides multiple desktops via the Spaces feature.  I use this feature all the time to keep my mental workspace uncluttered: typically I have one Space with a web browser, one or two more with terminals (iTerm2 is great), and another with documents or presentations.

In Apple’s Spaces model, if you click on the icon of an already-running program in the Dock, it takes you to the Space where that application’s windows are.  For example, if I’m in Space 3 and have iTunes open in Space 1, clicking the iTunes icon will move me to Space 1.

This automatic space-switching behavior is not what I want with Finder.  When I click on the Finder in the Dock I want a new Finder window in the current space.  What happens instead is unpredictable.  If I have no Finder windows open in any space, I get what I want:  a new Finder window where I am now.  However, if I have forgotten Finder windows in other Spaces, OS X will zip me away from what I’m doing to those old Finder windows.  That unexpected switch destroys whatever sense of flow I have as I drag back to my current Space.

Two methods for getting a new Finder window in the current Space don’t work well for me personally.  The first is to right-click on the Finder icon in the Dock and select “New Finder Window.”  The second is to switch to Finder (either by Cmd-Tab or by clicking on the desktop) and then hit Cmd-N.  Both methods are too cumbersome for me, though they might satisfy you.

Instead, I found an AppleScript method to produce a new Finder window.  I modified the code slightly to give focus to the new window and to start in my chosen directory:

on run
    tell application "Finder"
        set NewWindow to make new Finder window
        set target of NewWindow to "Macintosh HD:Users:username:path"
        activate
    end tell
end run

Open AppleScript Editor (via Spotlight is easiest) and paste in the code.  (Replace username and the path with your own.)  Save the file as something like new_finder_applescript in your Applications folder.  Set the format to Application and make sure that “Stay open after run handler” is unchecked.

Next, let’s replace the default AppleScript icon with something better.  I like this CC-licensed Finder icon by Gordon Irving.  Download the .icns version.  Open your Applications folder, Ctrl-click (or right-click) on new_finder_applescript, and select “Get Info.”  In the top left will be the default “rolled paper” AppleScript icon.  Drag and drop the new icon file onto the default icon…

and the new icon should appear.

Finally, drag the new_finder_applescript icon into the Dock.  I like to place it just below the standard Finder (which can’t be removed from the Dock without terminal hacking).

Now, a quick click on your new icon will give you a Finder window in the current Space, no matter what!  It’s possible to bind your script to a hotkey with third-party programs like Quicksilver or FastScripts as well.

Four Short Links: An Appreciation

Twitter, Prismatic, Google Reader, and Hacker News keep me up to the minute on the tech news I care about.  While I tune my feeds to give me more signal and less noise, popular links frequently repeat: five people I follow on Twitter link to a story, then I find it on top of my Prismatic feed, and later it’s on the front page of Hacker News.

I therefore treasure orthogonal sources that break out of that echo chamber.  The “Four Short Links” series on the O’Reilly Radar blog is one of my favorites.  Every day Nat Torkington offers four links to under-the-radar news items joined with description and commentary of masterful brevity.  It holds pride of place among my RSS subscriptions; maybe you’ll like it, too.