The Data Science Core Curriculum
Jake Klamka spoke at Caltech a few months back about his Insight Data Science Fellowship, a program designed to help science PhDs transition into jobs in data science. The program guides scientists in packaging skills they already have so that employers can easily see their relevance and value to the business. Jake’s own initial difficulty getting a tech job inspired the program: after working as a particle physicist, he had no resources to tell him which skills were important to learn or to help him get unstuck when he ran into problems.
There is no standard data science job, so there isn’t a standard set of skills for data scientists. Still, Jake identified a common set of basic skills for scientists to build first as a foundation:
- Python. In many ways the lingua franca of programming today, Python is an excellent all-in-one tool for everything from scripting to web programming to statistical analysis. R is a useful second language.
- Databases. SQL is a must. The Hadoop ecosystem is important, but it’s probably more practical to learn on the job.
- Computer science fundamentals. Most scientists have no formal CS training, but knowledge of basic algorithms and data structures is vital for working with large datasets.
- Machine learning. A high-level understanding of what’s possible will let you get started.
To these, I might add familiarity with the basic tools of software development (“software carpentry”), particularly version control and unit testing. Scientists presumably have ample experience with statistics and quantitative methods.
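For readers who haven't seen unit testing before, here is a minimal sketch of what it can look like in Python with the standard library's `unittest` module. The function `running_mean` and the test values are my own hypothetical illustration, not something from Jake's talk.

```python
import unittest


def running_mean(values, window):
    """Return the mean of each consecutive `window`-sized slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    # One output per full window; a window longer than the data yields [].
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


class RunningMeanTest(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(running_mean([1, 2, 3, 4], 2), [1.5, 2.5, 3.5])

    def test_window_longer_than_data(self):
        self.assertEqual(running_mean([1, 2], 3), [])

    def test_rejects_nonpositive_window(self):
        with self.assertRaises(ValueError):
            running_mean([1, 2], 0)
```

Saved in a file, the tests can be run with `python -m unittest`. Each test pins down one small piece of behavior, so when an analysis script changes later, a failing test points to what broke.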
Data science in practice encompasses a huge and growing range of tools and techniques, but this core curriculum provides a manageable start. We live in a fortunate time: many excellent free online courses let you learn these skills now, on your own schedule, and they’ll even be valuable in your current academic job!