This week I started teaching a graduate class called “Foundations of Data Science” that will be mostly based on an eponymous book by Avrim Blum, John Hopcroft and Ravi Kannan. The book is still a draft and I am using this version. Target audience includes advanced undergraduate and graduate level students. We had some success using this book as a core material for an undergraduate class at Penn this Spring (link to the news article). The draft has been around for a while and in fact I ran a reading group that used it four years back when I was in grad school and the book was called “Computer Science Theory for the Information Age”.
“Data Science” is one of those buzzwords that can mean very different things to different people. In particular, a new graduate Masters program in Data Science here at IU attracts hundreds of students from diverse backgrounds. What I personally really like about the Blum-Hopcroft-Kannan book is that it doesn’t go into any philosophy about the meaning of data science but rather offers a collection of mathematical tools and topics that can be considered as foundational for data science as seen from computer science perspective. It should be noted that just as any “Foundations of Computing” class has little to do with finding bugs in your code so do this class and book have little to do with data cleaning and other data analysis routine.
Topics
While the jury is still out on what topics should be considered as fundamental for data science I think that the Blum-Hopcroft-Kannan book makes a good first step in this direction.
Let’s look at the table of contents:
- Chapter 2 introduces basic properties of the high-dimensional space, focusing on concentration of measure, properties of high-dimensional Gaussians and basic dimension reduction.
- Chapter 3 covers the Singular Value Decomposition (SVD) and its applications (principal component analysis, clustering mixture of Gaussians, etc.).
- Chapter 4 focuses on random graphs (primarily in the Erdos-Renyi model).
- Chapter 5 introduces random walks and Markov chains, including Markov Chain Monte Carlo methods, random works on graphs and applications such as Page Rank.
- Chapter 6 covers the very basics of machine learning theory, including learning basic function classes, perceptron algorithm, regularization, kernelization, support vector machines, VC-dimension bounds, boosting, stochastic gradient descent and a bunch of other topics.
- Chapter 7 describes a couple of streaming and sampling methods for big data: frequency moments in streaming and matrix sampling.
- Chapter 8 is about clustering methods: k-means, k-center, spectral clustring, cut-based clustering, etc.
- Chapters 9 through 11 cover a very diverse set of topics that includes hidden Markov processes, graphical models, belief propagation, topic models, voting systems, compressed sensing, optimization methods and wavelets among others.
Discussion
Overall this looks like a good stab at the subject and a big advantage of this book is that unlike some of its competitors it treats its topics with mathematical rigor. The only chapter that I personally don’t really see fit into a “data science” class is Chapter 4. Because of its focus on the Erdos-Renyi model that I haven’t seen being used realistically for graph modeling applications this chapter seems to be mostly of purely mathematical interest.
Selection of some of the smaller topics is a matter of personal taste, especially when it comes to those that are missing. A couple of quick suggestions is to cover new sketching algorithms for high-dimensional linear regression, locality-sensitive hashing and possibly Sparse FFT.
Slides will be posted here and I will write a report on the final selection of topics and my experience in the end of semester. Stay tuned :)