“Algorithms for Big Data” (the name can vary slightly) is a new graduate class that has been introduced by many top computer science programs in recent years. In this post I would like to share my experience teaching this class at the University of Pennsylvania this semester. Here is the homepage.
First off, let me get the most frequently asked question out of the way and say that by “big data” in this class I mean data that doesn't fit into local RAM; if the data fits into RAM, then algorithms from the standard algorithms curriculum will do the job. At the moment a terabyte of data is already tricky to fit into RAM, so this is where we draw the line. In particular, this is so that the arguments about beating big data algorithms with your laptop don't apply.
Second, I tried to focus as much as possible on algorithms that are known to work in practice and have implementations. Because this is a theory class we didn't do programming, but I made sure to give links to publicly available implementations whenever possible. As is always the case, the best algorithms to teach are never exactly the same as the best implementations: even the most vanilla problem of sorting an array in RAM is handled in the C++ STL via a combination of QuickSort, InsertionSort and HeapSort. Picking the right level of abstraction is always a delicate decision when teaching algorithms, and I am pretty happy with the set of choices made in this offering.
Finally, “Algorithms for Big Data” isn't an entirely new phenomenon as a class, since it builds on predecessors typically called “Sublinear Algorithms”, “Streaming Algorithms”, etc. Here is a list of closely related classes offered at some other schools. My version of this class consisted of four modules:
- Part 1: Streaming Algorithms. It is very convenient to start with this topic since techniques developed in streaming turn out to be useful later. In fact, I could just as well call this part “linear sketching”, since every streaming algorithm that I taught in this part was a linear sketch. I find single-pass streaming algorithms to be the best motivated, and for so-called dynamic streams, which can contain both insertions and deletions, linear sketches are known to be almost optimal under fairly mild conditions (a minimal linear sketch in Python is shown after this list). Moreover, linear sketches are the baseline solution in the more advanced massively parallel computational models studied later.
- Part 2: Selected Topics. This part became very eclectic, containing selected topics in numerical linear algebra, convex optimization and compressed sensing. In fact, some of the algorithms in this part aren't even “algorithms for Big Data” according to the RAM-size-based definition above. However, I considered these topics too important to skip in a “big data” class. For example, right after we covered gradient descent methods for convex optimization, Google released TensorFlow. This state-of-the-art machine learning library allows one to choose any of its five available versions of gradient descent for optimizing learned models (a vanilla gradient descent sketch also appears after this list). These days, when outsourcing your machine learning to the cloud can come with some pretty steep pricing, I think knowing what is under the hood of free, publicly available frameworks is increasingly important.
- Part 3: Massively Parallel Computation. I am clearly biased here, but this is my favorite. Unlike, say, streaming, where many results are already tight, we are still quite far from understanding the full computational power of MapReduce-like systems. The potential impact of such algorithms is, I think, also likely to be the highest. In this class, because of time constraints, I only touched the tip of the iceberg. This part will be expanded in the future.
- Part 4: Sublinear Time Algorithms. I have always liked clever sublinear time algorithms, but for many years I believed that they were not quite “big data”, since they operate under the assumption of random access to the data. Well, this year I had to change my mind after Google launched its Distributed Code Jam. I have to admit that I have no idea how this works at the systems level, but apparently it is possible to implement reasonably fast random access to large data. The problems that I have seen used for Distributed Code Jam allow one to use 100 nodes, each with a small amount of RAM, and the goal is to process a large dataset available via random access.
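To make the notion of a linear sketch from Part 1 concrete, here is a minimal Count-Min-style sketch in Python. This is only an illustrative sketch of the general idea, not one of the specific algorithms from the lectures; the class name, the parameters and the use of Python's built-in hash as a stand-in for proper hash functions are my own choices for the example.

```python
import random


class CountMinSketch:
    """Minimal Count-Min-style linear sketch of a frequency vector.

    The sketch is a linear function of the frequency vector, so
    insertions (+1) and deletions (-1) in a dynamic stream are handled
    uniformly, and two sketches can simply be added together.
    """

    def __init__(self, width=1024, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # Salts for the depth hash functions (Python's built-in hash
        # plus a salt is only a stand-in for real hashing here).
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, delta=1):
        # delta = +1 for an insertion, -1 for a deletion.
        for r in range(self.depth):
            self.table[r][self._bucket(r, item)] += delta

    def estimate(self, item):
        # Overestimates by at most (2/width) * ||f||_1 with probability
        # >= 1 - 2^(-depth) when all final frequencies are non-negative.
        return min(self.table[r][self._bucket(r, item)] for r in range(self.depth))

    def merge(self, other):
        # Linearity: the sketch of the union of two streams is the
        # coordinate-wise sum of the two sketches.
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]


if __name__ == "__main__":
    sketch = CountMinSketch()
    for x in ["a"] * 1000 + ["b"] * 10:
        sketch.update(x, +1)
    sketch.update("b", -1)       # a deletion in a dynamic stream
    print(sketch.estimate("a"))  # close to 1000
    print(sketch.estimate("b"))  # close to 9
```

The point to notice is linearity: updates are just signed increments, so deletions cost nothing extra and sketches computed on different parts of a stream can be combined by addition.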
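Similarly, for the convex optimization topics in Part 2, here is a vanilla gradient descent sketch for the least-squares objective. Again, this is just an illustration under my own assumptions (a fixed step size of 1/L for a smooth convex objective), not a particular method from the class or any of TensorFlow's optimizers.

```python
import numpy as np


def gradient_descent(A, b, step=None, iters=500):
    """Vanilla gradient descent for f(x) = 0.5 * ||Ax - b||^2,
    whose gradient is A^T (Ax - b).

    With a fixed step size 1/L, where L is the largest eigenvalue of
    A^T A (the smoothness constant), f(x_t) - f(x*) decays like O(1/t).
    """
    x = np.zeros(A.shape[1])
    if step is None:
        L = np.linalg.norm(A.T @ A, 2)  # largest eigenvalue of A^T A
        step = 1.0 / L
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x -= step * grad
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 10))
    x_true = rng.standard_normal(10)
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    x_hat = gradient_descent(A, b)
    print(np.linalg.norm(x_hat - x_true))  # small, roughly the noise level
```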
Overall, Parts 1 and 4 are by now fairly standard. Part 2 has some new content from David Woodruff's great new survey. Some of the algorithms from it are also available in IBM's Skylark library for fast computational linear algebra and machine learning. Part 3 is what most distinguishes this class from other similar classes.
Mental Notes
Here is a quick summary of the things I was happy with in this offering, plus potential changes for the future.
- Research insights. One of the main reasons I love teaching is that it often leads to research insights, especially when it comes to simple connections I had been missing. For example, I hadn't previously realized that one can use Lp-testing to test the convexity and Lipschitzness assumptions used in the analysis of the convergence rate of gradient descent methods.
- Projects. Overall I am very happy with the students' projects. Some students implemented algorithms, some wrote surveys and some started new research projects. Most unexpected to me were the projects done by non-theory students connecting their areas of expertise with the topics discussed in the class. For example, surveys of streaming techniques used in natural language processing and bioinformatics were really fun to read.
- Cross-list the class for other departments. It was a serious blunder on my part not to cross-list this class for other departments, especially Statistics and Applied Math. Given how much interest there is from other fields, this is probably the easiest mistake to fix and the most impactful one. Somehow, some students from other departments learned about the class anyway and expressed interest, often too late.
- New content. Because of time constraints I couldn't fit in some of the topics I really wanted to cover. These include coresets (there has been a resurgence of interest in coresets for massively parallel computing, but I didn't have time to cover them), nearest neighbor search (somehow I couldn't find a good source to teach from; suggestions are very welcome), the HyperLogLog algorithm (same reason), more algorithms for massively parallel computing (no time), and more sublinear time algorithms (no time). In the next version of this class I will make sure to cover at least some of these.
- Better structure. Overall I am pretty happy with the structure of the class, but there is definitely room for improvement. A priority will be to better integrate the selected topics of Part 2 into the rest of the class. In particular, convex optimization came a little out of the blue, even though I am really glad I included it.
- Slides and equipment. I really like teaching with slides that contain only some of the material, using the blackboard to fill in the missing details and pictures. On one hand, slides are a backbone that the students can later use to catch up on the parts they missed. On the other hand, the risk of rushing through the slides too fast is minimized, since the details are discussed on the board. A lot of time is also saved on drawing pictures. I initially used a Microsoft Surface Pro 2 to fill in the gaps on the tablet instead of the board, but later gave up on this idea because of technical difficulties. Having a larger tablet would help too. I still think the tablet can work, but it requires a better setup. Next time I will try to use the tablet again and post the final slides online.
- Assign homework and get a TA. Michael Kearns and I managed to teach “Computational Learning Theory” without a TA last semester, so I decided against getting one for this class as well. This was fine, except that having a TA to grade homework would have helped a lot.
- Make lecture notes and maybe videos. With fairly detailed slides I didn't consider lecture notes necessary. Next time it would be nice to have some, since some of my fellow faculty friends asked for them. I think I will stick with the tested “a single scribe per lecture” approach, although I have heard that in France students sometimes work collaboratively on the same file during the lecture and the result comes out nicely. When I had to scribe lectures I just LaTeXed them on the fly, so I don't see why this couldn't be done collaboratively. As for videos, Jelani had videos from his class this time and they look pretty good.
- Consider MOOCing. Given that the area is in high demand, doing a MOOC in the future is definitely an option. It would be nice to stabilize the content first, so that the startup cost of setting up a MOOC could be amortized by running it multiple times.
Thanks
I am very grateful to my friends and colleagues whose discussions helped me a lot while developing this class. Thanks to Alex Andoni, Ken Clarkson, Sampath Kannan, Andrew McGregor, Jelani Nelson, Eric Price, Sofya Raskhodnikova, Ronitt Rubinfeld and David Woodruff (this is an incomplete list; sorry if I forgot to mention you). Special thanks to all the students who took the class and to Sepehr Assadi, who gave a guest lecture on our joint paper about linear sketches of approximate matchings.