CIS 700: Algorithms for Big Data
Information

 What? This class will give you a biased sample of techniques for scalable data analysis. The target audience is students interested in algorithms, statistics, machine learning, data mining and related areas.
 Who? Grigory Yaroslavtsev.
 When? Fall 2015, MW 10:30 – 12:00
 Where? Towne 307.
 Need permission? Please send me an email listing relevant coursework you have taken. In most cases permission will be granted.
 Office hours? By appointment.
 Prerequisites? There are no formal prerequisites, but you should be familiar with the basics of algorithm design and analysis, discrete mathematics, and probability, and have general mathematical maturity.
 What's next? Consider starting a research project using techniques you learned in this class. You can also try some open problems: sublinear.info

Sketch^{*}
Increasingly large datasets are being collected by governments, private companies and research institutions. This motivates growing interest in the design and analysis of algorithms for rigorous processing of such data. In this class we will consider algorithms for scenarios in which the size of the data is too large to fit into the main memory of a single machine. The two main paradigms of computation we will focus on are massively parallel computation (applicable to frameworks such as Yahoo!'s Hadoop, Google's MapReduce, Apache Spark and Microsoft's Dryad) and streaming algorithms (Apache Storm and Spark Streaming).
This class covers a rapidly evolving area. There is no textbook. Some related classes at other universities:
^{*} Usually there is an abstract here.
Lectures
This class will focus on fundamental principles of algorithm design for big data processing.
Most of these principles are very robust to the choice of a specific platform, system or computational model.
Thus, the lines drawn between different parts of this class are sometimes blurry and only serve as an approximate guideline.

Part 1: Streaming Algorithms
Data streams represent a large dataset as an arriving online sequence of updates to its entries.
Streaming algorithms extract only a small amount of information about the dataset (a "sketch"), which approximately preserves its key properties.
Such algorithms are typically allowed to make only one pass over the data (or very few passes).
The typical goal of algorithm design is to minimize the number of passes and space, while achieving the best possible approximation guarantee.
All algorithms discussed in this part are based on linear sketching, a powerful technique with multiple applications that go beyond streaming.
 Week 1. [Slides (pptx, pdf)]
Introduction. Probability basics. Approximate counting, Morris's algorithm.
 Robert Morris: "Counting large numbers of events in small registers." Communications of the ACM, 1978.
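As an illustration (not part of the course materials), Morris's counter can be sketched in a few lines of Python. The function names below are my own; the algorithm keeps only a register X, increments it with probability 2^{-X}, and outputs 2^X - 1, which is an unbiased estimate of the number of events:

```python
import random

def morris_count(n_events, rng):
    """Process n_events increments with Morris's approximate counter."""
    x = 0
    for _ in range(n_events):
        # increment the register X with probability 2^{-X}
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2.0 ** x - 1  # unbiased: E[2^X - 1] = n_events

def morris_mean(n_events, copies, seed=0):
    """Average independent copies to reduce the variance of the estimate."""
    rng = random.Random(seed)
    return sum(morris_count(n_events, rng) for _ in range(copies)) / copies
```

A single counter has standard deviation on the order of the count itself, so averaging (or a median-of-means of) independent copies is what makes the estimate useful in practice.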
 Week 2. [Slides (pptx, pdf)]
Approximating the median. AMS sketching. Frequency moments via AMS. F_{2} moment via 4-wise independent hashing. Distinct elements.
 Philippe Flajolet, Nigel Martin: "Probabilistic Counting Algorithms for Data Base Applications." Journal of Computer and System Sciences, 1985.

Noga Alon, Yossi Matias, Mario Szegedy: "The Space Complexity of Approximating the Frequency Moments." Journal of Computer and System Sciences, 1999.
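As a hedged sketch (the class name and the degree-3 polynomial hash construction are my own illustrative choices), the AMS F_{2} estimator maintains counters z_r = Σ_i s_r(i)·f_i with 4-wise independent random signs s_r, so that E[z_r²] = F_2:

```python
import random

P = 2 ** 31 - 1  # prime modulus for polynomial hashing

class AMSSketch:
    """AMS sketch: z_r = sum_i s_r(i) * f_i with 4-wise independent signs."""
    def __init__(self, reps, seed=0):
        rng = random.Random(seed)
        # a random degree-3 polynomial over GF(P) gives 4-wise independence
        self.polys = [[rng.randrange(P) for _ in range(4)] for _ in range(reps)]
        self.z = [0] * reps

    def _sign(self, poly, i):
        h = 0
        for c in poly:  # Horner evaluation of the polynomial at i
            h = (h * i + c) % P
        return 1 if h % 2 == 0 else -1

    def update(self, i, delta=1):
        for r, poly in enumerate(self.polys):
            self.z[r] += self._sign(poly, i) * delta

    def estimate_f2(self):
        # E[z_r^2] = F_2; averaging repetitions reduces the variance
        # (the paper uses a median-of-means for high-probability bounds)
        return sum(z * z for z in self.z) / len(self.z)
```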
 Week 3. [Slides (pptx, pdf)]
Count-Min sketch, heavy hitters. L_{2} sampling.
 Graham Cormode, S. Muthukrishnan: "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications." LATIN 2004.
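A minimal Python sketch of the Count-Min data structure (class and hash choices are mine; any pairwise-independent hash family works). Each of the depth rows hashes an item to one of width counters; with nonnegative updates, the minimum over rows never underestimates the true frequency:

```python
import random

class CountMinSketch:
    """depth x width counter table with one pairwise-independent hash per row."""
    P = 2 ** 31 - 1

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(depth)]

    def _bucket(self, a, b, i):
        return (a * i + b) % self.P % self.width

    def update(self, i, delta=1):
        for row, (a, b) in zip(self.table, self.hashes):
            row[self._bucket(a, b, i)] += delta

    def query(self, i):
        # overestimate only: taking the min over rows limits collision error
        return min(row[self._bucket(a, b, i)]
                   for row, (a, b) in zip(self.table, self.hashes))
```

With width ~ 2/ε and depth ~ log(1/δ), the additive error is at most εN with probability 1 - δ, which is what makes heavy-hitter detection possible.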
 Week 4. [Slides (pptx, pdf)]
L_{0}sampling. L_{1} sparse recovery. CountSketch.
 Hossein Jowhari, Mert Sağlam, Gábor Tardos: "Tight Bounds for L_{p} Samplers, Finding Duplicates in Streams, and Related Problems." PODS 2011.
 Moses Charikar, Kevin Chen, Martin Farach-Colton: "Finding Frequent Items in Data Streams." Theoretical Computer Science, 2004; ICALP 2002.
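For contrast with Count-Min, here is an illustrative CountSketch (again, class and hash choices are mine): each row additionally multiplies updates by a random sign, and the estimate is the median over rows, which makes it unbiased rather than a one-sided overestimate:

```python
import random
import statistics

class CountSketch:
    """Like Count-Min, but with random signs and a median estimator."""
    P = 2 ** 31 - 1

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # per row: one bucket hash (a, b) and one sign hash (c, d)
        self.hashes = [tuple(rng.randrange(1, self.P) for _ in range(4))
                       for _ in range(depth)]

    def _loc(self, a, b, c, d, i):
        bucket = (a * i + b) % self.P % self.width
        sign = 1 if (c * i + d) % self.P % 2 == 0 else -1
        return bucket, sign

    def update(self, i, delta=1):
        for row, h in zip(self.table, self.hashes):
            bucket, sign = self._loc(*h, i)
            row[bucket] += sign * delta

    def query(self, i):
        ests = []
        for row, h in zip(self.table, self.hashes):
            bucket, sign = self._loc(*h, i)
            ests.append(sign * row[bucket])
        return statistics.median(ests)
```

The error of a single row scales with the L_{2} norm of the residual frequencies rather than the L_{1} norm, which is the key improvement over Count-Min for skewed streams.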
 Week 5. [Slides (pptx, pdf)]
Dimension reduction, Johnson-Lindenstrauss transform. Hanson-Wright inequality. Sparse and fast JL transform.
 William B. Johnson, Joram Lindenstrauss: "Extensions of Lipschitz mappings into a Hilbert space." Conference in Modern Analysis and Probability, 1982.

Sanjoy Dasgupta, Anupam Gupta: "An elementary proof of a theorem of Johnson and Lindenstrauss." Random Structures & Algorithms, 2003.
 David Hanson, Farroll Wright: "A bound on tail probabilities for quadratic forms in independent random variables." Annals of Mathematical Statistics, 1971.
 Nir Ailon, Bernard Chazelle: "The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors." SIAM Journal on Computing, 2009.
 Daniel Kane, Jelani Nelson: "Sparser Johnson-Lindenstrauss Transforms." Journal of the ACM, 2014.
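The basic (dense Gaussian) JL transform is a one-liner; the sparse and fast variants in the references above replace the Gaussian matrix with structured ones, but the interface is the same. A minimal sketch, with the function name being my own:

```python
import numpy as np

def jl_transform(X, k, seed=0):
    """Project the d-dimensional rows of X down to k dimensions
    with a scaled i.i.d. Gaussian matrix (dense JL transform)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ G
```

Taking k = O(log n / eps^2) preserves all pairwise distances among n points up to a 1 ± eps factor with high probability, independently of the original dimension d.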
 Week 6. [Slides (pptx, pdf)]
Graph sketching: connectivity, k-connectivity, bipartiteness, minimum spanning tree, min-cut and cut sparsifiers.

Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang: "On graph problems in a semi-streaming model." Theoretical Computer Science, 2005.

Kook Jin Ahn, Sudipto Guha, and Andrew McGregor: "Analyzing graph structure via linear measurements." SODA 2012.
 Week 7. [Guest lecture by Sepehr Assadi]
Linear sketches for approximate matchings.
 Sepehr Assadi, Sanjeev Khanna, Yang Li, Grigory Yaroslavtsev: "Tight Bounds for Linear Sketches of Approximate Matchings", SODA 2016.

Part 2: Selected Topics
This part covers an assortment of selected topics in algorithms for large matrices and vectors, convex optimization and compressed sensing.
 Week 1. [Slides (pptx, pdf)]
Subspace embeddings, L_{2} regression. Leverage score sampling, computing approximate leverage scores.
 David P. Woodruff: "Sketching as a Tool for Numerical Linear Algebra." Foundations and Trends® in Theoretical Computer Science, 2014.
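As an illustration of the sketch-and-solve paradigm (function name mine; the survey also covers much faster sketches such as CountSketch-based subspace embeddings), a Gaussian sketch S with m = O(d/eps^2) rows embeds the column span of A, so solving the small sketched problem gives a (1+eps)-approximate least-squares solution:

```python
import numpy as np

def sketched_lstsq(A, b, m, seed=0):
    """Sketch-and-solve L2 regression: solve min_x ||S(Ax - b)||
    for a Gaussian subspace embedding S with m rows."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x
```

The sketched problem has only m rows instead of n, so the final solve is cheap; the cost shifts to computing S·A, which is where structured sketches improve on dense Gaussians.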
 Week 2.
[Slides (pptx, pdf)]
Smooth convex optimization, gradient descent, accelerated gradient descent, stochastic gradient descent.
 Yurii Nesterov: "A method of solving a convex programming problem with convergence rate O(1/k^{2})." Soviet Mathematics Doklady, 1983
 Stephen Boyd, Lieven Vandenberghe: "Convex Optimization", Cambridge University Press.
 Sébastien Bubeck: "Convex Optimization: Algorithms and Complexity", Foundations and Trends in Machine Learning, 2015.
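Plain gradient descent for an L-smooth function is a few lines; accelerated and stochastic variants change only the update rule. A minimal sketch (function name mine), using the standard fixed step size 1/L:

```python
import numpy as np

def gradient_descent(grad, x0, step, iters):
    """Plain gradient descent with a fixed step size.
    For an L-smooth objective, step = 1/L guarantees convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x
```

For example, for least squares f(x) = ||Ax - b||^2 / 2 the gradient is A^T(Ax - b) and the smoothness constant L is the largest eigenvalue of A^T A.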
 Week 3.
[Slides (pptx, pdf)]
Compressed sensing, restricted isometry property.
 Avrim Blum, John Hopcroft, Ravi Kannan: "Foundations of Data Science", Chapter 10.3. 2015.

Emmanuel J. Candès, Justin K. Romberg, Terence Tao: "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information.", IEEE Transactions on Information Theory, 2006.
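As a hedged illustration of sparse recovery (orthogonal matching pursuit is one simple recovery algorithm; the references above analyze L_{1} minimization, and the function name here is mine): given y = Ax for a k-sparse x and a random measurement matrix A satisfying a restricted isometry property, greedy support selection recovers x exactly with high probability:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily grow the support of a
    k-sparse solution, re-solving least squares on the chosen columns."""
    support, r = [], y.astype(float).copy()
    for _ in range(k):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ r)))
        if j not in support:
            support.append(j)
        xs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A[:, support] @ xs
    x = np.zeros(A.shape[1])
    x[support] = xs
    return x
```

With Gaussian measurements, m = O(k log n) rows suffice for such recovery, far fewer than the ambient dimension n.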

Part 3: Massively Parallel Algorithms
In massively parallel computational systems (clusters), the data is partitioned between a large number of identical machines connected via a high-speed network.
An algorithm proceeds in synchronous rounds, each consisting of local computation performed by each machine followed by an exchange of information through the network.
The typical goal of algorithm design is to minimize the number of synchronous rounds, while optimizing the time/space, communication, approximation, etc.
 Week 1.
[Slides (pptx, pdf)]
Model for massively parallel computation. Sorting. Graph connectivity and applications. Algorithms for geometric graph problems.
 Jeff Dean, Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM 2008, OSDI 2004.

Howard Karloff, Siddharth Suri, Sergei Vassilvitskii: "A Model of Computation for MapReduce." SODA 2010.
 Owen O'Malley: "TeraByte sort on Apache Hadoop." 2008.
 Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, Grigory Yaroslavtsev: "Parallel algorithms for geometric graph problems." STOC 2014.
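One synchronous round of the MapReduce-style model can be simulated in a few lines (a toy single-machine sketch, not how the real frameworks are implemented): each input is mapped to key-value pairs, pairs are exchanged ("shuffled") so that each key lands on one machine, and a reducer aggregates each group:

```python
from collections import defaultdict
from itertools import chain

def map_reduce_round(inputs, mapper, reducer):
    """Simulate one synchronous round: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(mapper, inputs)):
        groups[key].append(value)  # the shuffle / network-exchange step
    return {key: reducer(key, values) for key, values in groups.items()}

# classic word count in one round
counts = map_reduce_round(
    ["big data big", "data streams"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts == {"big": 2, "data": 2, "streams": 1}
```

Multi-round algorithms chain such rounds together, and the round count is exactly the complexity measure the model above asks you to minimize.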
 Week 2.
[Slides (pptx, pdf)]
K-means: dimension reduction and parallel algorithms.

Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii: "Scalable K-Means++." VLDB 2012.
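For reference, here is the sequential k-means++ seeding that the paper parallelizes (the paper's k-means|| reduces the number of rounds by oversampling several points per round; the function name below is mine). Each new center is sampled with probability proportional to the squared distance D² to the nearest center chosen so far:

```python
import numpy as np

def kmeans_pp_seed(X, k, seed=0):
    """k-means++ seeding: pick each new center with prob. proportional to D^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center uniformly at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

The D²-weighting is what gives the O(log k)-approximation guarantee for the subsequent Lloyd iterations; k-means|| keeps this guarantee while needing only O(log n) rounds.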

Part 4: Sublinear Time Algorithms
In this part we will focus on algorithms that have access to the entire dataset but, because its size is prohibitively large, can only make a small number of carefully chosen queries to it. The goal of algorithm design is to approximate interesting parameters of the dataset and study its properties while minimizing the number of queries and the running time.
 Week 1.
[Slides by Sofya Raskhodnikova (pptx, pdf)]
Sublinear time approximation and property testing. Testing properties of images. Testing sortedness and monotonicity. Testing connectedness and approximation on graphs.
 Oded Goldreich, Dana Ron: "Property testing in bounded degree graphs." Algorithmica 2002, STOC 1997.
 Funda Ergün, Sampath Kannan, Ravi Kumar, Ronitt Rubinfeld, Mahesh Viswanathan: "Spot-Checkers." Journal of Computer and System Sciences 2000, STOC 1998.
 Bernard Chazelle, Ronitt Rubinfeld, Luca Trevisan: "Approximating the Minimum Spanning Tree Weight in Sublinear Time." SIAM Journal on Computing 2005, ICALP 2001.
 Sofya Raskhodnikova: "Approximate Testing of Visual Properties." RANDOM 2003.
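The sortedness spot-checker (in the spirit of the Ergün et al. paper above; the function name and constants are my own illustrative choices) makes the sublinear-time idea concrete: pick a random index, binary search for the value stored there, and reject if the search does not land back on that index. A sorted array of distinct values always passes, while an array ε-far from sorted fails a probe with constant probability, so O(1/ε) probes of O(log n) comparisons each suffice:

```python
import random

def spot_check_sorted(a, eps, seed=0):
    """One-sided sortedness tester for arrays with distinct entries."""
    rng = random.Random(seed)
    for _ in range(int(2 / eps) + 1):
        i = rng.randrange(len(a))
        lo, hi = 0, len(a) - 1  # binary search for the value a[i]
        while lo <= hi:
            mid = (lo + hi) // 2
            if a[mid] == a[i]:
                break
            if a[mid] < a[i]:
                lo = mid + 1
            else:
                hi = mid - 1
        else:
            return False  # value not found along the search path
        if mid != i:
            return False  # search landed on a different index
    return True
```

Note the one-sided guarantee: "accept" is certain for sorted inputs, and only the rejection of far-from-sorted inputs is probabilistic.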
 Week 2.
[Slides (pptx, pdf)]
L_{p}-testing. Sublinear algorithms for isotonic regression.
 Piotr Berman, Sofya Raskhodnikova, and Grigory Yaroslavtsev: "L_{p}-testing." STOC 2014.
Project
Deadline: December 18, 2015. Details about the project: [Slides (pptx, pdf)].