Big Data Reading Group

Information

What?
- This reading group will cover some of the most influential modern (last 5-10 years) developments in algorithms for large data processing.
- No background knowledge (except general mathematical maturity) is required.
When?
- Weekly meetings on Fridays, 3:30pm at Towne 311 during the Fall 2014 semester.
How can I join?
- Please, send an e-mail to Grigory Yaroslavtsev, grigory (at) grigory (dot) us to join the group's mailing list.
- Each participant will be expected to pick one of the papers for presentation and discussion with the rest of the group.
- It is expected that such presentation will cover technical parts of the paper in detail.
- Preparing slides is not necessary and it's perfectly fine if you use the paper and/or notes during the presentation.
What's next?
- Stay tuned: I will be teaching a related class next semester (Spring 2015). This reading group and the class are complementary and will cover different sets of topics. The main difference is that the class cover foundations of algorithms for massive data in more detail, while the reading group will focus on the most recent developments.
- Try some open problems: sublinear.info

Topics

Increasingly large datasets are being collected by governments, private companies and research institutions. This motivates increased interest in the design and analysis of algorithms for rigorous analysis of such data. In this reading group we will consider scenarios when the size of the data is too large to fit into the main memory of a single machine. Two main paradigms of computation that we will focus on are massively parallel computation (applicable to frameworks such as Yahoo!'s Hadoop, Google's MapReduce and Microsoft's Dryad) and streaming algorithms (Apache Storm and Spark Streaming).

Massively Parallel Algorithms
In massively parallel computational systems (clusters) the data is partitioned between a large number of identical machines connected via a high-speed network. An algorithm proceeds in synchronous rounds, each consisting of local computation performed by each machine followed by an exchange of information through the network. The typical goal of algorithm design is to minimize the number of synchronous rounds, together with optimizing the time/space, communication, etc.

Some papers suggested for reading:

Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: A Model of Computation for MapReduce. SODA 2010.
Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii: Filtering: a method for solving graph problems in MapReduce. SPAA 2011.
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii: Scalable K-Means++. VLDB 2012.
Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, Zoya Svitkina: On distributing symmetric streaming computations. SODA 2008.
Siddharth Suri, Sergei Vassilvitskii: Counting triangles and the curse of the last reducer. WWW 2011.
Bahman Bahmani, Kaushik Chakrabarti, Dong Xin: Fast personalized PageRank on MapReduce. SIGMOD 2011.

Streaming algorithms
Data streams represent a large dataset as a sequence updates to its entries. Streaming algorithms extract only a small amount of information about the data stream (a "sketch") and compute an answer based on that. Such algorithms are typically allowed to make only one pass over the data (or very few passes). The typical goal of algorithm design is to minimize the number of passes and space, while achieving a good approximation guarantee.

Some papers suggested for reading:

Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004, Imre Simon Test-of-Time Paper Award.
Edo Liberty: Simple and deterministic matrix sketching. KDD 2013, Best Paper Award.
Daniel M. Kane, Jelani Nelson, David P. Woodruff: An optimal algorithm for the distinct elements problem. PODS 2010, Best Paper Award.
Madhav Jha, C. Seshadhri, Ali Pinar: A space efficient streaming algorithm for triangle counting using the birthday paradox. KDD 2013, Best Student Paper Award.
Atish Das Sarma, Sreenivas Gollapudi, Rina Panigrahy: Estimating PageRank on graph streams. PODS 2008, Best Paper Award.

Schedule

Meeting 1, September 12: Organization, introduction. MapReduce and Streaming.
- Speaker: Grigory Yaroslavtsev.
- Slides: [].
Meeting 2, September 19: A Model of Computation for MapReduce.
- Speaker: Babak Bagheri Hariri.
- Paper: Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: A Model of Computation for MapReduce. SODA 2010.
- Slides: [].
September 26: No meeting. Warren Center Fall Faculty Symposium.
Meeting 3, October 03: Scalable K-Means++.
- Speaker: Steven Wu.
- Paper: Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii: Scalable K-Means++. VLDB 2012.
- Slides: [].
October 10: No meeting. Fall Term Break.
Meeting 4, October 17: Filtering: a method for solving graph problems in MapReduce.
- Speaker: Justin Hsu.
- Paper: Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii: Filtering: a method for solving graph problems in MapReduce. SPAA 2011.
Meeting 5, October 24: Counting Triangles in MapReduce.
- Speaker: Ryan Rogers.
- Paper: Siddharth Suri, Sergei Vassilvitskii: Counting triangles and the curse of the last reducer. WWW 2011.
- Slides: [].
Meeting 6, October 31: Linear Sketching Graphs for Streaming and Distributed Computation.
- Speaker: Sepehr Assadi.
- Paper: Andrew McGregor: Graph Stream Algorithms: A Survey.
Meeting 7, November 07: Parallel Algorithms for Geometric Graph Problems.
- Speaker: Grigory Yaroslavtsev.
- Paper: Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, Grigory Yaroslavtsev. Parallel Algorithms for Geometric Graph Problems. STOC 2014.
- Slides: [].
Meeting 8, November 14: Count-Min Sketch and Applications.
- Speaker: Steven Wu.
- Paper: Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004, Imre Simon Test-of-Time Paper Award.
- Slides: [].
Meeting 9, November 21: Simple and Deterministic Matrix Sketching.
- Speaker: Ryan Rogers.
- Paper: Edo Liberty: Simple and deterministic matrix sketching. KDD 2013, Best Paper Award.
- Slides: [].
November 28: No meeting. Happy Thanksgiving!
December 05: No meeting. NYCE and Williams Symposium.
Meeting 10, December 12: Fast Greedy Algorithms in MapReduce and Streaming.
- Speaker: Sepehr Assadi.
- Paper: Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, Andrea Vattani: Fast Greedy Algorithms in MapReduce and Streaming. SPAA 2013. Best Paper Award.

Big Data Reading Group

Information

Topics

Massively Parallel Algorithms

Streaming algorithms

Schedule