The Big Data TheoryJekyll2016-12-31T22:59:08+00:00http://grigory.us/blog/Grigory Yaroslavtsevhttp://grigory.us/blog/grigory@grigory.us<![CDATA[What's New in the Big Data Theory 2016]]>http://grigory.us/blog/whats-new-in-big-data-theory-20162016-12-30T00:00:00+00:002016-12-30T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<div align="center"><img alt="Happy 2017!" src="http://grigory.us/blog/pics/o2016.png" /> </div>
<p><br /></p>
<p>This post will give an overview of papers on theory of algorithms for big data that caught my attention in 2016.
The basic rule that I used when making the list was whether I can see these results being included into some of the advanced graduate classes on algorithms in the future.
Also, while I obviously can’t include my own results here, among my own 2016 papers my two personal favorites are <a href="http://grigory.us/files/soda16.pdf">tight bounds on space complexity of computing approximate matchings in dynamic streams</a> (with S. Assadi, S. Khanna and Y. Li) and the <a href="http://eccc.hpi-web.de/report/2016/174/"><script type="math/tex">\mathbb F_2</script>-sketching paper</a> (with S. Kannan and E. Mossel and some special credit to Swagato Sanyal who subsequently improved the dependence on error in one of our main theorems).</p>
<p>It’s been a great year with several open problems resolved, old algorithms improved and new lines of research started.
All papers discussed below are presented in no particular order and their selection is clearly somewhat biased towards my own research interests.</p>
<h2>Maximum Weighted Matching in Semi-Streaming</h2>
<p>Sweeping both the best paper and the best student paper awards at the upcoming 28th ACM-SIAM Symposium on Discrete Algorithms (SODA’17) is a paper on semi-streaming algorithms for maximum weighted matching by graduate students Ami Paz and Gregory Schwartzman.
In semi-streaming we are given one pass over the edges of an <script type="math/tex">n</script>-vertex graph and only <script type="math/tex">\tilde O(n)</script> bits of space.
It is easy to get a 2-approximation to the maximum matching by just maintaining a maxim<strong>al</strong> matching of the graph.
However, for weighted graphs maximal matching no longer guarantees a 2-approximation.</p>
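The unweighted 2-approximation above is simple enough to sketch in a few lines. Here is a toy Python illustration (names and details are mine, not from any of the papers):

```python
def greedy_maximal_matching(edge_stream):
    """One pass over the edge stream: keep an edge iff both endpoints are
    still unmatched. The result is a maximal matching, whose size is at
    least half that of a maximum matching, and only O(n) space is used."""
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On a weighted stream this guarantee breaks down, which is exactly the difficulty the primal-dual approach addresses.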
<p>A long line of work has previously given constant-factor approximations for this problem and finally we have a <script type="math/tex">(2+\epsilon)</script>-approximation.
It is achieved via a careful implementation of the primal-dual algorithm for matchings in the semi-streaming setting.
It may seem somewhat surprising that primal-dual hasn’t been applied to this problem before since in the area of approximation algorithms it is a pretty standard way of reducing weighted problems to their unweighted versions, but the exact details of how to implement primal-dual in the streaming setting are quite delicate. I couldn’t find a version of this paper online so the best bet might be to wait for the SODA proceedings.</p>
<p>Now the big open question is whether one can beat the 2-approximation which is open even in the unweighted case.</p>
<h2>Shuffles and Circuits</h2>
<p>Best paper award at the 28th ACM Symposium on Parallelism in Algorithms and Architectures went to ‘‘<a href="http://theory.stanford.edu/~sergei/papers/spaa16-mrshuffle.pdf">Shuffles and Circuits</a>’’, a paper by Roughgarden, Vassilvitskii and Wang.
This paper emphasizes the difference between rounds of MapReduce and depth of a circuit.
Because some of the machines can choose to stay silent between rounds, a round of MapReduce can be more complex than a layer of a circuit: the set of machines sending input to the next round might depend on the original input data.
The paper shows that nevertheless the standard circuit complexity ‘‘degree bound’’ can be applied to MapReduce computation.
That is, any Boolean function whose polynomial representation has degree <script type="math/tex">d</script> requires <script type="math/tex">\Omega(\log_s d)</script> rounds of MapReduce using machines with space <script type="math/tex">s</script>.
This implies an <script type="math/tex">\Omega(\log_s n)</script> lower bound on the number of rounds for computing connectivity of a graph.
The authors also make explicit a connection between the MapReduce model and <script type="math/tex">NC^1</script> (see definition <a href="https://en.wikipedia.org/wiki/NC_(complexity) ">here</a>) which implies that improving lower bounds beyond <script type="math/tex">\log_s n</script> for polynomially many machines would imply separating <script type="math/tex">P</script> from <script type="math/tex">NC^1</script>.</p>
<h2>Beating Counting Sketches for Insertion-Only Streams</h2>
<p>Both <a href="http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarCF.pdf ">CountSketch</a> and <a href="https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch">Count-Min Sketch</a>, which are textbook approximate data structures for storing very large dynamically changing numerical tables in small space, have been improved this year under the assumption that data in the table is only incremented.
These improvements are for the most common application of such sketches, ‘‘heavy hitters’’ – the task of approximately recovering the largest entries of the table.
For CountSketch see <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/bciw16.pdf">the paper</a> by Braverman, Chestnut, Ivkin, Woodruff from STOC’16 and for CountMin Sketch <a href="https://arxiv.org/abs/1603.00213">the paper</a> by Bhattacharyya, Dey and Woodruff from PODS’16.</p>
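As a reminder of how these counting sketches work, here is a minimal toy Count-Min sketch in Python (the hashing scheme and parameters are illustrative, not from either paper; in the increment-only setting estimates never undershoot the true count):

```python
import random

class CountMin:
    """Toy Count-Min sketch: d rows of w counters with independent hashes.
    estimate() takes the minimum over rows; each row overestimates by at
    most (total count) / w in expectation."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rng.randrange(2**31) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _h(self, i, x):
        # cheap stand-in for a pairwise-independent hash function
        return hash((self.salts[i], x)) % self.w

    def update(self, x, delta=1):
        for i in range(self.d):
            self.table[i][self._h(i, x)] += delta

    def estimate(self, x):
        return min(self.table[i][self._h(i, x)] for i in range(self.d))
```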
<h2>Optimality of the Johnson-Lindenstrauss Transform</h2>
<p>Two papers by <a href="https://arxiv.org/pdf/1609.02094v1.pdf ">Green Larsen and Nelson</a> and by <a href="http://www.cs.tau.ac.il/~nogaa/PDFS/compression3.pdf">Alon and Klartag</a> have resolved the question of proving optimality of the Johnson-Lindenstrauss transform.
Based on projecting onto a random low-dimensional subspace, the JL transform is the main theoretical tool for dimensionality reduction of high-dimensional vectors.
As these papers show, no low-dimensional embedding and furthermore no data structure can achieve better bit complexity than <script type="math/tex">\Theta(n \log n/\epsilon^2)</script> for <script type="math/tex">(1 \pm \epsilon)</script>-approximating all pairwise distances between <script type="math/tex">n</script> vectors in Euclidean space (for a certain regime of parameters).
This matches the Johnson and Lindenstrauss upper bound and improves an old lower bound of <script type="math/tex">\Omega\left(\frac{n \log n}{ \epsilon^2 \log 1/\epsilon}\right)</script> due to Alon.
Even though Alon’s argument is significantly simpler, getting an optimal lower bound is a very nice achievement.</p>
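For intuition, the JL transform is easy to try out numerically. The toy experiment below projects 50 points from a 1000-dimensional space down to 300 dimensions with a random Gaussian matrix and checks how much pairwise distances are distorted (the dimensions and constants are illustrative, not tuned to the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 300  # target dimension k should be O(log(n) / eps^2)

X = rng.normal(size=(n, d))               # n points in R^d
M = rng.normal(size=(k, d)) / np.sqrt(k)  # random Gaussian projection
Y = X @ M.T                               # projected points in R^k

# worst multiplicative distortion of a pairwise distance
max_distortion = max(
    abs(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]) - 1)
    for i in range(n) for j in range(i + 1, n)
)
```

With these parameters all 1225 pairwise distances are typically preserved to within a few percent.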
<h2>Fast Algorithm for Edit Distance if It's Small</h2>
<p><a href="https://en.wikipedia.org/wiki/Edit_distance ">Edit distance</a> is one of the cornerstone metrics of text similarity in computer science. It can be computed in quadratic time using standard dynamic programming, which is optimal assuming SETH due to the <a href="https://arxiv.org/abs/1412.0348 ">result of Backurs and Indyk</a>.
Edit distance also has a number of applications including comparing DNAs in computational biology.
In these applications it is usually reasonable to assume that edit distance is only interesting if it is not too large.
Unfortunately, this doesn’t help speed up the standard dynamic program.
A series of papers, including two papers from this year by <a href="http://iuuk.mff.cuni.cz/~koucky/papers/editDistance.pdf ">Chakraborty, Goldenberg and Koucky</a> (STOC’16) and
<a href="http://homes.soic.indiana.edu/qzhangcs/papers/focs16-ED.pdf ">Belazzougui and Zhang</a> (FOCS’16), has led to the following result: sketches of size <script type="math/tex">poly(K \log n)</script> bits suffice for computing edit distance <script type="math/tex">\le K</script>. Such sketches can be applied not just in centralized but also in distributed and streaming settings, making it possible to compress input strings down to a size that (up to logarithmic factors) only depends on <script type="math/tex">K</script>.</p>
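For reference, the quadratic-time dynamic program mentioned above is the following textbook routine (a plain Python sketch, not the sketching algorithm from these papers):

```python
def edit_distance(s, t):
    """Standard O(|s| * |t|) dynamic program for edit distance,
    keeping only two rows of the DP table at a time."""
    n, m = len(s), len(t)
    prev = list(range(m + 1))  # distances from the empty prefix of s
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                           # delete s[i-1]
                         cur[j - 1] + 1,                        # insert t[j-1]
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitute
        prev = cur
    return prev[m]
```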
<h2>Tight Bounds for Set Cover in Streaming</h2>
<p>Set Cover is a surprisingly powerful abstraction for a lot of applications that involve providing coverage for some set of terminals.
Given a collection of sets <script type="math/tex">S_1, \dots, S_m \subseteq [n]</script> the goal is to find the smallest cardinality subcollection of these sets such that their union is <script type="math/tex">[n]</script>, i.e. all of the underlying elements are covered.
In approximation algorithms a celebrated greedy algorithm gives an <script type="math/tex">O(\log n)</script>-approximation for this problem.
In streaming there has been a lot of interest lately in approximating classic combinatorial optimization problems in small space with Set Cover being one of the main examples.
For an overview from last year check Piotr Indyk’s <a href="https://www.youtube.com/embed/_4mM1UGI9Dg?list=PLqxsGMRlY6u659-OgCvs3xTLYZztJpEcW ">talk</a> from the <a href="http://grigory.us/mpc-workshop-dimacs.html ">DIMACS Workshop on Big Data and Sublinear Algorithms</a>.</p>
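The greedy algorithm achieving the <script type="math/tex">O(\log n)</script>-approximation is only a few lines of Python (a toy offline sketch; names are mine):

```python
def greedy_set_cover(sets, universe):
    """Repeatedly pick the set covering the most still-uncovered elements.
    This classic greedy rule gives an O(log n)-approximation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("some elements cannot be covered")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen
```

The streaming results below concern exactly how much of this can be salvaged when the sets arrive one by one and space is limited.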
<p>As <a href="http://www.seas.upenn.edu/~sassadi/stuff/papers/tbfsscotscp-conf.pdf ">this STOC’16 paper</a> by Assadi, Khanna and Li shows, savings in space for streaming Set Cover can only be proportional to the loss in approximation. In particular, if we are interested in computing a Set Cover which is within a multiplicative factor <script type="math/tex">\alpha</script> of the optimum then:
1) for computing the cover itself, space <script type="math/tex">\tilde \Theta(mn/\alpha)</script> is necessary and sufficient,
2) for just estimating its size, space <script type="math/tex">\tilde \Theta(mn/\alpha^2)</script> is necessary and sufficient.</p>
<h2>Polynomial Lower Bound for Monotonicity Testing</h2>
<p>Finally, a polynomial lower bound has been shown for adaptive algorithms for testing monotonicity of Boolean functions <script type="math/tex">f \colon \{0,1\}^n \rightarrow \{0,1\}</script>.
The lower bound implies that any algorithm that can tell whether <script type="math/tex">f</script> is monotone or differs from monotone on a constant fraction of inputs has to query at least <script type="math/tex">\tilde \Omega(n^{1/4})</script> values of <script type="math/tex">f</script>.
This result is due to <a href="https://arxiv.org/abs/1511.05053 ">Belovs and Blais</a> (STOC’16) and is in contrast with the upper bound of <script type="math/tex">\tilde O(\sqrt{n})</script> by Khot, Minzer and Safra from last year’s FOCS.
Probably the biggest result in property testing this year.</p>
<h2>Linear Hashing is Awesome</h2>
<p>While ‘‘<a href="http://ieee-focs.org/FOCS-2016-Papers/3933a345.pdf ">Linear Hashing is Awesome</a>’’ by Mathias Bæk Tejs Knudsen doesn’t fall into the traditional ‘‘sublinear algorithms for big data’’ category this paper still has some sublinear flavor because of its focus on very fast query times.
Linear hashing is a classic hashing scheme
<script type="math/tex">h(x) = ((ax + b) \mod p) \mod m</script>
where <script type="math/tex">a,b</script> are random. It is very often used in practice and discussed extensively in CLRS.
This paper proves that linear hashing <strike>is awesome</strike> results in an expected length of the longest chain of only <script type="math/tex">O(n^{1/3})</script>, compared to the previous simple bound of <script type="math/tex">O(\sqrt{n})</script>.</p>
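A quick empirical illustration of the scheme (a toy Python experiment; the prime and parameters are illustrative):

```python
import random

def make_linear_hash(m, p=2**61 - 1):
    """Sample h(x) = ((a*x + b) mod p) mod m from the linear family."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def longest_chain(keys, m):
    """Hash the keys into m buckets with chaining; return the longest chain."""
    h = make_linear_hash(m)
    buckets = [0] * m
    for x in keys:
        buckets[h(x)] += 1
    return max(buckets)
```

Note that the <script type="math/tex">O(n^{1/3})</script> bound is for worst-case key sets; on random keys a typical run does much better.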
<p>Finally, this paper also decisively wins my ‘‘Best Paper Title 2016’’ award.</p>
<h2>Looking forward to more cool results in 2017!</h2>
<p>There have been a lot of great results in 2016; it’s hard to mention all of them in one post and I certainly might have missed some exciting papers. Here is a quick shout-out to some other papers that were close to making the above list:</p>
<ul>
<li><a href="https://arxiv.org/abs/1507.04299 ">Tight Bounds for Data-Dependent LSH</a> by Andoni and Razenshteyn from SoCG'16.</li>
<li><a href="http://arxiv.org/abs/1603.05346 ">Optimal Quantile Estimation in Streams</a> by Karnin, Lang and Liberty from FOCS'16.
</li>
</ul>
<p>Happy 2017!</p>
<p><a href="http://grigory.us/blog/whats-new-in-big-data-theory-2016/">What's New in the Big Data Theory 2016</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on December 30, 2016.</p><![CDATA[The Binary Sketchman]]>http://grigory.us/blog/the-binary-sketchman2016-10-07T00:00:00+00:002016-10-07T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>In this post I will talk about some of my recent work with <a href="http://www.cis.upenn.edu/~kannan/">Sampath Kannan</a> and <a href="https://stat.mit.edu/people/elchanan-mossel/">Elchanan Mossel</a> on linear methods for binary data compression. The paper is <a href="http://eccc.hpi-web.de/report/2016/174/">available here</a>, slides from my talk at Penn are <a href="http://grigory.us/files/talks/penn16.pdf">here</a> and another talk at Columbia is <a href="http://www.cs.columbia.edu/theory/f16-theoryread.html#Grigory">coming up on Nov 21</a>.</p>
<p>Given very large data represented in binary format as a string of length <script type="math/tex">n</script>, i.e. <script type="math/tex">x \in \{0,1\}^n</script>
we are interested in a compression algorithm that can transform <script type="math/tex">x</script> into a much shorter binary string <script type="math/tex">y \in \{0,1\}^k</script>.
Here <script type="math/tex">k \ll n</script> so that we can achieve some non-trivial savings in space.
Moreover, if <script type="math/tex">x</script> changes in the future we would like to be able to update our compressed version of it (without having to store the original <script type="math/tex">x</script>).</p>
<p>Clearly compression introduces some loss making it impossible to recover certain properties of the original data from the compressed string.
However, if we know in advance which property of <script type="math/tex">x</script> we are interested in then efficient compression often becomes possible.
We will model the property of interest as a binary function <script type="math/tex">f:\{0,1\}^n \rightarrow \{-1,1\}</script> which labels all possible <script type="math/tex">x</script>’s with two labels.
So our goal will be to be able to: 1) perform this binary classification, i.e. compute <script type="math/tex">f(x)</script> using compressed data <script type="math/tex">y</script> only, 2) do this even if <script type="math/tex">x</script> changes over time – updates for us will be bit flips in the coordinates of <script type="math/tex">x</script> specified by the index of the bit that is getting flipped.</p>
<p>Finally, if <script type="math/tex">x</script> is so big that it can’t be stored locally and has to be divided into chunks stored across multiple machines then we will be able to compress the chunks locally and then combine them on a central server into a compressed version of the entire data – one simple round of MapReduce or whatever your favorite distributed framework is.</p>
<p>To make the above discussion less abstract let’s consider a machine learning application – evaluating a linear classifier over binary data.
Let’s say we have trained a linear classifier of the form <script type="math/tex">sign(\sum_{i = 1}^n w_i x_i - \theta)</script> where sign is the sign function.
Is it possible to compress <script type="math/tex">x</script> in such a way that we can still evaluate our classifier in the scenarios described above?
Turns out we can compress the input down to <script type="math/tex">O(\theta/m \log (\theta/m))</script> bits where <script type="math/tex">m</script> is a parameter of the linear classifier known as its margin. Moreover, no compression scheme can do better.</p>
<h1 id="introducing-the-binary-sketchman">Introducing the Binary Sketchman</h1>
<div align="center"><img alt="The Binary Sketchman" src="http://grigory.us/blog/pics/binary-sketchman-final.png" /> </div>
<p><br /></p>
<p>While the setting described above may seem quite challenging it can be handled through a framework of linear sketching.
In the binary case the interpretation of linear sketching is particularly simple as our binary sketchman is just going to compute <script type="math/tex">k</script> parities of the bits of <script type="math/tex">x</script>, say for <script type="math/tex">k=3</script>:</p>
<script type="math/tex; mode=display">x_4 \oplus x_2, \quad x_{42}, \quad x_{566} \oplus x_{610} \oplus x_{239} \oplus x_{57}.</script>
<p>In a matrix form this corresponds to computing <script type="math/tex">Mx</script> where <script type="math/tex">M</script> is a <script type="math/tex">k \times n</script> binary matrix and the operations are performed over <script type="math/tex">\mathbb F_2</script>.
Note that now our sketch easily satisfies all the requirements above since as <script type="math/tex">x</script> changes we can just update the corresponding parities. In the distributed case we can compute them locally and then add them up on a central server.</p>
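In code, the sketch, the update rule and the merge step are each a line or two (a toy Python sketch of the general framework, not the specific constructions from the paper):

```python
import random

def random_sketch_matrix(k, n, seed=0):
    """A k x n random binary matrix M; row i is the indicator of parity i."""
    rng = random.Random(seed)
    return [[rng.randrange(2) for _ in range(n)] for _ in range(k)]

def sketch(M, x):
    """Compute Mx over F_2, i.e. k parities of the bits of x."""
    return [sum(row[j] & x[j] for j in range(len(x))) % 2 for row in M]

def flip_update(M, y, j):
    """Bit j of x was flipped: XOR column j of M into the sketch y."""
    return [y_i ^ M[i][j] for i, y_i in enumerate(y)]

def merge(y1, y2):
    """Sketch of the XOR of two chunks is the XOR of their sketches."""
    return [a ^ b for a, b in zip(y1, y2)]
```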
<p>Unfortunately the power of a deterministic sketchman who just uses a fixed set of parities is quite limited and no such sketchman can compress even a simple linear classifier down to less than <script type="math/tex">n</script> bits.
In fact, even for the OR function <script type="math/tex">f = x_1 \vee x_2 \vee \dots \vee x_n</script> no deterministic sketch can have less than <script type="math/tex">n</script> bits.
So our binary sketchman will “<a href="http://www.cs.cmu.edu/~haeupler/15859F14/">unleash the power of randomization</a>” in his quest for a perfect sketch.
According to <a href="http://www.cs.cmu.edu/~haeupler/">Bernhard Haeupler</a> this can be quite dramatic and looks kind of like this:</p>
<div align="center"><img width="300px" alt="The power of randomness unleashed" src="http://www.cs.cmu.edu/~haeupler/15859F14/images/posternoinf.jpg" /> </div>
<p><br />
So our sketchman will instead pick the matrix <script type="math/tex">M</script> randomly while the rest is the same as before.
Now the OR function is easy to handle: pick a parity over a random subset of <script type="math/tex">\{1, \dots, n\}</script> where each coordinate is included with probability <script type="math/tex">1/2</script>.
If <script type="math/tex">OR(x) = 1</script> then this parity catches a non-zero coordinate of <script type="math/tex">x</script> with probability <script type="math/tex">1/2</script> and thus evaluates to <script type="math/tex">1</script> with probability at least <script type="math/tex">1/4</script>.
If <script type="math/tex">OR(x) = 0</script> then the parity never evaluates to <script type="math/tex">1</script> so we can distinguish the two cases with probability <script type="math/tex">1 - \delta</script> using <script type="math/tex">O(\log 1/\delta)</script> such parities.
This illustrates a more general idea – if <script type="math/tex">f</script> is a constant function on all but <script type="math/tex">m</script> different inputs then a sketch of size <script type="math/tex">O(\log m + \log 1/\delta)</script> suffices.</p>
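Here is what the randomized sketch for OR looks like concretely (a toy Python sketch; the constant in the number of parities is illustrative):

```python
import math
import random

def or_sketch_guess(x, delta, seed=None):
    """Guess OR(x) from O(log 1/delta) random parities.
    Each parity includes every coordinate independently with prob. 1/2;
    if OR(x) = 0 every parity is 0, while if OR(x) = 1 each parity is 1
    with constant probability, so we answer 1 iff some parity is 1."""
    rng = random.Random(seed)
    k = max(1, math.ceil(8 * math.log(1 / delta)))
    for _ in range(k):
        parity = sum(x[i] for i in range(len(x)) if rng.randrange(2)) % 2
        if parity == 1:
            return 1
    return 0
```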
<p>Now for linear thresholds the high-level ideas behind this sketching process are as follows:
1) observe that any linear threshold function takes the same value on all but <script type="math/tex">n^{O(\theta/m)}</script> inputs,
2) apply the same argument as above to obtain a sketch of size <script type="math/tex">O(\theta/m \log n + \log 1/\delta)</script>.
The only thing missing in the above argument is that we still have dependence on <script type="math/tex">n</script>.
This can be avoided if we first hash the domain reducing its size down to <script type="math/tex">n' = poly(\theta/m)</script> which replaces <script type="math/tex">n</script> in the above calculations giving us <script type="math/tex">O(\theta/m \log \theta/m + \log 1/\delta)</script>.
While this compression method is quite simple the remarkable fact is that it can’t be improved.
Even for the simplest threshold function that corresponds to a threshold for the Hamming weight of <script type="math/tex">x</script>, i.e. <script type="math/tex">sign(\sum_{i = 1}^n x_i - k)</script>, any compression mechanism would require <script type="math/tex">\Omega(k \log k)</script> bits as follows from <a href="http://link.springer.com/chapter/10.1007/978-3-642-32512-0_44">this work</a> by Dasgupta, Kumar and Sivakumar.
Note that it isn’t assumed that the protocol is based on linear sketching – it can be an arbitrary scheme.</p>
<h1 id="the-power-of-randomized-binary-sketchman">The Power of Randomized Binary Sketchman</h1>
<p>Linear sketching by itself is not a new idea and has been studied extensively in the last two decades.
See surveys by <a href="http://researcher.watson.ibm.com/researcher/view.php?person=us-dpwoodru">Woodruff</a> and <a href="http://people.cs.umass.edu/~mcgregor/">McGregor</a> on how it can be applied to problems in <a href="http://researcher.ibm.com/files/us-dpwoodru/wNow3.pdf ">numerical linear algebra</a> and <a href="http://link.springer.com/referenceworkentry/10.1007/978-3-642-27848-8_796-1">graph compression</a>.
However, this work focuses on linear sketching over large finite fields (used to represent real values with bounded precision).
Nevertheless some striking results are known about linear sketching that are applicable in our context as well.
In particular, if <script type="math/tex">x</script> is updated through a very long (triply exponential in <script type="math/tex">n</script>) stream of adversarial updates then linear sketches over finite fields are optimal for any function <script type="math/tex">f</script> as shown by Li, Nguyen and Woodruff <a href="https://pdfs.semanticscholar.org/bf89/98d76741f3ee7b4ba1f82524353e7083c3b5.pdf ">here</a> in STOC’14.</p>
<p>As our paper shows, the same result holds for much shorter random streams of length <script type="math/tex">\tilde O(n)</script> in a simple model where each update flips a uniformly at random chosen coordinate of <script type="math/tex">x</script>.
In other words, binary sketching is optimal if at the end of the stream the input <script type="math/tex">x</script> is uniformly distributed.
The proof of this fact is quite technical and relies on a notion of <i>approximate Fourier dimension</i> for Boolean functions that we use to characterize binary sketching under the uniform distribution – check the paper for details if you are interested.
Whether the same result holds for short (say, length <script type="math/tex">\tilde O(n)</script>) adversarial streams is the main question left open.</p>
<p><a href="http://grigory.us/blog/the-binary-sketchman/">The Binary Sketchman</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on October 07, 2016.</p><![CDATA[Teaching “Foundations of Data Science”]]>http://grigory.us/blog/foundations-of-data-science-class2016-08-27T00:00:00+00:002016-08-27T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>This week I started teaching a graduate class called “<a href="http://grigory.us/data-science-class.html">Foundations of Data Science</a>” that will be mostly based on an eponymous book by <a href="https://en.wikipedia.org/wiki/Avrim_Blum ">Avrim Blum</a>, <a href="https://en.wikipedia.org/wiki/John_Hopcroft ">John Hopcroft</a> and <a href=" https://en.wikipedia.org/wiki/Ravindran_Kannan ">Ravi Kannan</a>.
The book is still a draft and I am using <a href="http://grigory.us/files/bhk-book.pdf">this version</a>.
Target audience includes advanced undergraduate and graduate level students.
We had some success using this book as a core material for an undergraduate class at Penn this Spring (<a href="http://www.thedp.com/article/2016/02/cis-399-students">link to the news article</a>).
The draft has been around for a while and in fact I ran a reading group that used it four years back when I was in grad school and the book was called
“Computer Science Theory for the Information Age”.</p>
<div align="center"><img width="200px" alt="Keep calm and dig foundations of Data Science" src="http://grigory.us/blog/pics/b609-poster-homepage.png" /> </div>
<p>“Data Science” is one of those buzzwords that can mean very different things to different people.
In particular, a new graduate <a href="http://www.soic.indiana.edu/graduate/degrees/data-science/index.html">Masters program in Data Science here at IU</a> attracts hundreds of students from diverse backgrounds.
What I personally really like about the Blum-Hopcroft-Kannan book is that it doesn’t go into any philosophy about the meaning of data science but rather offers a collection of mathematical tools and topics that can be considered as foundational for data science as seen from computer science perspective.
It should be noted that just as any “Foundations of Computing” class has little to do with finding bugs in your code, this class and book have little to do with data cleaning and other data analysis routine.</p>
<h1>Topics</h1>
<p>While the jury is still out on what topics should be considered as fundamental for data science I think that the Blum-Hopcroft-Kannan book makes a good first step in this direction.</p>
<p>Let’s look at the table of contents:</p>
<ul>
<li>Chapter 2 introduces basic properties of the high-dimensional space, focusing on concentration of measure, properties of high-dimensional Gaussians and basic dimension reduction. </li>
<li>Chapter 3 covers the Singular Value Decomposition (SVD) and its applications (principal component analysis, clustering mixture of Gaussians, etc.).</li>
<li>Chapter 4 focuses on random graphs (primarily in the Erdos-Renyi model).</li>
<li>Chapter 5 introduces random walks and Markov chains, including Markov Chain Monte Carlo methods, random walks on graphs and applications such as PageRank.</li>
<li>Chapter 6 covers the very basics of machine learning theory, including learning basic function classes, perceptron algorithm, regularization, kernelization, support vector machines, VC-dimension bounds, boosting, stochastic gradient descent and a bunch of other topics. </li>
<li>Chapter 7 describes a couple of streaming and sampling methods for big data: frequency moments in streaming and matrix sampling.</li>
<li>Chapter 8 is about clustering methods: k-means, k-center, spectral clustering, cut-based clustering, etc.</li>
<li>Chapters 9 through 11 cover a very diverse set of topics that includes hidden Markov processes, graphical models, belief propagation, topic models, voting systems, compressed sensing, optimization methods and wavelets among others.</li>
</ul>
<h1>Discussion</h1>
<p>Overall this looks like a good stab at the subject and a big advantage of this book is that unlike some of its competitors it treats its topics with mathematical rigor.
The only chapter that I personally don’t see fitting into a “data science” class is Chapter 4. Because of its focus on the Erdos-Renyi model, which I haven’t seen used realistically in graph modeling applications, this chapter seems to be mostly of purely mathematical interest.</p>
<p>Selection of some of the smaller topics is a matter of personal taste, especially when it comes to those that are missing.
A couple of quick suggestions are to cover new sketching algorithms for <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/wNow.pdf">high-dimensional linear regression</a>, <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality-sensitive hashing</a> and possibly <a href="http://groups.csail.mit.edu/netmit/sFFT/index.html">Sparse FFT</a>.</p>
<p>Slides will be posted <a href="http://grigory.us/data-science-class.html#lectures">here</a> and I will write a report on the final selection of topics and my experience at the end of the semester. Stay tuned :)</p>
<p><a href="http://grigory.us/blog/foundations-of-data-science-class/">Teaching “Foundations of Data Science”</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on August 27, 2016.</p><![CDATA[ESA'16 Deadline Approaching]]>http://grigory.us/blog/esa-20162016-04-18T00:00:00+00:002016-04-18T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>The deadline for submissions to <a href="http://conferences.au.dk/algo16/esa/">ESA’16</a> (24th European Symposium of Algorithms) is in 3 days.
As a PC member I would like to encourage you to submit your work and also plug the event and its location.</p>
<div align="center"><img alt="" src="http://grigory.us/blog/pics/esa16.png" /> </div>
<p><br /></p>
<p>This time the conference is a part of a broader symposium <a href="http://conferences.au.dk/algo16/home/">ALGO’16</a> which will take place in Aarhus, Denmark on August 22-26.
In the spirit of colocation <a href="http://grigory.us/blog/stoc-focs-proposal-colocate.html">previously advocated on this blog</a> this symposium brings together several conferences and workshops.
Most relevant to this blog are <a href="">ALGOCLOUD</a> (a new workshop on algorithms for cloud computing) and <a href="http://conferences.au.dk/algo16/massive/">MASSIVE</a> (a workshop on algorithms for massive data). A nice feature of MASSIVE is that it doesn’t have published proceedings. This means that contributions to the workshop can also be published in other conferences.</p>
<div align="center"><img alt="" src="http://grigory.us/blog/pics/algo16.png" /> </div>
<p><br /></p>
<p>Aarhus is definitely one of the most vibrant and forward-thinking centers for research in algorithms and theoretical computer science at large in Europe.
I was very lucky to visit the <a href="http://ctic.au.dk/">Center for the Theory of Interactive Computation</a> (CTIC) about 3 years ago.
This Sino-Danish center is a great example of a collaboration between Tsinghua University (the leading computer science institution in China) and its Western partners.</p>
<p>I really enjoyed spending a week at CTIC hosted by <a href="https://www.cs.swarthmore.edu/~brody/">Joshua</a> and <a href="http://web.mit.edu/matulef/www/">Kevin</a>.
Coincidentally a friendly soccer game between CTIC and MADALGO took place during my visit and I got drafted to play against algorithms folks.
MADALGO is another joint center (with MIT and MPI) and these guys clearly knew a better algorithm for soccer than we did.</p>
<p>MADALGO team:</p>
<div align="center"><img alt="" src="http://grigory.us/blog/pics/madalgo.jpg" /> </div>
<p><br /></p>
<p>CTIC team:</p>
<div align="center"><img alt="" src="http://grigory.us/blog/pics/ctic.jpg" /> </div>
<p><a href="http://grigory.us/blog/esa-2016/">ESA'16 Deadline Approaching</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on April 18, 2016.</p><![CDATA[The Simple Economics of Algorithms for Big Data]]>http://grigory.us/blog/the-simple-economics-of-algorithms-for-big-data2016-01-20T00:00:00+00:002016-01-20T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>
In this blog post I want to suggest a simple reason why you should study your algorithms <b>really</b> well if you want to design algorithms that deal with big data.
This reason comes from <b>the way billings offered by cloud services work</b>.
</p>
<p>
Maybe you remember yourself taking that algorithms class and thinking: “Who really cares if that algorithm uses a bit more time? Can't we just wait a little longer?”.
Or “Ok, we can save some space here, but if it all fits into my RAM anyway then why bother?”.
These are both great reasons not to care too much about efficiency of your algorithms if your data is small, fits into RAM and the running times aren't significant enough to matter anyway.
So you would go on to program your favorite video game and not care about that professor talking about all that big-Oh nonsense.
And in the short run you would be right. While you are developing a prototype of your favorite video game you shouldn't care.
When I was working at a startup I remember myself learning the hard way that <a href="http://c2.com/cgi/wiki?PrematureOptimization ">premature optimization is the root of all evil</a>.
</p>
<div align="center"><img alt="abstruse-goose-video-games" src="http://grigory.us/blog/pics/abstruse-goose-video-games.png" /> </div>
<p><br /></p>
<p>
However, once your video game becomes successful and you get to deal with big data that has to be stored and processed in the cloud this reasoning starts to fall short.
Let's say you developed <a href="https://en.wikipedia.org/wiki/Candy_Crush_Saga">Candy Crush Saga</a> (<a href="http://www.standard.co.uk/business/business-news/candy-crush-saga-owner-king-digital-entertainment-valued-at-7bn-9216058.html">valued at $7bn in 2014</a>) and now you are interested in doing some data analytics about your >10 million active users.
You are now considering outsourcing your data storage and computation to the cloud.
Here is where you might want to learn why the design of space and time-efficient algorithms matters for the bottom line of your future business.
<h1>100x more efficient algorithms = 100x less money in billings</h1>
So that time and space your professor was talking about – what does it have to do with your spending on the cloud services?
The answer is surprisingly simple – <b>if you need 100x more time and space then your billing increases 100 times</b>.
Below I used the pricing calculator that comes with Google Compute Engine to see how the cost scales if I want to use 100/1000/10000 identical machines for a year.
<div align="center"><img alt="cloud-pricings" src="http://grigory.us/blog/pics/cloud-pricings.png" /> </div>
<br />
<p>
I was myself surprised to find this out since I expected some economy of scale to kick in. In fact, sometimes it does, but the effect is usually quite negligible: say, you can get an X% discount, but that doesn't help much against linear scaling.
</p>
</p>
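<p>To make the linear scaling concrete, here is a back-of-the-envelope calculation. The hourly rate below is a made-up placeholder, not an actual quote from any cloud provider:</p>

```python
HOURLY_RATE = 0.05      # hypothetical $/machine-hour, not a real quote
HOURS_PER_YEAR = 24 * 365

def yearly_bill(machines, discount=0.0):
    """Yearly cost of running `machines` identical instances,
    with an optional bulk discount."""
    return machines * HOURLY_RATE * HOURS_PER_YEAR * (1.0 - discount)

# Even a generous 10% bulk discount is dwarfed by an algorithm that is
# 100x more efficient and therefore needs 100x fewer machines:
baseline = yearly_bill(10000, discount=0.10)
efficient = yearly_bill(100)
print(f"savings factor: {baseline / efficient:.0f}x")  # prints "savings factor: 90x"
```

<p>In other words, the bulk discount only turns a 100x saving into a 90x saving: the algorithmic improvement dominates.</p>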
<p><a href="http://grigory.us/blog/the-simple-economics-of-algorithms-for-big-data/">The Simple Economics of Algorithms for Big Data</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on January 20, 2016.</p><![CDATA[Teaching algorithms for Big Data]]>http://grigory.us/blog/teaching-algorithms-for-big-data2015-12-24T00:00:00+00:002015-12-24T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<!--<h1>Teaching “algorithms for Big Data”</h1>
-->
<p>“algorithms for Big Data” (sometimes the name varies slightly) is a new graduate class that has been introduced by many top computer science programs in recent years.
In this post I would like to share my experience teaching this class at the University of Pennsylvania this semester. Here is the <a href="http://grigory.us/big-data-class.html">homepage</a>.</p>
<div align="center"><img alt="Keep calm and crunch data on o(N)" src="http://grigory.us/blog/pics/class-logo-large.png" /> </div>
<p><br /></p>
<p>First off, let me get the most frequently asked question out of the way and say that by “big data” in this class I mean data that doesn’t fit into a local RAM
since if the data fits into RAM then algorithms from the standard algorithms curricula will do the job.
At the moment a terabyte of data is already tricky to fit into RAM so this is where we will draw the line.
In particular, this is so that the <a href="http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html">arguments about beating algorithms for big data using your laptop</a> don’t apply.</p>
<p>Second, I tried to focus as much as possible on algorithms that are known to work in practice and have implementations.
Because this is a theory class we didn’t do programming but I made sure to give links to publicly available implementations whenever possible.
As is always the case, the best algorithms to teach are never exactly the same as the best implementations.
Even the most vanilla problem of sorting an array in RAM is handled in C++ STL via a combination of QuickSort, InsertionSort and HeapSort.
Picking the right level of abstraction is always a delicate decision to make when teaching algorithms and I am pretty happy with the set of choices made in this offering.</p>
<p>Finally, “algorithms for Big Data” isn’t an entirely new phenomenon as a class since it builds on its predecessors
typically called “Sublinear Algorithms”, “Streaming Algorithms”, etc.
Here is a <a href="http://grigory.us/big-data-class.html#sketch">list of closely related classes offered at some other schools</a>.
In fact, my version of this class consisted of <a href="http://grigory.us/big-data-class.html#lectures">four modules</a>:</p>
<ul>
<li><b>Part 1: Streaming Algorithms.</b> It is very convenient to start with this topic since techniques developed in streaming turn out to be useful later. In fact, I could as well call this part “linear sketching” since every streaming algorithm that I taught in this part was a linear sketch. I find single-pass streaming algorithms to be the best motivated, and for so-called dynamic streams, which can contain both insertions and deletions, linear sketches are known to be almost optimal under fairly mild conditions.
Moreover, linear sketches are the baseline solution in the more advanced massively parallel computational models studied later.
</li>
<li><b>Part 2: Selected Topics.</b> This part became very eclectic, containing selected topics in numerical linear algebra, convex optimization and compressed sensing.
In fact, some of the algorithms in this part aren't even “algorithms for Big Data” according to the RAM size based definition.
However, I considered these topics to be too important to skip in a “big data” class.
For example, right after we covered gradient descent methods for convex optimization Google released <a href="https://www.tensorflow.org/">TensorFlow</a>.
This state-of-the-art machine learning library allows one to choose any of its <a href="https://www.tensorflow.org/versions/master/api_docs/python/train.html#optimizers">5 available versions</a> of gradient descent for optimizing learned models. These days, when you can run into some <a href="https://aws.amazon.com/machine-learning/pricing/">pretty steep pricing</a> for outsourcing your machine learning to the cloud, I think knowing what is under the hood of free publicly available frameworks is increasingly important.
</li>
<li><b>Part 3: Massively Parallel Computation.</b> I am clearly biased here, but this is my favorite. Unlike, say, streaming where many results are already tight, we are still quite far from understanding the full computational power of MapReduce-like systems. I think the potential impact of such algorithms is also likely to be the highest. In this class, because of the time constraints, I only touched the tip of the iceberg. This part will be expanded in the future.</li>
<li><b>Part 4: Sublinear Time Algorithms.</b> I always liked clever sublinear time algorithms, but for many years believed that they were not quite “big data” since they operate under the assumption of random access to the data. Well, this year I had to change my mind after Google launched its <a href="https://code.google.com/codejam/distributed_index.html">Distributed Code Jam</a>.
I have to admit that I have no idea how this works on the systems level but apparently it is possible to implement reasonably fast random access to large data.
The problems that I have seen being used for Distributed Code Jam allow one to use 100 nodes, each having a small amount of RAM. The goal is to process a large dataset available via random access.
</li>
</ul>
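<p>To illustrate the linear sketching theme of Part 1, here is a toy version of the classic AMS sketch for the second frequency moment F<sub>2</sub>. This particular implementation is my own minimal sketch for exposition, not code from the class:</p>

```python
import random

class AMSSketch:
    """Toy AMS linear sketch estimating F2 = sum_i f_i^2 of the
    frequency vector of a stream over items {0, ..., n-1}."""

    def __init__(self, k, n, seed=0):
        rng = random.Random(seed)
        # k independent estimators, each with a random +/-1 sign per item.
        # (Exposition only: real implementations use 4-wise independent
        # hash functions so the signs need not be stored explicitly.)
        self.signs = [[rng.choice((-1, 1)) for _ in range(n)]
                      for _ in range(k)]
        self.counters = [0] * k

    def update(self, item, delta):
        # Linearity: an update just adds delta * sign to each counter,
        # so deletions (delta < 0) are handled for free.
        for j in range(len(self.counters)):
            self.counters[j] += delta * self.signs[j][item]

    def estimate_f2(self):
        # Each squared counter is an unbiased estimate of F2; average them.
        return sum(c * c for c in self.counters) / len(self.counters)
```

<p>Because the sketch is a linear function of the frequency vector, sketches of two different streams can simply be added coordinate-wise, which is exactly what makes this approach useful in the massively parallel setting of Part 3.</p>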
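<p>Similarly, the convex optimization material in Part 2 can be summarized by the most basic form of gradient descent. This is a generic textbook sketch, not tied to any particular lecture, and it assumes a differentiable objective whose gradient we can evaluate:</p>

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a differentiable function by repeatedly stepping
    against its gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# for smooth convex objectives a small enough fixed step converges.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

<p>The fancier variants shipped with libraries like TensorFlow (momentum, Adagrad, etc.) are refinements of exactly this update rule.</p>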
<p>Overall parts 1 and 4 are by now fairly standard. Part 2 has some new content from <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/journal.pdf">David Woodruff’s great new survey</a>. Some algorithms from it are also available in IBM’s <a href="https://github.com/xdata-skylark/libskylark">Skylark library for fast computational linear algebra and machine learning</a>.
Part 3 is what makes this class different from most other similar classes.</p>
<h1>Mental Notes</h1>
<p>Here is a quick summary of things I was happy with in this offering + potential changes in the future.</p>
<ul>
<li><b>Research insights.</b> One of the main reasons why I love teaching is that it often leads to research insights, especially when it comes to simple connections I have been missing. For example, I didn't previously realize that one can use <a href="http://grigory.us/files/publications/BRY14-Lp-Testing.pdf">L<sub>p</sub>-testing</a> as a tool for testing assumptions about convexity and Lipschitzness used in the analysis of the convergence rate of gradient descent methods. </li>
<li><b>Project.</b> Overall I am very happy with the students' projects.
Some students implemented algorithms, some wrote surveys and some started new research projects.
Most unexpected to me were the projects done by non-theory students connecting their areas of expertise with the topics discussed in the class. E.g. surveys of streaming techniques used in natural language processing and bioinformatics were really fun to read.</li>
<li><b>Cross-list the class for other departments.</b> It was a serious blunder on my part not to cross-list this class for other departments, especially Statistics and Applied Math.
Given how much interest there is from other fields, this is probably the easiest mistake to fix and the most impactful one.
Somehow some students from other departments learned about the class anyway and expressed their interest, often too late.</li>
<li><b>New content.</b> Because of time constraints I couldn't fit in some of the topics I really wanted to cover.
These include coresets (there has been a resurgence of interest in coresets for massively parallel computing, but I didn't have time to cover it), nearest neighbor search (somehow I couldn't find a good source to teach from, suggestions are very welcome), the HyperLogLog algorithm (same reason), more algorithms for massively parallel computing (no time), and more sublinear time algorithms (no time).
In the next version of this class I will make sure to cover at least some of these.
</li>
<li><b>Better structure.</b> Overall I am pretty happy with the structure of the class but there is definitely room for improvement. A priority will be to better incorporate selected topics discussed in Part 2 into the overall structure of the class. In particular, convex optimization came a little out of the blue even though I am really glad I included it.</li>
<li><b>Slides and equipment.</b> I really like teaching with slides that contain only some of the material and use the blackboard to fill in the missing details and pictures.
On one hand, slides are a backbone that the students can later use to catch up on the parts they missed. On the other hand, the risk of rushing through the slides too fast is minimized since the details are discussed on the board. Also a lot of time is saved on drawing pictures. I initially used Microsoft Surface Pro 2 to fill in the gaps on the tablet instead of the board but later gave up on this idea because of technical difficulties. Having a larger tablet would help too. I still think that the tablet can work but requires a better setup. Next time I will try to use the tablet again and post the final slides online.
</li>
<li><b>Assign homework and get a TA.</b> Michael Kearns and I managed to teach “Computational Learning Theory” without a TA last semester so I decided against getting one for my class as well. This was fine except that having a TA for grading homework would have helped a lot.</li>
<li><b>Make lecture notes and maybe videos.</b> With fairly detailed slides I didn't consider lecture notes necessary. Next time it would be nice to have some since some of my fellow faculty friends asked for them. I think I will stick with the tested “a single scribe per lecture” approach, although I heard that in France students sometimes collaboratively work on the same file during the lecture and the result comes out nicely. When I had to scribe lectures I just LaTeXed them on the fly so I don't see why you can't do this collaboratively.
As for videos, Jelani had <a href="http://people.seas.harvard.edu/~minilek/cs229r/fall15/lec.html">videos</a> from his class this time and they look pretty good. </li>
<li><b>Consider MOOCing.</b> Given that the area is in high demand doing a MOOC in the future is definitely an option. It would be nice to stabilize the content first so that the startup cost of setting up a MOOC could be amortized by running it multiple times.</li>
</ul>
<h1>Thanks</h1>
<p>I am very grateful to my friends and colleagues, discussions with whom helped me a lot while developing this class.
Thanks to Alex Andoni, Ken Clarkson, Sampath Kannan, Andrew McGregor, Jelani Nelson, Eric Price, Sofya Raskhodnikova, Ronitt Rubinfeld and David Woodruff (this is an incomplete list, sorry if I forgot to mention you). Special thanks to all the students who took the class and <a href="http://www.seas.upenn.edu/~sassadi/">Sepehr Assadi</a> who gave a guest lecture on our <a href="http://arxiv.org/pdf/1505.01467.pdf">joint paper about linear sketches of approximate matchings</a>.</p>
<p><a href="http://grigory.us/blog/teaching-algorithms-for-big-data/">Teaching algorithms for Big Data</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on December 24, 2015.</p><![CDATA[Slides and Videos from DIMACS]]>http://grigory.us/blog/dimacs-materials2015-10-29T00:00:00+00:002015-10-29T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>Slides and videos from the DIMACS workshop “Big Data through the Lens of Sublinear Algorithms”
are now available (<a href="http://grigory.us/mpc-workshop-dimacs.html">link</a>).
In case you missed it, this was a great opportunity to catch up on the latest and hottest results in the field.
We were lucky to have a healthy mix of speakers from both academia and industry (represented by researchers from Microsoft, IBM, Google and Yahoo!). I was particularly excited to see talks on both traditional models for sublinear computation (streaming, property testing, etc.) as well as more recent ones (here my own favorites are MapReduce and other modern distributed models).</p>
<p>All keynotes, tutorials and regular talks were great. Among regular talks let me highlight two that were in some ways outliers:</p>
<ul>
<li>Vahab Mirrokni talked about problems and frameworks for large-scale data mining at Google Research NYC (<a href="https://www.youtube.com/watch?v=w7zc1OpN9gk&feature=youtu.be&list=PLqxsGMRlY6u659-OgCvs3xTLYZztJpEcW">video</a>). I really wish this could have been a longer talk.</li>
<li>
Jelani Nelson from Harvard gave a quick tutorial on chaining (<a href="https://www.youtube.com/watch?v=6gfrr5VEbtc&feature=youtu.be&list=PLqxsGMRlY6u659-OgCvs3xTLYZztJpEcW">video</a>). From this tutorial you can also learn about applications of chaining to instance-dependent Johnson-Lindenstrauss dimensionality reduction using Gaussian mean width which I didn't know and found really cool. Jelani is organizing a workshop on related topics at Harvard that will take place on Jun 22–23 (after STOC). </li>
</ul>
<p>Kicking off 2016 is another <a href="http://www.cs.jhu.edu/~vova/sublinear2016/program.html">sublinear algorithms workshop</a> at Johns Hopkins University (Jan 7–9, right before SODA in Arlington, VA).</p>
<p><a href="http://grigory.us/blog/dimacs-materials/">Slides and Videos from DIMACS</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on October 29, 2015.</p><![CDATA[East Coast Workshops]]>http://grigory.us/blog/east-coast-workshops2015-08-19T00:00:00+00:002015-08-19T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>Two events that might be of interest to the readers of this blog are happening on the East Coast next week.</p>
<p>On Monday–Wednesday (Aug 24–26) a <a href="http://cmsa.fas.harvard.edu/big-data/">conference on big data</a> is taking place at Harvard.
This looks like a very exciting event with broad representation of research from different areas including a lot of theory and algorithms.</p>
<p>On Thursday–Friday (Aug 27–28) DIMACS at Rutgers will host a <a href="http://grigory.us/mpc-workshop-dimacs.html">workshop on sublinear algorithms and big data</a> which will be more focused on the algorithmic questions.
As an organizer, I would like to remind readers that the early registration and poster submission deadlines for this workshop are <b>tomorrow</b> (Aug 20).
Note that in many cases the registration fee can be either significantly reduced or waived for local researchers, students and postdocs affiliated with partners of DIMACS.</p>
<p>We are hoping to see some of you at this workshop!</p>
<p><a href="http://grigory.us/blog/east-coast-workshops/">East Coast Workshops</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on August 19, 2015.</p><![CDATA[algorithms for Big Data]]>http://grigory.us/blog/big-data-class2015-08-16T00:00:00+00:002015-08-16T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>Next week I am launching at Penn a new graduate class called “algorithms for Big Data”.
Really excited to be teaching my first full class which was in the works for a while. Last semester I co-taught “<a href="http://www.cis.upenn.edu/~mkearns/teaching/COLT/">Computational Learning Theory</a>”
with <a href="http://www.cis.upenn.edu/~mkearns/ ">Michael Kearns</a> which was a great experience but developing a new class on my own was even more entertaining.
The homepage is <a href="http://grigory.us/big-data-class.html">here</a>.</p>
<div align="center"><img alt="" width="50%" style="border:3px solid black" src="http://grigory.us/blog/pics/class-poster.png" /> </div>
<p><br />
Tentative list of topics is <a href="http://grigory.us/big-data-class.html#plan">available</a> and I will appreciate any comments/suggestions.
Among other related “big data theory” classes listed <a href="http://grigory.us/big-data-class.html#sketch">here</a> my class will be one of the most
focused on distributed algorithms for clusters and Hadoop/MapReduce. E.g., most of the streaming and dimensionality reduction techniques introduced in the first parts of the class serve primarily as an introduction to linear sketching, which works in the distributed context as well.</p>
<p>On a related note, multiple shout-outs to Google, which makes its <a href="https://cloud.google.com/compute/">Compute Engine</a> available for a 2-month free trial with a $200 credit.
The demos in my class will be run on this platform which I think is the friendliest among the competitors.</p>
<p>I was also really excited to find out that this year Google has launched the first large online distributed algorithm competition that I am aware of – <a href="https://code.google.com/codejam/distributed_index.html">Distributed Code Jam</a>.
I’ve been expecting this for a while and now you can finally get your hands on a nice set of algorithmic problems in distributed computing.
The solutions are executed on 100 machines in parallel, which makes it possible to process inputs of 10<sup>9</sup> records easily.
<a href="https://code.google.com/codejam/contest/4264486/dashboard">Practice Round problems</a> include some classic theoretical problems such as distributed majority computation and finding a path on a cycle.</p>
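<p>For a flavor of what such problems look like, here is one standard way to compute a majority element with only a constant amount of communication per machine: run the Boyer-Moore vote locally on each shard, merge the resulting (candidate, weight) pairs, and verify with a second counting round. This is my own sketch of the textbook approach, not an official Code Jam solution:</p>

```python
def boyer_moore(stream):
    """One-pass Boyer-Moore majority vote; returns (candidate, weight)."""
    candidate, weight = None, 0
    for x in stream:
        if weight == 0:
            candidate, weight = x, 1
        elif x == candidate:
            weight += 1
        else:
            weight -= 1
    return candidate, weight

def merge(states):
    """Merge per-machine Boyer-Moore states. Valid because the vote only
    ever cancels pairs of distinct elements, so a true majority element
    survives any cancellation order."""
    candidate, weight = None, 0
    for c, w in states:
        if c == candidate:
            weight += w
        elif w > weight:
            candidate, weight = c, w - weight
        else:
            weight -= w
    return candidate

def distributed_majority(shards):
    # Round 1: each machine reduces its shard to a constant-size state.
    states = [boyer_moore(shard) for shard in shards]
    candidate = merge(states)
    # Round 2: verify with a distributed count, since the candidate may
    # be wrong when no element has a strict majority overall.
    total = sum(len(s) for s in shards)
    count = sum(shard.count(candidate) for shard in shards)
    return candidate if 2 * count > total else None
```

<p>The verification round is essential: without it, merging the local votes can report a spurious candidate when no strict majority exists.</p>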
<!--<h1>Title</h1>-->
<p><a href="http://grigory.us/blog/big-data-class/">algorithms for Big Data</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on August 16, 2015.</p><![CDATA[Colocate, Colocate, Colocate]]>http://grigory.us/blog/stoc-focs-proposal-colocate2015-06-02T00:00:00+00:002015-06-02T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>Adding my two cents to the discussion of the new format for STOC/FOCS conferences I would like to propose only one change which I think is also fairly modest: colocate, colocate, colocate. Well, I agree that it sounds like three changes — the point is that the more colocation the better :) In fact, I realized that I once again agree with Matt Welsh (who recently proposed a similar change for conferences in his community <a href="http://matt-welsh.blogspot.com/2015/05/a-modest-proposal-sosigcommobixdi.html">here</a>) which often happens when he takes a break from bashing academia.
Here are a few fairly straightforward reasons why I think colocation of multiple conferences at the same location and similar time is good:</p>
<ul>
<li> It has been tested and already works pretty well at FCRC. There are things that can happen at a scale of multiple communities that don't happen at the scale of just theory conferences. This year FCRC is hosting SPAA/EC/CCC as well as a few other conferences which might be of interest to theorists.
Possible synergies between different communities can be in the form of joint workshops, tutorials, keynotes, award lectures, etc. E.g., I hope that the <a href="http://grigory.us/mpc-workshop-fcrc.html">workshop on massively parallel algorithms</a> that I am co-organizing will benefit a lot from colocation with other conferences at FCRC. Overall, I am pretty sure these advantages are already fairly well understood.
</li>
<li>Increased number of options among possible talks to attend. I am sure almost everyone has been in a situation when there is nothing interesting happening at their favorite conferences. I would personally much rather attend a great talk on a new topic I don't know much about (even if it is applied) than sit through a mediocre STOC/FOCS talk.
</li>
<li>
Less travel. Well, at a certain stage of their career I believe many of us would like to have to travel less.
Now a more subtle aspect here is that there are conferences that I would really like to attend but I don't submit my papers there (e.g. EC, ICML, COLT), so I would really love to see them colocated with other conferences that I usually attend.
</li>
<li>
No structural changes to the format of existing conferences. This eliminates all concerns associated with allocation of credit for publications, presentations, etc. thus ensuring backwards compatibility.
</li>
</ul>
<h1>What to colocate?</h1>
<p>A possible idea for colocation might be to change the set of colocated conferences in different years which creates a lot of opportunities.
Here are some concrete proposals and I am pretty sure you can come up with more:</p>
<ul>
<li><b>STOC+FOCS+...</b> Possible proposals for ... are: CCC, SPAA, PODC, EC, SOCG, ICALP, COLT, ICML, SIGMOD, PODS since they happen around the same time. </li>
<li><b>SODA+ITCS</b>. I really try to attend both conferences whenever I can given that SODA and ITCS always happen back to back. Without colocation this always creates a seemingly unnecessary logistical overhead. In fact, I first heard this proposal from researchers at Google NYC who strongly supported it.
</li>
<li>
<b>Other conferences</b>. Some of the conferences that don't quite fit in given the time of the year when they usually happen but I would personally love to see colocated in some way: NIPS, VLDB, KDD, CIKM, WSDM, ICDM. I am pretty sure some people have their own list too (e.g. crypto conferences).
</li>
</ul>
<p><a href="http://grigory.us/blog/stoc-focs-proposal-colocate/">Colocate, Colocate, Colocate</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on June 02, 2015.</p><![CDATA[Upcoming Workshops]]>http://grigory.us/blog/upcoming-workshops2015-05-27T00:00:00+00:002015-05-27T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>As promised in the New Year’s post this year there are a lot of activities related to sublinear algorithms and big data.
On behalf of their organizers (<a href="http://www.mit.edu/~andoni/ ">Alex</a>, <a href="http://web.stanford.edu/~ashishg/ ">Ashish</a>, <a href="http://www.cs.rutgers.edu/~muthu/ ">Muthu</a>, <a href="http://theory.stanford.edu/~sergei/ ">Sergei</a> and myself) I would like to invite the readers of this blog to attend them and spread the word.</p>
<!--
<div align="center"><img alt="Happy 2015!" src="http://grigory.us/blog/pics/2015.png"> </div>
-->
<div>
<ul class="fa-ul">
<li> <i class="fa li fa fa-group"></i>
On June 14, <a href="http://fcrc.acm.org/ ">FCRC 2015</a> will hold a workshop “<a href="http://grigory.us/mpc-workshop-fcrc.html">Algorithmic Frontiers of Modern Massively Parallel Computation</a>”.
The program is available <a href="http://grigory.us/mpc-workshop-fcrc.html#schedule">here</a>.
We are very excited to have a great lineup of speakers including:
<ul>
<li><a href="http://www.cs.cmu.edu/~ninamf/">Nina Balcan</a> (CMU)</li>
<li><a href="http://www.cs.washington.edu/people/faculty/beame">Paul Beame</a> (University of Washington)</li> <li><a href="https://sites.google.com/site/ravik53/ ">Ravi Kumar</a> (Google Research, CA)</li> <li><a href="http://people.csail.mit.edu/mirrokni/Welcome.html ">Vahab Mirrokni</a> (Google Research, NYC)</li> <li><a href="http://research.engineering.wustl.edu/~bmoseley/ ">Ben Moseley</a> (Washington University, St. Louis)</li><li> <a href="http://onak.pl">Krzysztof Onak</a> (IBM Research, NY)</li>
</ul>
To spice things up, Michael Stonebraker will be giving his Turing Award lecture right after the workshop.
</li>
<br />
<li> <i class="fa li fa fa-group"></i> On August 27-28 <a href="http://dimacs.rutgers.edu/">DIMACS</a> at Rutgers will host a 2-day <a href="http://dimacs.rutgers.edu/Workshops/ParallelAlgorithms/ ">workshop on massively parallel and sublinear algorithms</a>.
This workshop will feature keynote talks by:
<ul>
<li> <a href="http://web.stanford.edu/~ashishg/ ">Ashish Goel</a> (Stanford)</li> <li><a href="https://people.csail.mit.edu/indyk/">Piotr Indyk</a> (MIT)</li><li><a href="http://hunch.net/~jl/ ">John Langford</a> (Microsoft Research, NYC)</li></ul>
We will also have tutorials by:
<ul>
<li><a href="http://theory.stanford.edu/~sergei/">Sergei Vassilvitskii</a> (Google Research, NYC)</li>
<li><a href="http://researcher.watson.ibm.com/researcher/view.php?person=us-dpwoodru">David Woodruff</a> (IBM Research, Almaden)</li>
</ul>
This workshop will be right after RANDOM/APPROX at Princeton to make it convenient to attend both (especially if you are traveling internationally).
</li>
</ul>
<p>
I might be forgetting to mention some other events that I haven't been directly involved in, so please comment if there is anything else coming up.
</p>
<p>
Also, as a flashback I would like to mention the second “<a href="http://www.gautamkamath.com/sublinearday/">Sublinear Algorithms and Big Data Day</a>” that took place at MIT on April 10.
One of the highlights of the event was the poster session which featured a large number of exciting new results in the field. The full list is available <a href="http://www.gautamkamath.com/sublinearday/posters.txt ">here</a>. This was probably the most successful poster session I have ever been to and we plan to continue this tradition next year.
Thanks again to <a href="http://www.gautamkamath.com/">Gautam</a>, <a href="http://people.csail.mit.edu/costis/ ">Costis</a> and <a href="http://people.csail.mit.edu/indyk/">Piotr</a> for organization and support!
</p>
</div>
<p><a href="http://grigory.us/blog/upcoming-workshops/">Upcoming Workshops</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on May 27, 2015.</p><![CDATA[Modern Algorithms or The Brave New O of the Big N]]>http://grigory.us/blog/modern-intro-algorithms2015-05-09T00:00:00+00:002015-05-09T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<h1>The Role of Algorithms</h1>
<p>Driven by booming enrollments in Computer Science, the first theoretical class that the majors usually take, “Introduction to Algorithms”, is also experiencing unprecedented growth.
At top schools this is evidenced by the fact that dozens of TAs are now employed to teach this class to several hundreds of students each year.
There are multiple reasons making “Introduction to Algorithms” one of the cornerstones of the Computer Science curriculum. One of the key roles that it plays for many students is serving as their first introduction to fully rigorous analysis
of the performance of computer programs.
It teaches the students how to use the rigorous mathematical lens to see abstract structure behind the data that they haven’t seen before.</p>
<div align="center"><img alt="The Brave New O of N" src="http://grigory.us/blog/pics/oofn.jpg" /> </div>
<p>Furthermore, it introduces a basic set of tools that can be used to process large amounts of data regardless of any assumptions about the generation process used to create it (worst-case analysis).
Unlike other popular approaches to algorithm design such as machine learning and its newest incarnation called “deep learning”, the core algorithms curriculum gives solutions which use no training data and behave robustly under
any changes in their input.</p>
<h1>What Should the “Introduction to Algorithms” Look Like?</h1>
<p>The question asked here is provocative and doesn’t have a hard and fast answer for multiple reasons.
First, the answers vary quite a bit depending on who you ask. I will illustrate this point later by comparing the curricula used at some of the top schools.
Hence, I will stress first that all ideas expressed below are a matter of my personal taste.
From this point on all opinions expressed in this post are about how I would teach introduction to algorithms, rather than suggestions that the reader do the same.
In fact, I believe that in the U.S. and all over the world we are very lucky to have enough diversity to create curricula which look quite different from each other thus giving students more options.
While the most fundamental basics are roughly the same, the choice of advanced topics is often driven by the instructor’s research interests.
For those interested in pursuing a research career this gives an opportunity to get involved in research early on.</p>
<p>Second, unlike more traditional subjects such as maths and physics, the subject itself is rapidly evolving.
My rough estimate from looking at the history would be that once every 10-15 years a significant part of the curriculum has to undergo a shake-up.
This is another reason why having an instructor who is an active researcher in the area is critical for keeping up with developments in the field.
Stale curricula can even sometimes create <a href="http://nlpers.blogspot.com/2014/10/machine-learning-is-new-algorithms.html">room for doubt</a> about whether algorithms are still relevant or some other class can be used as a replacement.
While there is <a href="http://blog.geomblog.org/2014/10/algorithms-is-new-algorithms.html ">hardly any doubt</a> that rigorous analysis of algorithms will be relevant for many years to come, concerns such as the one above can be seen as a call for action.</p>
<p>Despite the two fundamental challenges discussed above, I believe that there are some guiding principles that can be used to determine the choice of topics for the introductory classes.
The first one is simplicity and clarity of the underlying ideas.
The second is whether they have passed the test of time and been implemented and used in a variety of software packages. This process serves as a “natural selection” for algorithmic ideas.
A 10-15 year period is usually enough for the hype around hot topics to settle down.
Finally, universality and robustness to the choice of a particular model or architecture also play an important role.
This is probably the hardest principle to use since it involves predicting the future.</p>
<div align="center"><img alt="The Future of Algorithms?" src="http://grigory.us/blog/pics/the-graduate-plastics.jpg" /> </div>
<p><br /></p>
<h1>The Shoulders of Giants</h1>
<h2>Books</h2>
<p>Now let’s briefly discuss the existing literature and curricula at the top schools.</p>
<div align="center"><img alt="CLRS" src="http://grigory.us/blog/pics/clrs3.jpeg" /> </div>
<p>Probably the most canonical textbook on algorithms is the MIT book known as <a href="http://www.amazon.com/gp/product/0262033844/ref=s9_simh_gw_p14_d3_i2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=0MCC9YB55GWMGAKXT26S&pf_rd_t=36701&pf_rd_p=2079475242&pf_rd_i=desktop">CLRS</a> (first published in 1990, the most recent third edition came out in 2009).
I got my first edition in high school back in 2003.
At the time this book was quite a breakthrough compared to the previous generation of textbooks such as <a href="http://www.amazon.com/Data-Structures-Algorithms-Alfred-Aho/dp/0201000237 ">Aho-Hopcroft-Ullman</a>’s and <a href="http://www.amazon.com/gp/product/032157351X/ref=pd_lpo_sbs_dp_ss_3?pf_rd_p=1944687702&pf_rd_s=lpo-top-stripe-1&pf_rd_t=201&pf_rd_i=0201000237&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=0W96SX509C0HBREXWSN3 ">Sedgewick</a>’s.
I was heavily influenced by CLRS, subsequently using it to teach introductory classes for high school students in mid-late 00’s.</p>
<p>The fact that almost all algorithms from CLRS can be implemented as is using general purpose programming languages (C++, Java, Python) also made it very popular in the programming community including the competitive part of it.
E.g. in the Russian summer camps for high school students CLRS formed the core of the B/C-level classes, while A/B-level classes covered topics roughly similar to Erik Demaine’s <a href="http://courses.csail.mit.edu/6.854/current/ ">Advanced Algorithms</a> and <a href="https://courses.csail.mit.edu/6.851/spring14/ ">Advanced Data Structures</a>.
I don’t have the data but will be very surprised if CLRS didn’t sell more copies than any other algorithms textbook ever published.
A recent testament to the popularity of CLRS is the fact that its first author <a href="http://www.quora.com/Thomas-Cormen-1">Thomas Cormen</a> is about as popular on <a href="http://quora.com/">Quora</a> as the <a href="http://www.quora.com/Barack-Obama">President of the United States</a> (this fact probably tells more about the kind of people who are active on Quora though).</p>
<p>Over years multiple alternatives have emerged, among which I would like to mention two: the Berkeley-UCSD “<a href="http://www.amazon.com/Algorithms-Sanjoy-Dasgupta/dp/0073523402/ref=sr_1_1?ie=UTF8&qid=1431203159&sr=8-1&keywords=papadimitriou+vazirani">Algorithms</a>” by Dasgupta, Papadimitriou and Vazirani and the Cornell’s “<a href="http://www.amazon.com/Algorithm-Design-Jon-Kleinberg/dp/0321295358/ref=sr_1_1?ie=UTF8&qid=1431203094&sr=8-1&keywords=kleinberg+tardos ">Algorithm Design</a>” by Kleinberg and Tardos.
Both books were published in 2005-06 and to the best of my knowledge second editions aren’t available yet.
One of the main differences between these newer books and CLRS is their concise style and focus on high-level ideas rather than low-level details.
However, all the books discussed above are starting to show their age.
A litmus test is the fact that they either don’t mention <a href="http://en.wikipedia.org/wiki/Chernoff_bound">Chernoff bounds</a> at all or mention them as an exercise or in one of the last chapters where they are barely used. I would expect a modern algorithms textbook to introduce concentration bounds early on and then use them heavily throughout the course.</p>
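To illustrate the point about concentration bounds, a few lines of Python are enough to show the exponential decay in deviation probability that a Chernoff-Hoeffding bound predicts for sums of independent fair coin flips. This is a hypothetical classroom demo (not taken from any of the books above), with made-up parameter values:

```python
import math
import random

# Empirically estimate Pr[|X/n - 1/2| >= eps] for X = number of heads among
# n fair coin flips, and compare against the Hoeffding bound 2*exp(-2*eps^2*n).
random.seed(0)  # fixed seed so the demo is reproducible

def deviation_probability(n, eps, trials=2000):
    bad = 0
    for _ in range(trials):
        heads = sum(random.randint(0, 1) for _ in range(n))
        if abs(heads / n - 0.5) >= eps:
            bad += 1
    return bad / trials

for n in (100, 400, 1600):
    emp = deviation_probability(n, eps=0.1)
    bound = 2 * math.exp(-2 * 0.1 ** 2 * n)
    print(f"n={n:5d}  empirical={emp:.4f}  bound={bound:.2e}")
```

At n=100 a 10% deviation is still fairly common; by n=1600 it essentially never happens, matching the exponentially small bound.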
<p>Recently among textbooks I haven’t seen any strong newcomers, which might be partly due to the fact that books are somewhat passe these days (the only notable exception off the top of my head is a recent book “<a href="http://www.cs.cornell.edu/jeh/book11April2014.pdf ">Foundations of Data Science</a>” by Hopcroft and Kannan which is very interesting but has a somewhat different goal so I don’t see how an introductory algorithms class can be based solely on it).</p>
<h2>Courses</h2>
<p>In search for a modern algorithms curriculum let’s now turn to the classes taught recently at some of the schools in the U.S. whose class pages are publicly available.
At some schools different instructors teach the class in different years, so here I just picked one at random to save space.
Certain topics appear consistently in all of these classes (sorting/median, hashing, dynamic programming, greedy algorithms, cuts and flows, BFS/DFS, union-find, MST, FFT, shortest paths, etc.) so I will focus on the differences which make these classes unique.</p>
<ul>
<li> At MIT the “Design and Analysis of Algorithms” class is taught by Erik Demaine.
Here is the <a href="http://stellar.mit.edu/S/course/6/sp15/6.046J/materials.html ">most recent page</a>.
Erik is one of the best living experts on data structures, so it is no surprise that his class is a little heavy on cool data structures, including van Emde Boas trees, skip lists, and range trees, which aren't usually present in a typical algorithms curriculum.
</li>
<li>
At CMU the “Algorithms” class was recently taught by Anupam Gupta and Danny Sleator, <a href="https://www.cs.cmu.edu/~15451/schedule.html">page here</a>.
This is a very interesting class where the instructors made a great effort to include some modern topics such as linear programming, zero-sum games, streaming algorithms for big data, online algorithms, machine learning, and gradient descent, together with some advanced data structures (splay trees and segment trees).
</li>
<li>At Berkeley the class was recently taught by David Wagner, <a href="http://www-inst.eecs.berkeley.edu/~cs170/fa14/ ">page here</a>.
The class is based on the DPV book and also serves as an introduction into Theoretical Computer Science (primarily because it discusses in detail NP-completeness, which is either not present in other classes or only mentioned briefly).
The non-standard topics include an intro to machine learning, streaming algorithms (CountMin sketch) and PageRank.
</li>
<li>
At Cornell the class was recently taught by Eva Tardos and David Steurer, <a href="http://www.cs.cornell.edu/courses/CS4820/2015sp/lectures/ ">page here</a>.
Not surprisingly, the class is heavily KT-based.
Among unusual topics there is a lot of NP-hardness and computability (Turing machines, Church-Turing, undecidability, etc.) + a large module on approximation algorithms.
Modern topics include Nash equilibria, best expert algorithm (multiplicative weights) and stable matching.
Overall, this class has a strong bias towards foundations and approximation algorithms + an AGT/learning spin to it.
</li>
<li> At Stanford the class is taught this semester by Virginia Williams, <a href="http://web.stanford.edu/class/cs161/syllabus.html">page here</a>.
This is a traditional CLRS-based class. Since Stanford is on a quarter system this class is shorter than others. For more advanced algorithms courses at Stanford see <a href="http://web.stanford.edu/class/cs168/index.html">CS168</a>, <a href="http://theory.stanford.edu/~tim/cs261/cs261.html ">CS261</a>, <a href="http://theory.stanford.edu/~virgi/cs267/">CS267</a> and <a href="http://theory.stanford.edu/~virgi/cs367/index.html ">CS367</a>. In particular, CS168, "<a href="http://web.stanford.edu/class/cs168/index.html">The Modern Algorithmic Toolbox</a>" is a great example of an advanced modernized algorithms class. According to private channels a modernized version of the algorithms curriculum is currently under construction at Stanford.
</li>
<li>At Harvard the Data Structures and Algorithms class is taught by Jelani Nelson, <a href="http://sites.fas.harvard.edu/~cs124/cs124/syllabus.html ">page here</a>.
This is also a fairly traditional CLRS/KT-based class with a touch of linear programming and approximation algorithms.
</li>
<li>At UIUC the class is taught by Jeff Erickson whose <a href="http://web.engr.illinois.edu/~jeffe/teaching/algorithms/ ">lecture notes</a> basically form a book.
Non-standard topics include matroids, a heavy emphasis on randomized algorithms, and amortized data structures.
</li>
</ul>
<h1>The Brave New O of the Big N</h1>
<p>Finally, I would like to suggest some ideas for a modern algorithms curriculum.
As I mentioned in the motivational discussion above I believe that there are three fundamental guidelines: simplicity, implementability / test of time and potential for the future.
None of the proposed topics is particularly new and all of them have been tested in advanced graduate level classes at different schools with accessible expositions available.
Petabytes of data are getting crunched daily using these techniques and most of them have been implemented in a variety of software packages.</p>
<ul>
<li><b>Randomized and approximation algorithms.</b> Concentration bounds and tail inequalities early on. Examples of simple randomized and approximation algorithms that are actually used in practice, e.g. PageRank, Set Cover, etc. There is a lot of mileage in these basic algorithms. </li>
<li><b>Linear programming.</b> LP basics/duality + approximation algorithms.
Since this topic has already made it into a large number of courses discussed above, I won't discuss it in much detail.
An implicit goal here is getting students to think of LP solvers as an off-the-shelf solution available for a wide class of problems. </li>
<li><b>Basics of machine learning and learning theory.</b> Core learning ideas: perceptron, boosting, VC-dimension, multiplicative weights. In order to strengthen connections with machine learning one can emphasize clustering problems in other parts of the course (SVD, k-means, single-linkage clustering, nearest neighbor, etc.) </li>
<li><b>Linear sketching.</b> This is probably the most recent topic (see these <a href="http://users.dcc.uchile.cl/~pbarcelo/mcg.pdf ">two</a> <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/wNow.pdf ">surveys</a> by Andrew McGregor and David Woodruff), but I strongly believe that by now the field is mature enough to be covered in the intro class.
A good example of a linear sketch is the <a href="http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch ">CountMin</a> data structure.
It is a stronger version of the <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>, which is one of the most widely used data structures.
The basic philosophy here is surprisingly powerful: CountMin allows one to maintain an approximate version of the most basic data structure, an array, using space independent of the array's size.
Taking this further, linear sketching is a very powerful tool for designing algorithms for massive data regardless of the computational model. Whether it is streaming, MapReduce or take your pick, linear sketches are often the best solution known and/or proved to be optimal. They can also be implemented using basic linear algebraic primitives (see next bullet).
</li>
<li><b>Algorithms based on linear algebraic primitives.</b>
I think that avoiding combinatorial magic is the key to making algorithms robust to the choice of the computational model and also more parallel (see next bullet).
Whenever there is a solution that only uses basic linear algebra, it might be a good idea to prefer it over a combinatorial algorithm even if the latter is a little faster and/or easier to implement from scratch.
Regardless of the computational model, one can expect linear algebraic primitives to be already implemented there (e.g. in MATLAB).
A good example here is the All-Pairs-Shortest-Paths problem (see Uri Zwick's <a href=" http://www.diku.dk/PATH05/Uri1.pdf">slides</a> for details).
Other examples are PageRank, applications of SVD and linear sketching algorithms described above.
</li>
<li><b>Parallel algorithms and data structures.</b> When faced with multiple algorithmic alternatives it might be a good idea to pick one that is parallelizable.
E.g. among Prim's, Kruskal's, and Boruvka's algorithms for MST, Boruvka's is the winner here because it is the only one that is not inherently sequential. This fact is used in a variety of parallel MST algorithms.
Linear sketching is again going to be handy here.
Algorithms based on sorting and hash tables are good since these primitives are often very efficiently implemented in parallel systems (e.g. Hadoop, DHT).
</li>
<li><b>Data structures and NP-completeness => Advanced classes.</b> In order to make some room for the suggestions described above, I would suggest reducing the discussion of these topics to the bare minimum necessary to cover the core algorithmic ideas.
I believe that each of them by itself deserves to be covered in a separate class. With hundreds of students enrolled these topics start to feel too specialized for an introductory algorithms class.
NP-completeness can be combined with other topics in computational complexity and automata theory to make it a semester long course.
Data structures seem to go naturally with advanced algorithms as another course. To spice things up one can even add <a href="http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504 ">purely functional data structures</a>.
This may sound a little controversial but there seems to be a general tendency towards moving away from data structures among the books and curricula discussed above.
As for NP-completeness, I think it depends on whether a separate class on the theory of computing is offered which for any good school I really believe should be the case.
</li>
</ul>
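To make the linear sketching bullet above concrete, here is a minimal CountMin sketch in Python. This is an illustrative toy: the hash family and parameter values are my own choices for brevity, not taken from any particular implementation. It maintains approximate counts in space independent of the number of distinct items, and each query overestimates the true count by a small additive error with good probability.

```python
import random

class CountMin:
    """Toy CountMin sketch: depth rows of width counters with 2-wise independent hashes."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Hash h(x) = ((a*x + b) mod p) mod width, a simple 2-wise independent family.
        self.p = 2_147_483_647  # Mersenne prime 2^31 - 1
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(depth)]

    def _cells(self, x):
        for row, (a, b) in enumerate(self.hashes):
            yield row, (a * x + b) % self.p % self.width

    def update(self, x, count=1):
        for row, col in self._cells(x):
            self.table[row][col] += count

    def query(self, x):
        # True count <= min over rows, since each cell can only over-count.
        return min(self.table[row][col] for row, col in self._cells(x))

cm = CountMin(width=200, depth=5)
for item in [1] * 1000 + [2] * 300 + list(range(3, 500)):
    cm.update(item)
print(cm.query(1), cm.query(2))  # close to the true counts 1000 and 300, from above
```

Since the updates are linear in the input, two sketches built on different machines can be merged by adding their tables entry-wise, which is exactly what makes this structure useful in streaming and MapReduce settings.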
<p><a href="http://grigory.us/blog/modern-intro-algorithms/">Modern Algorithms or The Brave New O of the Big N</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on May 09, 2015.</p><![CDATA[Models for Parallel Computation (Hitchhiker's Guide to Massively Parallel Universes)]]>http://grigory.us/blog/massively-parallel-universes2015-05-03T00:00:00+00:002015-05-03T00:00:00+00:00Sergei Vassilvitskiihttp://grigory.us/blogsergei@cs.stanford.edu<div align="center"><img alt="MPC" src="http://grigory.us/blog/pics/parallel-models.jpg" /> </div>
<p>(This blog post is a joint effort with Sergei; all typos and missing Oxford commas are mine.) The quest to make massively parallel computation easily accessible to everyone has been a daunting one for many generations of computer scientists and engineers.
While far from being complete, with cloud computing infrastructure available through <a href="http://aws.amazon.com/ec2/">Amazon EC2</a>, <a href="https://cloud.google.com/compute/">Google Compute Engine</a>, and
other similar platforms a significant milestone has been reached.
At the same time the quest to establish rigorous theoretical foundations of massively parallel computing has led to development of multiple theoretical models.
Despite different modeling assumptions underlying these models, many parallel algorithmic techniques can be used in some or all of them with minor modifications. Moreover, in some restricted scenarios even direct simulations are available. In this post we discuss some of the most popular theoretical models for parallel computing and the relationships between them.</p>
<h1>MapReduce</h1>
<p>We will strongly emphasize connections between these different models and the modern MapReduce model for computation in the cloud (MRC) that we described <a href="http://grigory.us/blog/mapreduce-model/">here</a>.
As a reminder, the MRC model is specified by the number <b>M</b> of identical machines, each having <b>S</b> bits of local memory.</p>
<div align="center"><img alt="MapReduce Storage" src="http://grigory.us/blog/pics/mr-storage.png" /></div>
<p>The goal of the algorithm design is to minimize the number of parallel rounds of computation which we denote as <b>R</b>.
In each communication round the number of bits sent and received by each machine is at most <b>S</b> (in fact, in most cases only the bound on the incoming communication matters).</p>
<div align="center"> <img alt="MapReduce Computation Diagram" src="http://grigory.us/blog/pics/mr-computation-diagram.png" /></div>
<p>We will use the Minimum Spanning Tree (MST) problem as a benchmark for comparison between the models.
In MRC the MST problem can be solved in a constant number of rounds for sufficiently dense graphs.
As shown through a filtering technique by Lattanzi et al. <a href="http://theory.stanford.edu/~sergei/papers/spaa11-matchings.pdf ">here</a>
for graphs with <script type="math/tex">|E| = n^{1 + c}</script> edges <script type="math/tex">\lceil c/\epsilon\rceil</script> rounds suffice, assuming <script type="math/tex">S = O(n^{1+\epsilon})</script> and <script type="math/tex">M = O(n^{c-\epsilon})</script> so that the total space is <script type="math/tex">M * S = O(|E|)</script>.</p>
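To see how the parameters in this bound interact, here is a tiny illustrative calculation (toy values, all constants suppressed):

```python
import math

def filtering_bounds(n, c, eps):
    """Back-of-the-envelope check (constants suppressed) of the filtering
    bounds for MST on a graph with |E| = n^(1+c) edges: ceil(c/eps) rounds,
    with total space M*S matching the input size |E|."""
    edges = n ** (1 + c)
    rounds = math.ceil(c / eps)
    total_space = n ** (1 + eps) * n ** (c - eps)  # M * S = n^(1+c)
    return edges, total_space, rounds

edges, total_space, rounds = filtering_bounds(n=10**6, c=0.5, eps=0.1)
print(rounds)  # ceil(0.5 / 0.1) = 5 rounds
```

Shrinking eps buys fewer machines of larger memory at the price of more rounds, while the total space stays pinned at the input size.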
<h1>Bulk Synchronous Parallelism</h1>
<p>The Bulk Synchronous Parallel model (BSP) was introduced by Leslie Valiant in 1990 in his seminal article “<a href="http://web.mit.edu/6.976/www/handout/valiant2.pdf ">A Bridging Model for Parallel Computation</a>”.
While this model has been subsequently <a href="http://people.seas.harvard.edu/~valiant/bridging-2010.pdf">refined to capture multicore computing</a> in 2008, here we focus on the original BSP model.
BSP and MRC are very closely related: the key idea behind BSP, breaking computation into synchronized supersteps, later formed the basis of MRC.</p>
<p>BSP computation assumes a set of <b>p</b> processors, each with local memory. The computation proceeds in a series of globally synchronized supersteps. There are three parameters that are combined to give the cost of a BSP computation:</p>
<ul>
<li>The number of processors, <b>p</b>.</li>
<li>The number of timesteps needed to synchronize, <b>l</b> (communication latency).</li>
<li>The number of timesteps needed to send one word of memory to a different machine, <b>g</b> (communication gap).</li>
</ul>
<p>Some of the descriptions of the models also include the speed of each processor (instructions/sec), <b>s</b>.
But we can easily factor that out when working with homogeneous machines.</p>
<p>The cost of a superstep where each processor does at most <b>x</b> operations and sends/receives at most <b>h</b> words is then: <b>l</b> + <b>x</b> + <b>g</b> * <b>h</b>.</p>
<p>The total work is <b>p</b> times the cost of the superstep, and the efficiency is the ratio of the best sequential algorithm to the total work (over all supersteps). Note that since latency and gap may depend on the number of processors, the total cost is superlinear in the number of processors.</p>
<p>One of the goals of BSP was to give the analytically best algorithm for different settings of the parameters. This would allow one to decide what is the fastest or most work efficient algorithm for a particular setting. It is not surprising then that some of the early work focused on how to minimize values of <b>g</b>, <b>l</b> in various network topologies (torus, hypercube, butterfly, etc.). Additional work also measured these across different networks realized in practice.</p>
<p>One way to generalize the BSP model is to account for the fact that communication costs are not linear (sending 1MB between machines is much cheaper than sending 1,000,000 distinct one-byte messages). We can model this by letting <b>G</b>() be a function of the number of words sent. In traditional BSP then, <script type="math/tex">G(h) = g * h</script>.
In the MRC model of computation, <script type="math/tex">G(h)</script> is discretized: <script type="math/tex">G(h) = \lceil h / S\rceil * K</script> for some large constant <b>K</b> that dwarfs all computation costs. Such a choice of cost function implies that in order to make the best use of communication, the computation should be broken into rounds with at most <script type="math/tex">S</script> bits sent between rounds.
In MRC <b>l</b> is taken to be O(1) as synchronization can proceed at any time.</p>
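The two cost models are easy to put side by side in code. The snippet below (with made-up parameter values) evaluates the classic BSP superstep cost and the discretized MRC-style communication cost G(h):

```python
import math

def bsp_superstep_cost(l, g, x, h):
    """Classic BSP superstep: latency + local work + linear communication cost g*h."""
    return l + x + g * h

def mrc_comm_cost(h, S, K):
    """MRC-style discretized communication G(h) = ceil(h/S) * K: a full (huge)
    round cost K is paid per S bits communicated."""
    return math.ceil(h / S) * K

# Toy parameters: latency 100, gap 4, 10^6 ops of local work, 5000 words sent.
print(bsp_superstep_cost(l=100, g=4, x=10**6, h=5000))  # 100 + 10^6 + 4*5000

# In the discretized model, sending up to S bits costs one round; S+1 bits cost two.
S, K = 10**6, 10**9
print(mrc_comm_cost(S, S, K) // K, mrc_comm_cost(S + 1, S, K) // K)
```

The step function makes the design goal explicit: once you pay for a round, there is no incentive to send fewer than S bits in it.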
<p>Due to a large number of parameters the algorithmic results in the BSP model tend to be bulky to state.
We refer the reader to <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.3708&rep=rep1&type=pdf">this paper</a> by Adler et al. for the details of the MST algorithm in the BSP.
To the best of our knowledge there aren’t many algorithmic results in the BSP model. This might be partially due to its invention having been ahead of its time and partially due to a large number of parameters which makes it not very friendly for algorithm design.
One of the key contributions of the MRC model is reduction in the number of parameters making algorithmic results cleaner and easier to state and compare.</p>
<h1>Parallel Random Access Machines</h1>
<p>The model is characterized by:</p>
<ul>
<li> The number of processors, <b>p</b>.</li>
<li> The size of the shared memory, <b>m</b>.</li>
<li> The size of local memory available to each processor, <b>l</b>.</li>
</ul>
<p>Furthermore, read/write access to the shared memory might be implemented differently:</p>
<ul>
<li><b>ER/CR</b>: Exclusive Read/Concurrent Read allow read access to each shared memory cell by either only one processor (ER) or any number of processors (CR) during each superstep. </li>
<li><b>EW/CW</b>: Exclusive Write/Concurrent Write allow write access to each shared memory cell by either only one processor (EW) or any number of processors (CW). If multiple write operations to a shared memory cell occur under CW, there are multiple policies for resolving which value gets written (priority-based, random, arbitrary, etc.). </li>
</ul>
<p>Among the four possible combinations of read/write access rules only three are typically considered: EREW, CREW, and CRCW (which requires specifying a conflict resolution policy). Since we only discuss PRAMs very briefly here, we refer the reader to <a href="https://www.cs.cmu.edu/~guyb/papers/BM04.pdf">this paper</a> by Guy Blelloch and Bruce Maggs and <a href="http://www.cs.cmu.edu/afs/cs/academic/class/15499-s09/www/ ">this class</a> at CMU for a comprehensive introduction to PRAMs.</p>
<p>A restricted class of EREW PRAM algorithms can be simulated in MRC with only a constant overhead in the number of rounds.
The basic idea is the following.
Assuming that:</p>
<ul>
<li><b>m</b> + <b>p</b> * <b>l</b> < <b>M</b> * <b>S</b> (the total memory used by PRAM algorithm is less than the total memory available to MRC)</li>
<li><b>l</b> < <b>S</b> (local memory in PRAM is less than in MRC, which is reasonable given that typically <b>p</b> is much bigger than <b>M</b>)</li>
</ul>
<p>one can simulate EREW access to all the <script type="math/tex">m</script> shared memory cells in <script type="math/tex">O(1)</script> rounds while using <script type="math/tex">p * l</script> memory to perform the local computations.
This is done by assigning <script type="math/tex">(p * l) / S</script> machines to simulate <script type="math/tex">S/l</script> PRAM processors each and <script type="math/tex">m/S</script> machines to simulate the shared memory cells. Since the read/write requests are exclusive they can be directly communicated between the simulated processors and the simulated memory cells. Note that simulating concurrent reads might overload the machines simulating the memory if much more than <script type="math/tex">S</script> requests are simultaneously submitted.
See Theorem 7.1 <a href="http://www.eecs.harvard.edu/~michaelm/E210/modelmapreduce.pdf">here</a> for the details (modulo the ER vs. CR issue discussed above).
Also note that while this simulation preserves the number of rounds well, it might lead to time-inefficient algorithms since multiple PRAM processors might be simulated sequentially on a single MRC machine.</p>
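The accounting in this simulation is easy to get wrong, so here is a toy sketch of the feasibility checks and machine allocation described above (the function and parameter values are mine, purely for illustration):

```python
import math

def simulate_erew_allocation(m, p, l, M, S):
    """Allocate MRC machines to simulate an EREW PRAM with p processors,
    l words of local memory each, and m cells of shared memory, following
    the conditions in the text. Returns (processor_machines, memory_machines)
    or raises if the simulation does not fit."""
    if m + p * l >= M * S:
        raise ValueError("PRAM uses more total memory than MRC has")
    if l >= S:
        raise ValueError("PRAM local memory must fit inside one MRC machine")
    per_machine = S // l                        # each MRC machine hosts S/l PRAM processors
    proc_machines = math.ceil(p / per_machine)  # ~ (p*l)/S machines for processors
    mem_machines = math.ceil(m / S)             # m/S machines hold the shared memory
    if proc_machines + mem_machines > M:
        raise ValueError("not enough MRC machines")
    return proc_machines, mem_machines

# Toy numbers: 10^6 processors with 10^3 words of local memory each and 10^9
# shared cells, simulated on 10^4 machines with 10^6 words of memory each.
print(simulate_erew_allocation(m=10**9, p=10**6, l=10**3, M=10**4, S=10**6))
```

With these numbers 1000 machines simulate the processors and another 1000 hold the shared memory, comfortably within the 10^4 available machines.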
<h1>Models with Restricted Communication</h1>
<p>In models with restricted communication the input graph corresponds to a communication network between <script type="math/tex">n</script> machines.
Initially each machine has the list of its neighbors as the input.
Moreover, the communication is restricted in one of the two ways:</p>
<ul>
<li>Messages can only be sent over an underlying network between the machines, i.e. in every round each machine can only talk to its neighbors.
In this case for graph problems which depend on the entire input the diameter of the network <b>D</b> often gives a lower bound on the number of rounds since messages must be propagated through the network.
</li>
<li>Messages are restricted in size. When discussing size-restricted messages below we will always assume that the bound on the message size is <b>W</b> = O(log n) where n is the input size.</li>
</ul>
<p>The three models discussed below (LOCAL, CONGEST and CONGESTED CLIQUE) correspond to the three possible combinations of these two restrictions.
In all these models the computational power of the machines is either unbounded or limited to polynomial time computation on their input.
We discuss these models only briefly here since they are more applicable to settings such as sensor networks rather than massively parallel computing. A good introduction into algorithmic techniques is given in <a href="http://people.csail.mit.edu/ghaffari/DGA14/">this class</a> at MIT.</p>
<h2>LOCAL</h2>
<p>The most basic of the restricted communication models is LOCAL.
In this model the communication is restricted to the underlying network while message size is unbounded.
This allows any problem to be solved trivially in <b>D</b> rounds. Lower bounds in this model show that <script type="math/tex">\Omega(D)</script> rounds are necessary for computing a Minimum Spanning Tree and for 2-coloring, even on instances as simple as an even cycle.</p>
<h2>CONGEST</h2>
<p>In the CONGEST model restrictions are imposed both on the communication network and the message size (<b>W</b>=O(log n)).
Two flavors of this model exist: one that allows sending arbitrary messages and one that only allows each machine to broadcast the same message to all of its neighbors. Below we discuss the first of these two versions, which has been studied more extensively.
In this model the number of rounds necessary and sufficient for computing a Minimum Spanning Tree is <script type="math/tex">\tilde O(D + \sqrt{n})</script>.</p>
<h2>CONGESTED CLIQUE</h2>
<p>Finally in the CONGESTED CLIQUE model we only have the message size restriction (<b>W</b>=O(log n)).
In this model a Minimum Spanning Tree can be computed deterministically in <script type="math/tex">O(\log \log n)</script> rounds (Lotker, Patt-Shamir, Pavlov, Peleg ‘05).
A <a href="http://arxiv.org/pdf/1412.2333v1.pdf">recent preprint</a> shows an <script type="math/tex">O(\log \log \log n)</script>-round randomized algorithm.
This model is the closest of the three to the MRC model because there is no restriction on the communication topology.
In some restricted scenarios there exist simulations of algorithms for CONGESTED CLIQUE in the MRC model, see <a href="http://arxiv.org/pdf/1405.4356.pdf ">here</a>.
However, in general the CONGESTED CLIQUE model is incomparable to MRC since both the outgoing and incoming communication for each machine are allowed to be linear in the input size while for MRC both are strictly sublinear.
In particular, sparse graph connectivity can be solved in CONGESTED CLIQUE in one round by sending all edges to a single machine.</p>
<h1>The “Big Data” Model</h1>
<p>The “Big Data” model introduced in <a href="http://arxiv.org/pdf/1311.6209">this paper</a> is a generalization of the CONGESTED CLIQUE model. Instead of having the number of machines being the same as the number of vertices in the graph, the number of machines is treated as a parameter <script type="math/tex">k \le n</script>.
The input graph is vertex partitioned between these <script type="math/tex">k</script> machines. In one round any pair of machines is allowed to communicate using messages of size <b>W</b>=O(log n).
The close relationship to the CONGESTED CLIQUE model allows existing algorithmic techniques to be reused.
Near-optimal results for many fundamental problems in the “big data” model are given <a href="http://www.researchgate.net/profile/Hartmut_Klauck/publication/258849574_The_Distributed_Complexity_of_Large-scale_Graph_Processing/links/541a5a450cf203f155ae22e7.pdf">here</a>.
In particular for computing a Minimum Spanning Tree <script type="math/tex">\tilde O(n/k)</script> rounds are necessary and sufficient.</p>
<p>Among all models with restricted communication the “big data” model is the one most similar to MRC.
However, a hard bound on communication leads to very strong lower bounds in this model such as the <script type="math/tex">\tilde \Omega(n/k)</script> lower bound for MST discussed above.</p>
<p><a href="http://grigory.us/blog/massively-parallel-universes/">Models for Parallel Computation (Hitchhiker's Guide to Massively Parallel Universes)</a> was originally published by Sergei Vassilvitskii at <a href="http://grigory.us/blog">The Big Data Theory</a> on May 03, 2015.</p><![CDATA[MapReduce and RDBMS: Practice and Theory]]>http://grigory.us/blog/rdbms-mapreduce2015-04-02T00:00:00+00:002015-04-02T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<div align="center"><img alt="Mapreduce and RDBMS" src="http://grigory.us/blog/pics/mapreduce-rdbms.png" /> </div>
<p>Congratulations to Michael Stonebraker on winning the ACM Turing Award last week!
Michael is recognized for his fundamental contributions to the concepts and practices underlying modern database systems. It is somewhat unfortunate though that the RDBMS community and the MapReduce crowd ended up being split apart after the 2010 CACM articles <a href="http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext">MapReduce: A Flexible Data Processing Tool</a> (by Jeffrey Dean and Sanjay Ghemawat) and <a href="http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext">MapReduce and Parallel DBMSs: Friends or Foes</a> (by Michael Stonebraker et al.).</p>
<p>Stonebraker’s criticism of MapReduce/Hadoop started back in 2008 with a post <a href="http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html">MapReduce: A major step backwards</a>. It has only changed slightly over the last 7 years (see e.g. <a href="http://cacm.acm.org/blogs/blog-cacm/177467-hadoop-at-a-crossroads/fulltext">this</a>).
A good example is <a href="https://www.youtube.com/watch?v=OYGJe1z97VI">Michael’s talk</a> at XLDB12 which I found fun and educational.
Moreover, I tend to agree with most of what Michael says except when it comes to Hadoop (e.g. “Hadoop is right at the top of the Gartner group hype cycle”, M.S. 2012), because over the 7 years of his criticism Hadoop has become the most successful open source platform for general purpose massively parallel computing.
It probably won’t be surprising to see Dean and Ghemawat winning the Turing Award in the future for making massively parallel computing commonplace.</p>
<p>I believe that both in theory and in <a href="http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/">practice</a> MapReduce and RDBMSs are apples and oranges in the big data universe. Both are crunching hundreds of petabytes of data these days. However, in my experience some people still think that there is a way to directly compare the two and determine a single winner. I heard about Stonebraker’s criticism of Hadoop so many times and in places so diverse (e.g. from one of my running club buddies on the Penn track as well as during my visit to Princeton in a conversation with one of the professors I highly respect) that I decided to write a blog post about it. I will try to “bite the bull by the horns” (expression courtesy of Ken Clarkson) and summarize the advantages of each paradigm from my experience both in practice and in theory.</p>
<h1>Advantages of MapReduce</h1>
<div align="center"><img alt="MapReduce = Magic Hammer" src="http://grigory.us/blog/pics/mapreduce-hammer.png" /> </div>
<p>MapReduce paradigm has emerged as a universal tool for a specific type of parallel computing.
I would compare it with a magic hammer that in theory allows you to do almost anything you might want. While a “Swiss army knife” RDBMS solution would certainly be more efficient for the specific tasks it has been designed for, the magic hammer of MapReduce works for almost any problem that is possible to parallelize.</p>
<ul>
<li> <b> Universality.</b> A big advantage of MapReduce is its universality. In particular, it offers a software engineer efficient low-level access to the data, which makes it possible to handle completely unstructured, messy data.
<!--It is a great advantage for algorithm designers that MapReduce doesn't impose any restrictions on the format of the data.
It also offers low-level access for the software engineer who can manipulate data entries without any restrictions on the type of queries. -->
For example, many graph algorithms can be easily implemented in MapReduce, while general purpose databases don't play well with graphs. This is a well-known issue and also the reason why specialized graph databases such as <a href="http://neo4j.com/">Neo4j</a> exist.
While learning how to use MapReduce takes some time and programming experience, in my experience good software engineers can pick it up fairly quickly. This is why for top companies such as Google, Facebook, etc. the learning curve and the cost of skillful engineers don't seem to be an issue.
</li>
<li><b>Customization.</b> During the 10 years of its existence the base Hadoop layer has been extended by many different frameworks that can run on top of it.
Examples of such free frameworks are <a href="http://spark.apache.org/">Spark</a> (greatly improved raw Hadoop efficiency), <a href="http://hortonworks.com/hadoop/storm/">Apache Storm</a> (streaming support) and others. In particular, most of Stonebraker's criticism regarding the inefficiency of Hadoop no longer applies because of these improvements. Once the inefficient higher-level Hadoop layers are replaced, only HDFS remains untouched. According to Stonebraker himself: "I don't have any problem with HDFS, it is a parallel file system <...> by all means go ahead and use it".
Companies such as Google and Facebook are running their own custom versions of Hadoop/MapReduce and while most of the details are secret we routinely hear in the news about petabytes of user data being crunched daily in such systems.
</li>
<li> <b> Support of your favorite programming language.</b> With <a href="http://hadoop.apache.org/docs/r1.2.1/streaming.html">Hadoop Streaming</a> one can use any programming language. </li>
<li> <b>Open source.</b> Apache Hadoop is an easy-to-learn open source implementation of the MapReduce framework.</li>
</ul>
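<p>To make the Hadoop Streaming point above concrete, here is a minimal word-count sketch in Python of the mapper/reducer contract that Streaming expects: the mapper emits tab-separated key-value pairs, Hadoop sorts them by key, and the reducer sees equal keys consecutively. The function names are mine; in a real job each function would be a standalone script reading stdin.</p>

```python
# Mapper: emit "<word>\t1" for every word in the input.
def map_words(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# Reducer: Hadoop Streaming sorts mapper output by key, so equal
# words arrive consecutively; sum the counts for each run of a key.
def reduce_counts(sorted_pairs):
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

<p>A real run would look something like <code>hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in -output out</code> (the jar name and paths depend on your installation).</p>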
<h1>Advantages of Parallel RDBMSs</h1>
<div align="center"><img alt="RDBMS = Swiss army knife" src="http://grigory.us/blog/pics/rdbms-swiss-knife.png" /> </div>
<p>Database management system technology has been perfected over more than 40 years, becoming a “Swiss army knife”-type solution for big data management.</p>
<ul>
<li> <b>Efficient processing of typical queries on relational and some other types of data.</b> For relational data the efficiency of parallel RDBMSs is outstanding. I am not aware of successful attempts to beat the performance of RDBMSs on their home turf using general purpose frameworks for massively parallel computing (e.g. by using Hive on Hadoop, discussed below). Moreover, specialized database systems also exist for other types of structured data such as graphs (e.g. Neo4j), sparse arrays, etc. Just like with a Swiss Army knife, if a certain application can be directly handled by an RDBMS then it is probably handled well, and most common use cases are covered.
</li>
<li> <b>Simplicity.</b> While this is clearly subjective and might change over time, currently the learning curve for MapReduce users seems to be much steeper than for those who use an RDBMS.
Simplicity also means that it costs less to employ data analysts who can work with RDBMSs. </li>
</ul>
<h1>In Theory</h1>
<p>
As a theorist I am very excited about the fact that the performance of MapReduce-style systems can be systematically analyzed using a rigorous theoretical framework. See my earlier <a href="http://grigory.us/blog/mapreduce-model/">blog post</a> for the details of the formal theoretical model for MapReduce.
</p>
<p>
It is also great to see MapReduce-style algorithms making their way into advanced algorithms classes focused on dealing with big data at many top schools. Some examples that I am aware of are:
<ul>
<li>“<b><a href="http://www.cs.columbia.edu/~coms699812/">Dealing with Massive Data</a></b>” by Sergei Vassilvitskii at Columbia.</li>
<li>“<b><a href="http://people.seas.harvard.edu/~minilek/cs229r">Algorithms for Big Data</a></b>” by Jelani Nelson at Harvard.</li>
<li>“<b><a href="http://web.stanford.edu/~ashishg/amdm/ ">Algorithms for Modern Data Models</a></b> ” by Ashish Goel at Stanford.</li>
<li> “<b><a href="http://www.cs.utah.edu/~jeffp/teaching/cs7960.html">Models of Computation for Massive Data</a></b>” by Jeff Phillips at the University of Utah.</li>
</ul>
There are other examples too. In fact, these days almost every theoretical class about algorithms for big data that I am aware of covers MapReduce algorithms.
</p>
<p>
Moreover, there are clean, hard open problems raised by the MapReduce model whose resolution would have strong implications for the rest of theoretical computer science, including such fundamental parts as circuit complexity, communication complexity and approximation algorithms, as well as more modern areas such as streaming algorithms.
For example, a notoriously hard question (see details <a href="http://grigory.us/blog/mapreduce-model/">here</a>) is: "<b>Can sparse undirected graph connectivity be solved in o(log |V|) rounds of MapReduce? Hint: Probably, no.</b>" Resolving open questions of this kind will not only surprise the practitioners but might also win you a best paper award at one of the top theory conferences (most likely not because of MapReduce itself but because of the other deep consequences such a result would have).
On the other hand, I am unaware of open questions in databases which would have the same level of appeal to the theoretical community.
</p>
<p>
The flagship theory conference STOC 2015, together with the 27th ACM Symposium on Parallelism in Algorithms and Architectures (colocated at FCRC 2015), will host a 1-day workshop "<b>Algorithmic Frontiers of Modern Massively Parallel Computation</b>" focused on theoretical foundations of MapReduce-style systems and directions for future research, which I am co-organizing together with Ashish Goel and Sergei Vassilvitskii. I will post the details later, so stay tuned if you are interested.
</p>
<h1>P.S. Apple-oranges</h1>
<div align="center"><img alt="Apple-Orange + Hive" src="http://grigory.us/blog/pics/orange-apple-hive.png" /> </div>
<p>While there is room for apple-orange hybrids, none of them seem to have been successful so far. It seems to be common sense that SQL-on-Hadoop, just like an apple-on-orange, is not a great idea in terms of performance, and the limited success of attempts such as Hive on Hadoop seems to confirm this. Using low-level programming languages such as C++ with RDBMSs is also possible (see e.g. <a href="http://www.sqlapi.com/">SQLAPI</a>). However, the advantages of RDBMSs described above most likely vanish if you do so.</p>
<p><a href="http://grigory.us/blog/rdbms-mapreduce/">MapReduce and RDBMS: Practice and Theory</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on April 02, 2015.</p><![CDATA[Sublinear Day at MIT]]>http://grigory.us/blog/sublinear-day2015-03-10T00:00:00+00:002015-03-10T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<div align="center"><img alt="Sublinear Day: April 10" src="http://grigory.us/blog/pics/sublinear-day-2015.png" /> </div>
<p>The second “Sublinear Algorithms and Big Data Day” will take place at MIT on <b>April 10</b>.
The speakers are: <a href="http://people.csail.mit.edu/costis/ ">Costis Daskalakis</a>, <a href="http://www.wisdom.weizmann.ac.il/~robi/ ">Robert Krauthgamer</a>, <a href="http://people.seas.harvard.edu/~minilek/ ">Jelani Nelson</a>, <a href="http://www.math.rutgers.edu/~ss1984/ ">Shubhangi Saraf</a> and <a href="http://cs.brown.edu/~pvaliant/">Paul Valiant</a>.</p>
<p>For the first time we will have a poster session. The poster proposal submission deadline is very close: <b>March 20</b>.
More information about poster submission, schedule, etc. is available <a href="http://www.gautamkamath.com/sublinearday/">here</a>.</p>
<p>We will really appreciate it if you help us spread the word and hope that the second sublinear day will bring superlinear amounts of research interactions and joy :) Thanks again to <a href="http://www.gautamkamath.com/">Gautam “G” Kamath</a> who is in charge of local arrangements and to <a href="http://people.csail.mit.edu/costis/ ">Costis Daskalakis</a> and <a href="http://people.csail.mit.edu/indyk/">Piotr Indyk</a> for helping make this happen!</p>
<p><a href="http://grigory.us/blog/sublinear-day/">Sublinear Day at MIT</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on March 10, 2015.</p><![CDATA[Happy Sublinear Year!]]>http://grigory.us/blog/happy-sublinear-year2015-01-01T00:00:00+00:002015-01-01T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>The decision to start a blog three months ago proved to be a lot more rewarding and time-consuming than I expected.
Looking at the stats I was thrilled to find out that over its short lifespan this blog has reached 1650 cities in 99 countries.
This is just a notch over half of all countries in the world. Among my favorite places reached are the <a href="http://en.wikipedia.org/wiki/Turks_and_Caicos_Islands">Turks and Caicos Islands</a> (total population ~30 thousand people).</p>
<div align="center"><img alt="Happy 2015!" src="http://grigory.us/blog/pics/2015.png" /> </div>
<p>Looking forward into 2015 I am happy to announce that there will be at least two reasons to rejoice for those interested in sublinear algorithms for big data:</p>
<div>
<ul class="fa-ul">
<li> <i class="fa li fa fa-group"> </i> The second “Sublinear Algorithms and Big Data Day” will take place at MIT on April 10. Thanks to <a href="http://www.gautamkamath.com/">Gautam "G" Kamath</a> who is in charge of local arrangements and to <a href="http://people.csail.mit.edu/costis/ ">Costis Daskalakis</a> and <a href="http://people.csail.mit.edu/indyk/">Piotr Indyk</a> for their support!
This event follows <a href="http://grigory.us/big-data-day.html">the first in this series</a>, which I organized at Brown in 2014, and we really hope to keep this tradition for many years to come.
</li>
<li> <i class="fa li fa fa-group"></i> On August 27-28 <a href="http://dimacs.rutgers.edu/">DIMACS</a> at Rutgers will host a workshop on massively parallel and sublinear algorithms.
The organizers, including <a href="http://www.cs.rutgers.edu/~muthu/ ">Muthu</a>, <a href="http://www.mit.edu/~andoni/ ">Alex Andoni</a> and myself, would like to thank the director of DIMACS <a href="http://www.cs.rutgers.edu/~rwright1/">Rebecca Wright</a> for helping to make this happen. Note that this event will be immediately after RANDOM/APPROX 2015 at Princeton (August 24-26). </li>
</ul>
The details about both events + some more to come will appear on this blog later. Stay tuned and Happy 2015!
</div>
<p><a href="http://grigory.us/blog/happy-sublinear-year/">Happy Sublinear Year!</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on January 01, 2015.</p><![CDATA[Getting a Research Internship]]>http://grigory.us/blog/research-internship2014-12-20T00:00:00+00:002014-12-20T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>For graduate students December is probably the best time to start applying for a summer internship. The process of getting a software engineering internship at places like Google, Facebook, Microsoft, Twitter, Quora, Dropbox, etc. is so much streamlined that there is even a movie about it.</p>
<div align="center"><img alt="Internship movie" src="http://grigory.us/blog/pics/internship-movie.jpg" /></div>
<p><br />
Getting an internship in research is a much more unique experience. Graduate students ask me about this a lot, and I was even once invited to give an informal talk on the topic at the Wharton graduate students statistics seminar. Surprisingly, I haven’t seen any guide about research internships available online, so I decided to write down the things I usually say.</p>
<p>
<b>Disclaimer</b>: advice below is my personal opinion and is biased towards computer science and the U.S, although sometimes applies more broadly (e.g. to pure math, applied math and statistics).
My experience is based on internships at AT&T Research (Shannon Laboratory), IBM Research (Almaden) and Microsoft Research (Silicon Valley and Redmond) and might be somewhat specific to these places, only 50% of which still exist. I also visited Yahoo! and Google Research a lot and can say that these places work similarly but with a few important differences described below.
Last but not least, I tried to do my best to avoid making any comparisons (and especially controversial ones) in terms of quality between different labs and academia.
<div align="center"><img alt="Research labs" src="http://grigory.us/blog/pics/labs.png" /></div>
<h1>Why Do It?</h1>
<p>
Doing a research internship is a great way to meet your future collaborators and friends. I did it 4 times (the most you can do as an F1 student) and spent 50% of my 3-year PhD in research labs. While this is quite unusual, I should say that I enjoyed the experience tremendously and still work and talk regularly with most of my former mentors (<a href="http://dimacs.rutgers.edu/~graham/ ">Graham Cormode</a>, <a href="http://scholar.google.com/citations?user=i5PazXwAAAAJ">Howard Karloff</a>, <a href="http://researcher.watson.ibm.com/researcher/view.php?person=us-dpwoodru ">David Woodruff</a>, <a href="http://www.mit.edu/~andoni/">Alex Andoni</a> and <a href="http://konstantin.makarychev.net/">Konstantin Makarychev</a>), who have also been an invaluable source of advice for me over the years. I met many of my best friends from grad school times in the labs and still visit places where I did an internship whenever I happen to be in the area.
</p>
<h1>Where to Apply?</h1>
<p>
This highly depends on your area of expertise. These days there is no single best place like <a href="http://en.wikipedia.org/wiki/Bell_Labs ">Bell Labs</a> in its glory days but there are multiple options to consider.
<h3>Pure Theory</h3>
For pure theoretical computer science I would suggest starting with MSR and IBM. While I am certainly biased and there can't possibly be a definitive ranking of research labs, I would say that these are the top places:
<ul>
<li> Microsoft Research. <a href="http://research.microsoft.com/en-us/jobs/intern/apply.aspx">Posting</a>, multiple locations (main offices are in Redmond, Cambridge and NYC). Redmond office is the oldest and covers almost all areas. NYC and Cambridge offices are smaller and somewhat similar, being particularly strong in machine learning, social sciences, algorithmic game theory and computational complexity among other areas.</li>
<li> IBM Research. <a href="http://www.research.ibm.com/careers/internships/index.shtml">Posting</a>, multiple locations (main offices are in Yorktown Heights, NY and Bay Area). <a href="http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=4491">Posting</a> from Ken Clarkson about theory positions at IBM Almaden. Almost all major areas are represented in either the Yorktown Heights or the Bay Area location. </li>
</ul>
</p>
<br />
<p>
Another great place is Toyota Technological Institute at Chicago (<a href="http://www.ttic.edu/intern.php">posting</a>). I put it in a slightly different category than the industrial research labs because of its close ties with the University of Chicago and the overall feel of a more academic rather than industrial environment.
While AT&T and Bell are shadows of their past, they still have some amazing people:
<ul>
<li>AT&T Labs – Research. <a href="http://www.research.att.com/internships?fbid=CKoDhzaFztt">Posting</a>, multiple locations (main office is in New Jersey).</li>
<li>Bell Labs. <a href="http://www.alcatel-lucent.com/careers/opportunities-students">Posting</a>, multiple locations (main office is in New Jersey). </li>
</ul>
</p>
<h3>More Applied</h3>
Here is a list of some more applied research places off the top of my head:
<p>
<ul>
<li> Google Research. <a href="https://www.google.com/about/careers/search#t=sq&q=j&li=10&jed=DOCTORATE&jex=PURSUING_DEGREE&je=INTERN&jc=SOFTWARE_ENGINEERING&jc=HARDWARE_ENGINEERING&jc=NETWORK_ENGINEERING&jc=TECHNICAL_INFRASTRUCTURE_ENGINEERING ">Google jobs website</a>, multiple locations (including NYC and Bay Area). Google usually doesn't make a distinction between research and software engineering positions in their search. Once you pass the standard software engineering screening process, you get into the host matching phase and you can find a mentor interested in research. </li>
<li> Yahoo! Labs. <a href="http://labs.yahoo.com/careers/?section=internship">Posting</a>, multiple locations (including NYC, Bay Area and Barcelona).</li>
<li> Facebook Research. <a href="https://www.facebook.com/careers/department?req=a0IA000000CzCGu">Posting</a> from Yann LeCun, multiple locations (including NYC and Bay Area).</li>
<li> Ebay Labs. <a href="https://labs.ebay.com/careers/cesr/">Posting</a>, located in Bay Area. </li>
<li> Technicolor. <a href="http://www.technicolor.com/en/innovation/student-day/job-internship-opportunities-ri-labs">Posting</a>, located in Bay Area.</li>
<li> HP Labs. <a href="http://www.hpl.hp.com/careers/students-and-interns/">Posting</a>, main location in Palo Alto, CA.</li>
<li> NEC Labs. <a href="http://www.nec-labs.com/working-at-nec-labs/internship ">Posting</a>, main location in New Jersey.</li>
<li> VMWare. New lab founded by some of the former Microsoft SVC researchers, main location in Bay Area. Added to the list by suggestion from one of the founding members, <a href="http://udiwieder.wordpress.com/">Udi Wieder</a>.</li>
</ul>
</p>
<br />
<p>
I am only familiar with the first two, which seem to have a weaker commitment to fundamental research than the labs listed before. However, the total number of job and internship openings on this slightly more applied list is probably almost an order of magnitude larger.
</p>
<h3>National Labs</h3>
<p>
There are also many national labs. As a Russian citizen I can't tell you much about experience at these (especially in crypto, where I am sure my list is highly incomplete), but here are some options.
Somewhat surprisingly, even these places sometimes have opportunities for international students, which may not be well advertised.
Here are a couple of places to consider:
<div>
<ul>
<li>Sandia Labs. <a href="http://www.sandia.gov/careers/students_postdocs/internships/">Postings</a>, main locations include Albuquerque, NM and Livermore, CA</li>
<li>Lawrence Berkeley National Lab. <a href="http://education.lbl.gov/Programs/Internships.html">Postings</a>, located in Berkeley, CA.</li>
</ul>
</div>
</p>
<h1>Internship Tips</h1>
<p>
<div>
Internships at research labs and talented interns are both scarce and unique resources so there isn't much data to look at and the decision making process is highly random.
However, there are a few things you can do to improve your chances.
<ul>
<li><b>List your top choices and potential mentors</b>. It helps a lot if you know your future mentors in person. At the very least make sure that you are familiar with their work. A great lab can easily get hundreds of applications. What matters most for the success of your application is whether there is a mentor who will pick you from the pile. Don't hesitate to contact your top choices either directly or through your advisor, but avoid pestering people. Also, mention your potential mentors' names in different parts of your application (forms, research statement, etc.) </li>
<li><b>Apply everywhere</b>. While your chances are significantly reduced when you send a cold application, sometimes there are things you don't know and forces beyond your control. This is especially true for graduate students in their early years.
Indicating your interest may be important by itself, e.g. I gave my first talk at an industrial lab which couldn't offer me an internship but invited for a short visit. It felt a lot like <a href="http://en.wikipedia.org/wiki/Peggy_Olson ">Peggy Olson</a>'s (<a href="http://en.wikipedia.org/wiki/Mad_Men ">Mad Men</a>) first business trip to Richmond but better – thanks to IBM I had no dogs having sex as a view from the hotel room :)
</li>
<li><b>Recommendation letters</b>. Ask your letter writers as soon as possible (ideally at least a month in advance), picking them based on the list of your top choices and other places where you plan to apply. Ask your letter writers for suggestions about places and feedback on your application materials. </li>
<li><b>Research statement</b>. For pure research positions you will need to write a research statement. This is a great opportunity to take time to work on improving your vision. If you are like me then this is a process both difficult and rewarding. Every year when I applied I started by throwing my previous research statement into a trash bin because it looked absolutely terrible. I even remember myself getting very upset once because my vision was such a crap compared to some of the people in the labs where I applied. While your research statement is going to be unlike anyone else's I still recommend looking for inspiration at the research statements of your role models (maybe even potential mentors if they are available). You can ask for feedback on your statement from your advisor, colleagues and friends but I wouldn't expect too much because your statement is truly yours. Make sure you customize some parts of your statement for different places. Finally, for an internship application the research statement often doesn't matter too much, so you don't have to stress too much over it. However, I would still recommend to think of it as a dress rehearsal for your future applications as well as an opportunity to develop your vision.</li>
</ul>
<h1> FAQ</h1>
<ul>
<li><b>How much does it pay?</b> Usually about the same or slightly more than a software engineering internship. I would expect $6–8K/mo (fixed, no negotiation, overtime or bonuses) + standard benefits such as relocation, car rental and housing discounts. So money wise this is certainly better than academia, but can easily be at least two times less than an internship in finance (if you charge overtime, include bonuses, etc.). However, if you are doing what you enjoy most then you might care less about the money.</li>
<li><b>Can I do an internship during the Fall/Spring semester?</b> Yes. The main advantage is that researchers at the lab are likely to be more available during these semesters. Another advantage is that if you are doing a summer internship in the same area then you can do two internships back to back and reduce the pain of relocation (I spent six months in Bay Area this way). There are certain disadvantages: less interns and corporate events during the semester, some schools require you to register for credits even if you are away (read as <q>you and/or your advisor will have to pay money and do some paperwork</q>). </li>
<li><b>Can I do an internship after my last year in grad school?</b> Yes if you graduate after the internship. However, it also depends on the place – Google wouldn't allow me to do this but Microsoft Research did.</li>
<li><b>What if I am on an F1 visa?</b> Then you have to jump through more paperwork hoops and in particular get a CPT. You can accumulate at most 12 months of CPT employment without losing your OPT, which limits the number of typical 3-month internships available to you down to 3 or 4.</li>
</ul>
</div>
</p>
<h1>Internship in Labs vs. Academia</h1>
<p>
In theoretical computer science there is not too much difference in the style of research between research labs and academia.
Also, there have been a lot of discussions online about advantages and disadvantages of each (e.g., <a href="http://greatresearch.org/2013/08/30/industry-or-academia-a-counterpoint/">here</a>, <a href="http://mybiasedcoin.blogspot.com/2009/09/research-labs-vs-academia.html">here</a>, <a href="http://matt-welsh.blogspot.com/2010/11/why-im-leaving-harvard.html ">here</a>, <a href="http://thmatters.wordpress.com/2014/10/14/letter-re-closing-of-microsoft-research-silicon-valley/">here</a> and following the links from there). However, from the intern's perspective these issues are less relevant, e.g. it is highly unlikely that a lab will be shut down during your internship.
<p>
From an intern's point of view, I would say that a few obvious differences are:
<ul>
<div>
<li><b>More face time.</b> If you enjoy having long brainstorming sessions lasting for several hours every day then an industrial lab might be an ideal place for you. In academia you are unlikely to see your advisor more than twice a week for a couple of hours. This means that at an industrial lab you can make a lot of progress on one project in a very short period of time. In my experience, industrial researchers tend to have personalities suitable for thinking long hours on a deep problem together with an environment that lets them do this. In academia professors' busy schedules seem to interfere with research a lot and graduate students often work a lot either by themselves or with other students. Coming from team programming competitions background I really enjoyed these long brainstorming sessions in the labs.</li>
<li><b>Patents.</b> There are <a href="http://www.thisamericanlife.org/radio-archives/episode/441/when-patents-attack">many</a> <a href="http://www.thisamericanlife.org/radio-archives/episode/496/when-patents-attack-part-two">controversies</a> around patents, but ultimately they play a very important role at research labs (e.g. <a href="http://en.wikipedia.org/wiki/Nathan_Myhrvold ">Nathan Myhrvold</a>, the founder of the controversial <a href="http://en.wikipedia.org/wiki/Intellectual_Ventures">Intellectual Ventures</a>, was also the founder of Microsoft Research). While doing an internship, keep in mind that some parts of your research may later be filed as a patent. Depending on the company, you might get some money for this. Also, it is an interesting experience to see a paper converted into a patent by lawyers.
</li>
<li><b>Social aspects.</b> Researchers at labs work and interact with each other a lot more than professors do. This also includes going for lunch as a group and means that you can have lunch with some of the biggest stars in your field every day! You can do lots of other things together too, such as running, cycling, ping pong, etc. I got into triathlons during the group rides at IBM Almaden.
</li>
</div>
</ul>
<h1>Alternatives</h1>
If you can't get an internship at your dream lab you still have multiple options. While I haven't tried them, many of my friends did.
<p>
<div>
<ul>
<li><b>Unpaid Internship.</b> Unfortunately, not all great labs are well-funded. If you can't find a paid position but the lab is interested in working with you sometimes your advisor can pay you from their grant.</li>
<li><b>Visiting a Lab. </b> Sometimes you can get paid for a short or long visit (usually works only for well-funded labs). </li>
<li><b>Consulting.</b> This is a slightly unusual option for a graduate student, but some of my friends did this.
It probably works best if there is a lab next to the place where you live and does involve some paperwork.
Getting hired as a consultant is also sometimes a way to keep your access to the company's data after an internship if, say, you are still doing experiments for a paper you started while being at the lab.</li>
<li><b>Fellowship.</b> Many fellowships come together with internship opportunities. These are even better because you are not tied to a specific location/mentor. A <a href="http://www.cs.cmu.edu/~gradfellowships/">great list of fellowships</a> is maintained by CMU.</li>
<li><b>Visiting Another University.</b> Summer might be a good time to visit another university because professors are not teaching, although they might be traveling. Visiting during the Fall/Spring semester might be also good but for exactly the opposite reasons.</li>
</ul>
</div>
</p>
</p></p></p>
<p><a href="http://grigory.us/blog/research-internship/">Getting a Research Internship</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on December 20, 2014.</p><![CDATA[Massively Parallel Clustering: Overview]]>http://grigory.us/blog/mapreduce-clustering2014-11-02T00:00:00+00:002014-11-02T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<div>
<p>
Clustering is one of the main vehicles of machine learning and data analysis.
In this post I will describe how to make three very popular sequential clustering algorithms (<a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a>, <a href="http://en.wikipedia.org/wiki/Single-linkage_clustering ">single-linkage clustering</a> and <a href="http://en.wikipedia.org/wiki/Correlation_clustering ">correlation clustering</a>) work for big data. The first two algorithms can be used for clustering a collection of feature vectors in \(d\)-dimensional Euclidean space (like the two-dimensional set of points in the picture below; they also work for high-dimensional data). The last one can be used for arbitrary objects as long as one can define some measure of similarity for any pair of them.
</p>
<div align="center"><img alt="Massively Parallel Clustering" src="http://grigory.us/blog/pics/mapreduce-clustering.png" /></div>
<br />
<p>
Besides optimizing different objective functions these algorithms also give qualitatively different types of clusterings.
K-means produces a set of exactly k clusters. Single-linkage clustering gives a hierarchical partitioning of the data, which one can zoom into at different levels and get any desired number of clusters.
Finally, in correlation clustering the number of clusters is not known in advance and is chosen by the algorithm itself in order to optimize a certain objective function.
</p>
<p>
All algorithms described in this post use the <a href="http://grigory.us/blog/mapreduce-model/">model for massively parallel computation</a> that I described before.
</p>
<h1> K-Means</h1>
<br />
<p>
The first algorithm is a parallel version of an approximation algorithm for <a href="http://en.wikipedia.org/wiki/K-means_clustering">K-Means</a>, one of the most widely used clustering methods.
Given a set of vectors \(v_1, \dots, v_n \in \mathbb R^d\) the goal of k-means is to partition them into \(k\) clusters \(S_1, \dots, S_k\) such that the following objective is minimized:
$$\sum_{i = 1}^k \sum_{j \in S_i} ||v_j - \mu_i||^2,$$ where \(\mu_i = \frac{1}{|S_i|}\sum_{j \in S_i} v_j\) is the center (or mean) of the \(i\)-th cluster and \(||\cdot||\) is the Euclidean distance.
Intuitively, the goal is to pick a partitioning that minimizes the total variance.
K-means works great for partitioning into compact groups like those in the picture below.
<div align="center"><img alt="K-Means" src="http://grigory.us/blog/pics/kmeans.png" /></div>
</p>
<h3>K-means++ and K-means||</h3>
<p>
An algorithm for k-means that gives a clustering of cost within a multiplicative factor \(O(\log k)\) of the optimum was given by <a href="http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf">Arthur and Vassilvitskii</a>. Here is their algorithm, called K-means++:
<div>
<ul>
<li> Let \(\mathcal C = \{v_i\}\) for a random vector \(v_i\).</li>
<li> Repeat \(k - 1\) times: let \(\mathcal C = \mathcal C \cup \{u\}\), where \(u\) is a random vector drawn from the probability distribution assigning to each \(v_i\) probability $$p_i(\mathcal C) = \frac{d(v_i, \mathcal C)^2}{\sum_{i} d(v_i, \mathcal C)^2},$$ where \(d(u,\mathcal C) = \min_{x \in \mathcal C}{||u - x||}.\)</li>
</ul>
</div>
</p>
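<p>
Here is a minimal sequential Python sketch of the K-means++ seeding step above (my own illustration, not code from the paper), using \(D^2\)-weighted sampling:
</p>

```python
import random

def d2(u, centers):
    """Squared Euclidean distance from point u to its nearest chosen center."""
    return min(sum((a - b) ** 2 for a, b in zip(u, c)) for c in centers)

def kmeans_pp_seed(points, k, rng=random.Random(0)):
    """Pick k initial centers: first one uniformly at random, each next one
    with probability proportional to d(v, C)^2 (K-means++ seeding)."""
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        weights = [d2(p, centers) for p in points]
        # sample the next center with probability p_i = d(v_i, C)^2 / sum_i d(v_i, C)^2
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

<p>
Note that a point already in \(\mathcal C\) has weight zero, so it is never picked twice.
</p>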
<p>
However, K-means++ is sequential and takes at least \(k\) rounds.
A parallel version of it called K-means|| is due to <a href="http://arxiv.org/pdf/1203.6402.pdf ">Bahmani, Moseley, Vattani, Kumar and Vassilvitskii</a>:
<ul>
<li>Let \(\mathcal C = \{v_i\}\) for a random vector \(v_i\) </li>
<li>Let \(\psi = \sum_{i} d(v_i, \mathcal C)^2\) be the initial cost of the clustering.</li>
<li>Repeat \(O(\log \psi)\) times:</li>
<ul>
<li>Let \(\mathcal C'\) be a set of \(O(k)\) points each sampled independently from the distribution assigning to \(v_i\) probability \(p_i(\mathcal C)\) defined above. </li>
<li> \(\mathcal C = \mathcal C \cup \mathcal C'\) </li>
</ul>
<li>For each \(x \in \mathcal C\), let \(w_x\) be the number of points belonging to this center</li>
<li>Recluster the weighted points in \(\mathcal C\) into \(k\) clusters using K-means++</li>
</ul>
</p>
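<p>
The steps above can be sketched sequentially in Python as follows (my own illustration: in real MapReduce the per-point sampling in each round runs in parallel across machines, and the oversampling probability \(\min(1, \ell \cdot d^2/\psi)\) with oversampling parameter \(\ell = O(k)\) follows the spirit, not the letter, of the paper):
</p>

```python
import random

def d2(u, centers):
    return min(sum((a - b) ** 2 for a, b in zip(u, c)) for c in centers)

def kmeans_parallel_seed(points, ell, rounds, rng=random.Random(0)):
    """K-means||-style seeding: oversample ~ell centers per round, then
    weight each chosen center by the number of points it attracts."""
    C = [rng.choice(points)]
    for _ in range(rounds):
        total = sum(d2(p, C) for p in points)
        if total == 0:  # every point is already a center
            break
        # each point joins C' independently, proportionally to its squared distance
        C += [p for p in points
              if p not in C and rng.random() < min(1.0, ell * d2(p, C) / total)]
    # w_x = number of points whose nearest center is x
    weights = [0] * len(C)
    for p in points:
        weights[min(range(len(C)), key=lambda i: d2(p, [C[i]]))] += 1
    return C, weights  # recluster these weighted centers into k groups via K-means++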
<br />
<p>
The potential \(\psi\) can be shown to be at most \(poly(n)\) by discretizing the space by a grid with step size \(1/poly(n)\), and moving each point to the closest grid point, which only perturbs the cost of the solution by a negligible factor.
Thus we have \(\log \psi = O(\log n)\).
The final reclustering can be performed in one round, assuming that \(O(k \log n)\) weighted centers fit on a single machine.
Thus, the total number of rounds in the algorithm above is \(O(\log n)\).
It can be shown that the solution produced by this algorithm has cost within \(O(\log k)\) of the optimum.
For more details, see either the original paper or these <a href="http://grigory.us/files/km++.pdf">slides</a> from our reading group by <a href="http://www.cis.upenn.edu/~wuzhiwei/">Steven Wu</a>.
</p>
<h1>Single-Linkage Clustering</h1>
<p>
<a href="http://en.wikipedia.org/wiki/Single-linkage_clustering">Single-linkage clustering</a> is another standard technique in data analysis and information retrieval. It can be used to produce a <a href="http://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a> of the data (see also <a href="http://nlp.stanford.edu/IR-book/pdf/17hier.pdf">Chapter 17</a> in the <a href="http://nlp.stanford.edu/IR-book/">Information Retrieval</a> book by Manning, Raghavan and Schütze and <a href="http://infolab.stanford.edu/~ullman/mmds/ch7a.pdf">Chapter 7.2</a> in the <a href="http://www.mmds.org/">Mining of Massive Datasets</a> book by Leskovec, Rajaraman and Ullman).
</p>
<p>
For two clusters \(S_i\) and \(S_j\) the single linkage distance is defined as:
$$D(S_i, S_j) = \min_{v \in S_i, u \in S_j} d(v,u).$$
In general, \(d(\cdot, \cdot)\) can be an arbitrary distance function, but for points in Euclidean space it is most natural to use \(d(v,u) = ||v - u||\).
The goal of single linkage clustering in Euclidean space is to partition the set of vectors \(v_1, \dots, v_n \in \mathbb R^d\) into clusters \(S_1, \dots, S_k\) such that the following objective is maximized:
$$\min_{i < j} D(S_i, S_j) = \min_{i < j} \min_{v \in S_i, u \in S_j} ||v - u||.$$
</p>
<p>
In fact, it is easy to see that the set of clusters \(S_1, \dots, S_k\) maximizing the objective above can be obtained by constructing a Euclidean Minimum Spanning Tree and taking the \(S_i\)'s to be the connected components that remain after removing the \(k - 1\) longest edges of this tree.
Thus, single-linkage clustering works best for finding clusters defined by the connectivity structure. In particular, it can be used to solve the following example, which is hard for k-means because points in each cluster are far from their average. This example is typically given as a motivation for spectral clustering, which I don't discuss in this post, but it can also be addressed using single-linkage:
<div align="center"><img alt="Single Linkage" src="http://grigory.us/blog/pics/singlelinkage.jpg" /></div>
</p>
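<p>
The MST connection gives a short sequential implementation: running Kruskal's algorithm on the complete Euclidean graph and stopping once \(k\) components remain is the same as removing the \(k-1\) longest MST edges. A sketch (my own illustration, quadratic in \(n\) and purely sequential):
</p>

```python
import math
from itertools import combinations

def single_linkage(points, k):
    """Single-linkage k-clustering = Kruskal stopped at k components,
    i.e. the Euclidean MST with its k-1 longest edges removed."""
    parent = list(range(len(points)))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return [find(i) for i in range(len(points))]  # cluster label per point
```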
<p>
As explained above, the Euclidean Minimum Spanning Tree can be used to produce a hierarchical single-linkage clustering for any number of clusters.
However, it is not known how to compute such a tree exactly in a small number of rounds of MapReduce.
For any constant dimension \(d\) Euclidean Minimum Spanning Tree of cost within \((1 + \epsilon)\)-factor of optimum can be computed in constant number of rounds of MapReduce. This is a result from our joint paper with <a href="http://www.mit.edu/~andoni/ ">Alexandr Andoni</a>, <a href="http://onak.pl ">Krzysztof Onak</a> and <a href="http://paul.rutgers.edu/~anikolov/">Aleksandar Nikolov</a>, which appeared in STOC 2014.
I will cover this algorithm in one of the future posts, but for now you can use the <a href="http://grigory.us/files/talks/upenn14.pptx">slides</a> of my talk about it.
</p>
<h1>Correlation Clustering</h1>
<p>
<a href="http://en.wikipedia.org/wiki/Correlation_clustering">Correlation clustering</a>
can be used to cluster an arbitrary collection of \(n\) objects, so for this type of clustering it is not necessary that they can be represented by vectors in Euclidean space.
The only requirement is that for every pair of objects \(i\) and \(j\) it should be possible to compare them directly
and obtain a measure of dissimilarity \(w(i,j) \in [0,1]\). Here \(w(i,j) = 0\) means that the objects are exactly the same, while
\(w(i,j) = 1\) means that they are completely different and the values in between correspond to different degrees of dissimilarity.
</p>
<p>
The objective of correlation clustering is to minimize the total cost of mistakes incurred by the clustering.
For a set of clusters \(\mathcal C = \mathcal C_1, \dots, \mathcal C_k\) let the indicator function \(x(i,j)\) be defined as \(x(i,j) = 0\) if \(i\) and \(j\) are in the same cluster and \(x(i,j) = 1\) otherwise.
The total cost of the clustering is expressed as a function of w's and x's as follows:
$$\sum_{i < j \colon x(i,j) = 0} w(i,j) + \sum_{i < j \colon x(i,j) = 1} 1 - w(i,j).$$
Note that the number of clusters is not fixed and the algorithm has to choose it in order to optimize the objective function above.
The picture below (courtesy of <a href="http://www.cs.yale.edu/homes/el327/">Edo Liberty</a>) uses edges to represent similar pairs (\(w(i,j) = 0\)) and non-edges for dissimilar pairs (\(w(i,j) = 1\)). On the right, pairs misclassified by the clustering are shown in red, so the overall cost of this clustering equals 4.
<div align="center"><img alt="Correlation Clustering" src="http://grigory.us/blog/pics/cc.png" /></div>
</p>
<p>
There exist many approximation algorithms for correlation clustering. In particular, using linear programming one can obtain a clustering of total cost within a multiplicative factor 2.5 of the optimum. This is a result of <a href="http://dimacs.rutgers.edu/~alantha/papers2/acn05conf.pdf">Ailon, Charikar and Newman</a>.
Moreover, if the weight function \(w\) satisfies triangle inequalities then the approximation of their algorithm becomes 2.
The linear programming relaxation for this problem is naturally formulated using triangle inequalities:
$$\text{Minimize: }\sum_{i<j} w(i,j)\cdot (1 - x(i,j)) + (1 - w(i,j)) \cdot x(i,j)$$
$$x(i,j) \le x(i,k) + x(k,j), \text{ } \forall i, j, k$$
$$0 \le x(i,j) \le 1$$
Note that for \(x(i,j) \in \{0,1\}\) this program exactly captures the correlation clustering problem.
Recently, in joint work with <a href="http://pages.cs.wisc.edu/~shuchi/">Shuchi Chawla</a>, <a href="http://konstantin.makarychev.net/">Konstantin Makarychev</a> and <a href="http://www.cs.berkeley.edu/~tschramm/">Tselil Schramm</a> we have shown that there is a rounding scheme that achieves approximations 2.06 and 1.5 for these two cases, which is very close to the <a href="http://en.wikipedia.org/wiki/Linear_programming_relaxation#Approximation_and_integrality_gap">integrality gaps</a> of this linear programming relaxation (2 and 1.2 for the general and triangle inequality cases respectively).
</p>
<p>
While the linear programming approach is hard to implement in MapReduce there is a very simple combinatorial algorithm due to Ailon, Charikar and Newman, which achieves a 3-approximation in general and a 2-approximation if the weights satisfy triangle inequalities.
First, define two sets of edges \(E^+ = \{(i,j) | w(i,j) < 1/2\}\) and \(E^- = \{(i,j) | w(i,j) \ge 1/2\}\). This means that we will treat pairs with dissimilarity below \(1/2\) as similar and those with dissimilarity at least \(1/2\) as dissimilar.
Now a set of clusters, which achieves the approximations stated above can be constructed using the following algorithm:
<ul>
<li>Pick a random object \(i\)</li>
<li>Set \(\mathcal C = \{i\}\), \(V' = \emptyset\)</li>
<li>For all \(j \neq i\):</li>
<ul>
<li>If \((i,j) \in E^+\) then add \(j\) to \(\mathcal C\)</li>
<li>Else if \((i,j) \in E^-\) then add \(j\) to \(V'\) </li>
</ul>
<li>Let \(G'\) be the subgraph induced by \(V'\)</li>
<li>Return the clustering consisting of \(\mathcal C\) together with the set of clusters produced by applying this algorithm recursively to \(V'\)</li>
</ul>
</p>
<br />
<p>
The analysis of approximation achieved by this algorithm cleverly uses linear programming duality. However, this approach is very sequential in nature and might take \(O(n)\) rounds of MapReduce if implemented as is.
A KDD 2014 paper by <a href="http://bit.ly/1zqPNzX">Chierichetti, Dalvi and Kumar</a> shows that a substantially modified version of the pivoting algorithm above achieves a \((3 + \epsilon)\)-approximation in \(O\left(\frac{\log n \log \Delta^+}{\epsilon}\right)\) rounds, where \(\Delta^+\) is the maximum degree in the graph induced by \(E^+\).
</p>
</div>
<p><a href="http://grigory.us/blog/mapreduce-clustering/">Massively Parallel Clustering: Overview</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on November 02, 2014.</p><![CDATA[Penn Big Data Reading Group]]>http://grigory.us/blog/big-data-reading2014-10-28T00:00:00+00:002014-10-28T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<div align="center"><img alt="UPenn Big Data Reading Group" src="http://grigory.us/blog/pics/upenn-big-data-reading.jpg" /></div>
<p>This semester I am running a reading group on <a href="http://grigory.us/big-data-reading.html">algorithms for big data</a> at UPenn.
The goal is to cover some of the most important papers in streaming and massively parallel computation, which came out in the last 5-10 years.
This area is evolving extremely fast, so I will be grateful for any suggestions about the papers missing from <a href="http://grigory.us/big-data-reading.html#topics">our list</a>.
Special thanks to <a href="https://sites.google.com/site/silviolattanzi/">Silvio Lattanzi</a> for a few suggestions he gave me while I was visiting Google NYC on Friday!</p>
<p>I also want to give a shout-out to the Penn theory graduate students, who made presentations at the group meetings so much more enjoyable than I ever imagined. Thank you, <a href="http://www.cis.upenn.edu/~wuzhiwei/">Steven Wu</a>, <a href="http://www.cis.upenn.edu/~justhsu/">Justin Hsu</a>, <a href="http://hans.math.upenn.edu/~ryrogers/">Ryan Rogers</a> and <a href="http://www.seas.upenn.edu/~sassadi/">Sepehr Assadi</a>! Some of them are going to hit the internship job market soon, so keep an eye out if you are looking to hire in algorithms for big graphs and other data.</p>
<p><a href="http://grigory.us/blog/big-data-reading/">Penn Big Data Reading Group</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on October 28, 2014.</p><![CDATA[Model for Massively Parallel Computation]]>http://grigory.us/blog/mapreduce-model2014-10-12T00:00:00+00:002014-10-12T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.us/bloggrigory@grigory.us<p>In this post I will introduce a theoretical model for computation in centralized distributed massively parallel computational systems (or, in short, clusters like those used by Google and many other companies). Over the last decades supercomputer architecture has moved towards such designs and there seem to be no signs of this trend slowing (see the Wikipedia <a href="http://en.wikipedia.org/wiki/Supercomputer_architecture">article</a> for more information).</p>
<div align="center"><img alt="MapReduce cluster" src="http://grigory.us/blog/pics/cluster.png" /></div>
<h1 id="mapreduce-style-computation">MapReduce-style Computation</h1>
<p><a href="http://en.wikipedia.org/wiki/MapReduce ">MapReduce</a> is a programming model for cluster computing introduced by <a href="http://en.wikipedia.org/wiki/Jeff_Dean_(computer_scientist)">Jeff Dean</a> and <a href="http://research.google.com/pubs/SanjayGhemawat.html ">Sanjay Ghemawat</a> in their <a href="https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean_html/">seminal paper</a>.
There exist <a href="http://en.wikipedia.org/wiki/MapReduce#Implementations_of_MapReduce ">multiple different implementations</a> of MapReduce, <a href="http://hadoop.apache.org/ ">Apache Hadoop</a> being one of the most popular among them.</p>
<div align="center"><img alt="MapReduce Hadoop" src="http://grigory.us/blog/pics/hadoop-mapreduce.jpg" /></div>
<p>Below I will describe a theoretical version of the model for MapReduce-style computation.
This model is easy to understand, avoiding low-level technical details involved in the implementation of the MapReduce model.
For those familiar with the standard MapReduce implementations, which use key-value pairs and Map/Shuffle/Reduce phases,
let me just say that these two are interchangeable abstractions of the same thing.</p>
<p>This model has emerged in a sequence of papers:</p>
<ul>
<li>Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, Zoya Svitkina: <a href="http://webdocs.cs.ualberta.ca/~svitkina/pub/mr-talg.pdf">On distributing symmetric streaming computations</a>. SODA 2008.</li>
<li>Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: <a href="http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf ">A Model of Computation for MapReduce</a>. SODA 2010.</li>
<li>Michael T. Goodrich, Nodari Sitchinava, Qin Zhang:<a href="http://arxiv.org/pdf/1101.1902.pdf"> Sorting, Searching, and Simulation in the MapReduce Framework</a>. ISAAC 2011</li>
<li>Paul Beame, Paraschos Koutris, Dan Suciu: <a href="http://arxiv.org/pdf/1306.5972.pdf"> Communication steps for parallel query processing</a>. PODS 2013</li>
</ul>
<h2 id="storage-model">Storage Model</h2>
<p>First, let’s discuss the data storage.
Data of size <script type="math/tex">N</script> is partitioned between <script type="math/tex">M</script> identical machines.
Each machine is the standard RAM machine with <script type="math/tex">S</script> bits of RAM.
The data fits into the overall memory with possibly some extra memory left for the algorithm to use so that <script type="math/tex">M \times S = C \times N</script>, where <script type="math/tex">C</script> is an overhead replication factor. Unless otherwise specified, replication will be constant, i.e. <script type="math/tex">C = O(1)</script> so I will ignore it.</p>
<p>Without loss of generality I will assume that <script type="math/tex">M = O(N^\alpha)</script> and <script type="math/tex">S = O(N^{1 - \alpha})</script>. Here <script type="math/tex">\alpha</script> is a constant, which is typically significantly greater than zero, but less than <script type="math/tex">1/2</script> (think of a cluster with thousands of machines, each having gigabytes of RAM).</p>
<div align="center"><img alt="MapReduce Storage" src="http://grigory.us/blog/pics/mr-storage.png" /></div>
<h2 id="computational-steps">Computational Steps</h2>
<p>The key parameter in the study of massively parallel algorithms is the number of supersteps (or rounds) of computation.
The entire computation is divided into such rounds, each consisting of two phases:</p>
<ul>
<li>
<p><b>Local computation phase</b>.
In this phase each machine performs a local computation based on its data.
This computation should be as efficient as possible (ideally linear or close to linear time, sometimes allowing polynomial time for particularly hard problems). Typically local running times for all machines will be identical at a given round so let’s denote them as <script type="math/tex">T_i(S)</script> at round <script type="math/tex">i</script>.</p>
</li>
<li>
<p><b>Communication phase</b>.
In the communication phase each machine can send and receive at most <script type="math/tex">S</script> bits of information.
The limitation on received data comes from the memory bound of every machine.
Note that this doesn’t allow, say, streaming computations to be performed on the fly on the incoming data. The limitation on sent data comes from the technical details of the MapReduce framework. For those familiar with the low-level details I will just say that the key-value pairs have to be stored locally before they get redistributed between machines.</p>
</li>
</ul>
<div align="center"> <img alt="MapReduce Computation Diagram" src="http://grigory.us/blog/pics/mr-computation-diagram.png" /></div>
<h2 id="number-of-rounds">Number of Rounds</h2>
<p>Overall, if the number of rounds is <script type="math/tex">R</script> then the total local computation time is <script type="math/tex">\sum_{i = 1}^R T_i(S)</script>. The total communication time is <script type="math/tex">R \times CC(N)</script>, where <script type="math/tex">CC(N)</script> is the time it takes to redistribute the data between machines in each round.
This parameter depends on the under-the-hood implementation of the system, so I will take it as given.</p>
<p>For example, if local running times are linear then we get a total running time of <script type="math/tex">R \times (O(S)+ CC(N))</script>. This emphasizes the number of rounds <script type="math/tex">R</script> as the key parameter for understanding the complexity of algorithms in MapReduce-like systems.
Other considerations, such as fault-tolerance, also suggest that ideally we would like to have just a few rounds. Having <script type="math/tex">O(1)</script> rounds is great, while <script type="math/tex">O(\log N)</script> rounds might also be acceptable for some problems.</p>
<div align="center"><img src="http://grigory.us/blog/pics/rounds.png" /></div>
<h2 id="examples">Examples</h2>
<p>Let’s look at some examples of how many rounds it takes to solve some basic problems:</p>
<ul>
<li><b>Sorting.</b> <script type="math/tex">O(\log_S N) = O(1)</script> rounds suffice to sort <script type="math/tex">N</script> numbers.
This is a result from: Michael T. Goodrich, Nodari Sitchinava, Qin Zhang:<a href="http://arxiv.org/pdf/1101.1902.pdf"> Sorting, Searching, and Simulation in the MapReduce Framework</a>. ISAAC 2011.</li>
<li><b>Connectivity.</b> <script type="math/tex">O(\log N)</script> rounds suffice to check whether a graph with <script type="math/tex">N</script> edges is connected or not.
This is a result from: Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: <a href="http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf ">A Model of Computation for MapReduce</a>. SODA 2010.</li>
</ul>
<p>In practice it takes two rounds for a terabyte dataset using <a href="http://sortbenchmark.org/YahooHadoop.pdf">TeraSort</a>, which uses essentially the same algorithm as the theoretical <script type="math/tex">O(\log_S N)</script>-round algorithm mentioned above.
Here is a simplified version:</p>
<ul>
<li>Take a random sample of size <script type="math/tex">M - 1</script>.</li>
<li>Assuming that <script type="math/tex">M \le S</script> in the first round we can sort this sample locally on one of the machines, obtaining a sequence <script type="math/tex">a_1 \le a_2 \le \dots \le a_{M-1}</script>.</li>
<li>In the second round send all keys in the range <script type="math/tex">[a_{i - 1}, a_i)</script> to the <script type="math/tex">i</script>-th machine and sort them locally on that machine.</li>
</ul>
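<p>
The simplified TeraSort above is easy to simulate sequentially. Here is a Python sketch (my own illustration) where the "machines" are just lists; splitters from a sorted sample define the key ranges \([a_{i-1}, a_i)\) routed to each machine:
</p>

```python
import bisect
import random

def sample_sort(data, num_machines, rng=random.Random(0)):
    """Two-round sample-sort sketch: round 1 sorts a random sample of
    num_machines - 1 splitters; round 2 routes each key to the machine
    owning its splitter range and sorts locally."""
    splitters = sorted(rng.sample(data, num_machines - 1))
    machines = [[] for _ in range(num_machines)]
    for x in data:
        # keys in [a_{i-1}, a_i) go to machine i
        machines[bisect.bisect_right(splitters, x)].append(x)
    return [sorted(m) for m in machines]  # concatenating the machines yields sorted data
```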
<p>The connectivity algorithm is more complex so I will describe it in more detail below.</p>
<h2 id="connectivity-in-olog-n-rounds">Connectivity in <script type="math/tex">O(\log N)</script> rounds</h2>
<p>The data consists of <script type="math/tex">N</script> edges of an undirected graph on the vertex set <script type="math/tex">V</script>.
The goal is to compute the connected components of this graph.
For every vertex <script type="math/tex">v \in V</script> let <script type="math/tex">\pi(v)</script> be its unique integer id (a number between <script type="math/tex">1</script> and <script type="math/tex">|V|</script>).
During the algorithm we will also maintain a label <script type="math/tex">\ell(v)</script> for each vertex <script type="math/tex">v</script>.
Let <script type="math/tex">L_v \subseteq V</script> be the set of vertices with the label <script type="math/tex">\ell(v)</script>.
During the execution of the algorithm this set will be a subset of the connected component containing <script type="math/tex">v</script>.
We will use <script type="math/tex">\Gamma(v)</script> and <script type="math/tex">\Gamma(S)</script> to denote the set of neighbors of a vertex <script type="math/tex">v</script> and a subset of vertices <script type="math/tex">S \subseteq V</script> respectively.</p>
<p>Here is a high-level description of the algorithm. I will call some of the vertices active. The idea is that every set <script type="math/tex">L_v</script> of vertices with the same label according to <script type="math/tex">\ell</script> will have exactly one active vertex during the execution of the algorithm.</p>
<ul>
<li>Mark every vertex <script type="math/tex">v \in V</script> as <b>active</b> and label <script type="math/tex">\ell(v) = v</script>.</li>
<li>For phases <script type="math/tex">i = 1, 2, \dots, O(\log N)</script> do:
<ul>
<li>Call each <b>active</b> vertex a <b>leader</b> with probability <script type="math/tex">1/2</script>. If <script type="math/tex">v</script> is a <b>leader</b>, mark all vertices in <script type="math/tex">L_v</script> as <b>leaders</b>.</li>
<li>For every <b>active non-leader</b> vertex <script type="math/tex">w</script>, find the smallest <b>leader</b> (with respect to <script type="math/tex">\pi</script>) vertex <script type="math/tex">w^{\star} \in \Gamma(L_w)</script>.</li>
<li>If <script type="math/tex">w^{\star}</script> is not empty, mark <script type="math/tex">w</script> <b>passive</b> and relabel each vertex with label <script type="math/tex">w</script> by <script type="math/tex">w^{\star}</script>.</li>
</ul>
</li>
<li>Output the set of connected components, where vertices having the same label according to <script type="math/tex">\ell</script> are in the same component.</li>
</ul>
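<p>
Here is a sequential Python simulation of the phases above (my own illustration: it tracks label classes directly instead of distributing edges across machines, and flips the leader coin once per active class, which is equivalent since all vertices in a class share one active vertex):
</p>

```python
import random

def connected_components(n, edges, rng=random.Random(1)):
    """Simulate the leader-election connectivity algorithm: each phase, every
    active label class flips a fair coin; a non-leader class adopts the
    smallest neighboring leader label and becomes passive."""
    label = list(range(n))
    active = set(range(n))
    done = False
    while not done:
        leader = {c: rng.random() < 0.5 for c in active}
        relabel = {}  # passive class -> smallest neighboring leader class
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                la, lb = label[a], label[b]
                if la != lb and not leader[la] and leader[lb]:
                    if la not in relabel or lb < relabel[la]:
                        relabel[la] = lb
        for v in range(n):
            if label[v] in relabel:
                label[v] = relabel[label[v]]
        active -= set(relabel)
        done = all(label[u] == label[v] for u, v in edges)
    return label  # vertices in one component share a label
```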
<p>It is easy to see if for two vertices <script type="math/tex">u</script> and <script type="math/tex">v</script> it holds that <script type="math/tex">\ell(u) = \ell(v)</script> then <script type="math/tex">u</script> and <script type="math/tex">v</script> are in the same connected component. It remains to show that every connected component will have a unique label with high probability after <script type="math/tex">O(\log N)</script> phases. We will show that for every connected component in the graph the number of active vertices in this component reduces by a constant factor in every phase.
Indeed, half of the active vertices in every component are declared non-leaders.
Fix an active non-leader vertex <script type="math/tex">v</script>. If there are at least two different labels in the connected component containing <script type="math/tex">v</script> then there exists an edge <script type="math/tex">(v', u')</script> such that <script type="math/tex">\ell(v') = \ell(v)</script> and <script type="math/tex">\ell(u') \neq \ell(v)</script>.
The vertex <script type="math/tex">u'</script> is marked as a leader with probability <script type="math/tex">1/2</script> so in expectation half of the active non-leader vertices will change their label in every phase. Overall, we expect <script type="math/tex">1/4</script> of labels to disappear. By a <a href="http://en.wikipedia.org/wiki/Chernoff_bound">Chernoff bound</a> after <script type="math/tex">O(\log N)</script> phases the number of active labels in every connected component will drop to one with high probability.</p>
<p>Finally, I will leave it as an exercise to check that each phase of the algorithm above can be implemented in a constant number of rounds. Indeed, it is not hard to see that the selection of leaders, the computation of <script type="math/tex">w^{\star}</script> (the smallest label in <script type="math/tex">\Gamma(L_w)</script> for active non-leader nodes <script type="math/tex">w</script>) and relabeling can all be done in a constant number of rounds.</p>
<h2 id="open-problem">Open Problem</h2>
<p>Is it possible to solve connectivity in constant number of rounds? This is a big open problem in the area and the consensus seems to be that this is not possible. In fact, it is open even whether one can distinguish a cycle on <script type="math/tex">N</script> vertices from two cycles on <script type="math/tex">N/2</script> vertices each in constant number of rounds.</p>
<div align="center"><img alt="Connectivity" src="http://grigory.us/blog/pics/connectivity.png" /></div>
<p><a href="http://grigory.us/blog/mapreduce-model/">Model for Massively Parallel Computation</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.us/blog">The Big Data Theory</a> on October 12, 2014.</p>