In this post I will introduce a theoretical model for computation in centralized distributed massively parallel systems (in short, clusters, like those used by Google and many other companies). Over the last decades supercomputer architecture has moved towards such designs, and there seem to be no signs of this trend slowing (see the Wikipedia article on supercomputers for more information).
MapReduce is a programming model for cluster computing introduced by Jeff Dean and Sanjay Ghemawat in their seminal paper "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004). There exist multiple different implementations of MapReduce, Apache Hadoop being one of the most popular among them.
Below I will describe a theoretical version of the model for MapReduce-style computation. This model is easy to understand, avoiding low-level technical details involved in the implementation of the MapReduce model. For those familiar with the standard MapReduce implementations, which use key-value pairs and Map/Shuffle/Reduce phases, let me just say that these two are interchangeable abstractions of the same thing.
This model has emerged in a sequence of papers:
- Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, Zoya Svitkina: On distributing symmetric streaming computations. SODA 2008.
- Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: A Model of Computation for MapReduce. SODA 2010.
- Michael T. Goodrich, Nodari Sitchinava, Qin Zhang: Sorting, Searching, and Simulation in the MapReduce Framework. ISAAC 2011.
- Paul Beame, Paraschos Koutris, Dan Suciu: Communication steps for parallel query processing. PODS 2013.
First, let’s discuss the data storage. Data of size $N$ is partitioned between $M$ identical machines. Each machine is a standard RAM machine with $S$ bits of RAM. The data fits into the overall memory, with possibly some extra memory left for the algorithm to use, so that $M \cdot S = c \cdot N$, where $c \ge 1$ is an overhead/replication factor. Unless otherwise specified, the replication will be constant, i.e. $c = O(1)$, so I will ignore it.
Without loss of generality I will assume that $M = N^{\delta}$ and $S = N^{1-\delta}$. Here $\delta$ is a constant, which is typically significantly greater than zero, but less than $1$ (think of a cluster with thousands of machines, each having gigabytes of RAM).
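To get a feel for these parameters, here is a quick numeric sanity check. The specific values of $N$ and $\delta$ below are my own illustrative assumptions, not numbers from the model itself:

```python
# Illustrative parameters (assumed values, chosen only for this example):
# with N = 10^13 bits of data and delta = 0.25 we get a cluster of a
# couple of thousand machines with a few gigabits of RAM each.
N = 10 ** 13      # total input size in bits
delta = 0.25      # the constant delta from the text (assumed value)

M = round(N ** delta)        # number of machines, N^delta
S = round(N ** (1 - delta))  # bits of RAM per machine, N^(1-delta)

print(M, S)                       # ~1.8k machines, ~5.6 * 10^9 bits each
assert abs(M * S - N) / N < 1e-3  # the data fits into the overall memory
```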
The key parameter in the study of massively parallel algorithms is the number of supersteps (or rounds) of computation. The entire computation is divided into such rounds, each consisting of two phases:
Local computation phase. In this phase each machine performs a local computation based on its data. This computation should be as efficient as possible (ideally linear or close to linear time, sometimes allowing polynomial time for particularly hard problems). Typically the local running times of all machines will be identical at a given round, so let's denote them by $t_i$ at round $i$.
Communication phase. In the communication phase each machine can send and receive at most $S$ bits of information. The limitation on received data comes from the memory bound of each machine. Note that this doesn't allow, say, streaming computations to be performed on the fly on the incoming data. The limitation on sent data comes from the technical details of the MapReduce framework. For those familiar with the low-level details I will just say that the key-value pairs have to be stored locally before they get redistributed between the machines.
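The two phases of a round can be captured in a few lines of code. Below is a minimal single-round simulator; this is my own sketch (the function name `superstep` and the representation of machines as plain Python lists are illustrative assumptions), intended only to make the send/receive limits of $S$ concrete:

```python
# A toy single-round simulator for the model above (illustrative sketch,
# not part of any real MapReduce implementation). Each "machine" holds a
# list of items; a round applies a local computation that returns messages
# addressed to other machines, then enforces the communication limits.

def superstep(machines, local_compute, S):
    """Run one round: local computation phase, then communication phase."""
    inboxes = [[] for _ in machines]
    for data in machines:
        outgoing = local_compute(data)  # list of (destination, item) pairs
        assert len(outgoing) <= S, "machine exceeds its send limit"
        for dest, item in outgoing:
            inboxes[dest].append(item)
    for inbox in inboxes:
        assert len(inbox) <= S, "machine exceeds its receive/memory limit"
    return inboxes

# Toy usage: 4 machines each send their local sum to machine 0.
machines = [[1, 2], [3, 4], [5, 6], [7, 8]]
result = superstep(machines, lambda data: [(0, sum(data))], S=10)
print(result)  # machine 0 now holds all the partial sums
```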
Number of Rounds
Overall, if the number of rounds is $R$ then the total local computation time is $\sum_{i=1}^{R} t_i$. The total communication time is $R \cdot C$, where $C$ is the time it takes to redistribute the data between the machines in each round. This parameter depends on the under-the-hood implementation of the system, so I will treat it as given.
For example, if local running times are linear, i.e. $t_i = O(S)$, then we get a total running time of $O(R \cdot (S + C))$. This emphasizes the number of rounds $R$ as the key parameter for understanding the complexity of algorithms in MapReduce-like systems. Other considerations, such as fault tolerance, also suggest that ideally we would like to have just a few rounds. So having $R = O(1)$ rounds is great, while $R = O(\log N)$ rounds might also be OK for some problems.
Let’s look at some examples of how many rounds it takes to solve some basic problems:
- Sorting. $O(1)$ rounds suffice to sort $N$ numbers. This is a result from: Michael T. Goodrich, Nodari Sitchinava, Qin Zhang: Sorting, Searching, and Simulation in the MapReduce Framework. ISAAC 2011.
- Connectivity. $O(\log N)$ rounds suffice to check whether a graph with $N$ edges is connected or not. This is a result from: Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: A Model of Computation for MapReduce. SODA 2010.
In practice it takes two rounds to sort a terabyte dataset using TeraSort, which uses essentially the same algorithm as the theoretical $O(1)$-round algorithm mentioned above. Here is a simplified version:
- Take a random sample of keys that is small enough to fit on a single machine (say, of size $M \log N$).
- In the first round, sort this sample locally on one of the machines and select from it a sequence of pivots $p_1 \le p_2 \le \dots \le p_{M-1}$.
- In the second round, send all keys in the range $(p_{i-1}, p_i]$ to the $i$-th machine (with the conventions $p_0 = -\infty$ and $p_M = +\infty$) and sort them locally on that machine.
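The steps above can be simulated in a single process as follows. This is a sketch under my own assumptions about the sample size and pivot selection; a real implementation would, of course, keep each bucket on a separate machine:

```python
import random

# Toy simulation of the two-round sample sort sketched above.
def sample_sort(keys, M, sample_size):
    # Round 1: take a random sample, sort it on "one machine" and pick
    # M-1 evenly spaced pivots from the sorted sample.
    sample = sorted(random.sample(keys, sample_size))
    pivots = [sample[(i * sample_size) // M] for i in range(1, M)]
    # Round 2: route every key to the machine owning its pivot range,
    # then sort locally on each machine.
    buckets = [[] for _ in range(M)]
    for k in keys:
        i = sum(p <= k for p in pivots)  # number of pivots <= k = bucket index
        buckets[i].append(k)
    return [sorted(b) for b in buckets]

keys = random.sample(range(10**6), 10**4)   # distinct keys to sort
buckets = sample_sort(keys, M=8, sample_size=800)
flat = [k for b in buckets for k in b]
assert flat == sorted(keys)  # concatenating the machines gives sorted order
```

With a sample this large the buckets are also balanced with high probability, so no machine receives much more than $N/M$ keys.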
The connectivity algorithm is more complex so I will describe it in more detail below.
Connectivity in $O(\log N)$ rounds
The data consists of the edges of an undirected graph on the vertex set $V$, where $|V| = n$. The goal is to compute the connected components of this graph. For every vertex $v$ let $\pi(v)$ be its unique integer id (a number between $1$ and $n$). During the algorithm we will also maintain a label $\ell(v)$ for each vertex $v$. Let $L_v$ be the set of vertices with the label $\ell(v)$. During the execution of the algorithm this set will be a subset of the connected component containing $v$. We will use $\Gamma(v)$ and $\Gamma(U)$ to denote the set of neighbors of a vertex $v$ and of a subset of vertices $U$, respectively.
Here is a high-level description of the algorithm. I will call some of the vertices active. The idea is that every set of vertices with the same label according to $\ell$ will have exactly one active vertex during the execution of the algorithm.
- Mark every vertex $v$ as active and set $\ell(v) = \pi(v)$.
- For phases $i = 1, 2, \dots, O(\log n)$ do:
  - Call each active vertex a leader with probability $1/2$. If $v$ is a leader, mark all vertices in $L_v$ as leaders.
  - For every active non-leader vertex $v$, find the smallest (with respect to $\pi$) leader vertex $w \in \Gamma(L_v)$.
  - If such a leader $w$ exists, mark $v$ passive and relabel each vertex with label $\ell(v)$ by $\ell(w)$.
- Output the set of connected components: vertices having the same label according to $\ell$ are in the same component.
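To make the phase structure concrete, here is a sequential simulation of the algorithm. This is my own sketch: it runs all phases in one process, whereas in the model each phase would be implemented in a constant number of MapReduce rounds:

```python
import random

# Sequential simulation of the leader-based connectivity algorithm above
# (illustrative sketch; vertex ids pi(v) are taken to be 0..n-1).
def connected_components(n, edges, phases):
    label = list(range(n))   # ell(v), initialized to the id pi(v) = v
    active = [True] * n      # exactly one active vertex per label class
    for _ in range(phases):
        # Each active vertex becomes a leader with probability 1/2; its
        # whole label class then counts as a leader class.
        leader = {label[v] for v in range(n)
                  if active[v] and random.random() < 0.5}
        # For every non-leader class, find the smallest leader vertex
        # adjacent to it (the vertex w from the description above).
        best = {}
        for u, w in edges:
            for a, b in ((u, w), (w, u)):
                if label[a] not in leader and label[b] in leader:
                    best[label[a]] = min(best.get(label[a], b), b)
        # Merge every non-leader class that found a leader neighbor.
        relabel = {c: label[best[c]] for c in best}
        for v in range(n):
            if active[v] and label[v] in relabel:
                active[v] = False   # the leader's active vertex survives
        for v in range(n):
            label[v] = relabel.get(label[v], label[v])
    return label

# Toy graph with two components: a path 0-1-2 and an edge 3-4.
labels = connected_components(5, [(0, 1), (1, 2), (3, 4)], phases=100)
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] and labels[3] != labels[0]
```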
It is easy to see that if for two vertices $u$ and $v$ it holds that $\ell(u) = \ell(v)$, then $u$ and $v$ are in the same connected component. It remains to show that every connected component will have a unique label with high probability after $O(\log n)$ phases. We will show that for every connected component in the graph the number of active vertices in this component reduces by a constant factor in expectation in every phase. Indeed, in expectation half of the active vertices in every component are declared non-leaders. Fix an active non-leader vertex $v$. If there are at least two different labels in the connected component containing $v$, then there exists an edge $(u, w)$ such that $u \in L_v$ and $w \notin L_v$. The vertex $w$ is marked as a leader with probability $1/2$, so in expectation half of the active non-leader vertices will change their label in every phase. Overall, we expect a $1/4$ fraction of the labels to disappear in each phase. By a Chernoff bound, after $O(\log n)$ phases the number of active labels in every connected component will drop to one with high probability.
Finally, I will leave it as an exercise to check that each phase of the algorithm above can be implemented in a constant number of rounds. Indeed, it is not hard to see that the selection of leaders, the computation of the smallest leader in $\Gamma(L_v)$ for each active non-leader vertex $v$, and the relabeling can all be done in a constant number of rounds.
Is it possible to solve connectivity in a constant number of rounds? This is a big open problem in the area, and the consensus seems to be that this is not possible. In fact, it is even open whether one can distinguish a cycle on $n$ vertices from two disjoint cycles on $n/2$ vertices each in a constant number of rounds.