### Grigory Yaroslavtsev

Founding Director of the Center for Algorithms and Machine Learning, Assistant Professor of Computer Science, Indiana University.

# What's New in the Big Data Theory 2017

This year I will continue the tradition started last year and summarize a few papers on efficient algorithms for big data that caught my attention last year. Same disclaimers as last year apply and this is by no means supposed to be the list of “best” papers in the field which is quite loosely defined anyway (e.g. I will intentionally avoid deep learning and gradient descent methods here as I am not actively working in these areas myself and there are a lot of resources on these topics already). In particular, this year it was even harder to pick clear favorites so it is even more likely that I have missed some excellent work. Below I will assume familiary with the basics of streaming algorithms and the massively parallel computation model (MPC) discussed in an earlier post.

Before we begin let me quickly plug some of my own work from last year. With my student Adithya Vadapalli we have a new paper Massively Parallel Algorithms and Hardness of Single-Linkage Clustering under $\ell_p$-distances’’. As it turns out, while single-linkage clustering and minimum spanning tree problems are the same for exact computation, for vector data round complexity of approximating these two problems in the MPC model is quite different. In another paper I introduce a study of approximate binary linear sketching of valuation functions. This is an extension of our recent study of binary linear sketching to the case when the function of interest should only be computed approximately.

## New Massively Parallel Algorithms for Matchings

Search for new algorithms for matchings has lead to development of new algorithmic ideas for many decades (motivating the study of the class P of polynomial-time algorithms) and this year is no exception. Two related papers on matchings caught my attention this year:

Both papers are highly technical but achieve similar results. The first paper gives an $O((\log \log |V|)^2)$-round MPC algorithm for the maximum matching problem that uses $O(|V|)$ memory per machine. The second paper improves the number of rounds down to $O(\log \log |V|)$ using slightly larger memory $O(|V| polylog (|V|))$ per machine. Using a standard reduction mentioned in the latter paper both papers can achieve multiplicative $(1+\epsilon)$-approximation for any constant $\epsilon > 0$. These results should be contrasted with the previous work by Lattanzi, Moseley, Suri and Vassilvitskii who give $1/c$-round algorithms at the expense of using $O(|V|^{1 + c})$ memory per machine for any constant $c > 0$. Overall, this is remarkable progress but likely not the end of the story.

## Massively Parallel Methods for Dynamic Programming

Dynamic programming, pioneered by Bellman at RAND, is one of the key techniques in algorithm design. Some would even go as far as saying that there are only two algorithmic tecniques and dynamic programming is one of them. However, dynamic programming programming is notoriously sequential and difficult to use for sublinear time/space computation. Most successful stories of speeding up dynamic programming so far have been problem-specific and often highly non-trivial.

In their paper “Efficient Massively Parallel Methods for Dynamic Programming” (STOC’17) Im, Moseley and Sun suggest a fairly generic approach for designing massively parallel dynamic programming algorithms. Three textbook dynamic programming problems can be handled within their framework:

• Longest Increasing Subsequence: multiplicative $(1+\epsilon)$-approximation in $O(1/\epsilon^2)$ rounds of MPC.
• Optimal Binary Search Tree: multiplicative $(1+\epsilon)$-approximation in $O(1)$ rounds of MPC.
• Weighted Interval Scheduling: multiplicative $(1+\epsilon)$-approxiamtion in $O(\log 1/\epsilon)$ rounds of MPC.

On a technical level this paper identifies two key properties that these problems have in common: monotonicity and decmoposability. Montonicity just requires that the answer to a subproblem should always be at most (for maximization)/at least(for minimization) the answer to the problem itself. Decomposability is more subtle and requires that the problem can be decomposed into a two-level recursive family of subproblems where entries of the top level are called groups and entries of the bottom level are called blocks. It should then be possible to 1) construct a nearly optimal solution for the entire problem by concatenating solutions for subproblems, 2) construct a nearly optimal solution for each group from only a constant number of blocks. While monotonicity holds for many standard dynamic problems, decomposability seems much more restrictive so it is interesting to see whether this technique can be extended to some other problems.

See Ben Moseley’s presentation at the Midwest Theory Day for more details.

## Randomized Composable Coresets for Matching and Vertex Cover

The simplest massively parallel algorithm one can think of can be described as follows: partition the data between $k$ machines, let each machine select a small subset of the data points, collect these locally selected data points on one central machine and compute the solution there. The hardest part here is the design of the local subset selection procedures. Such subsets are called coresets and have received a lot of attention the study of algorihtms for high-dimensional vectors, see e.g. this survey by Agarwal, Har-Peled and Varadarajan.

Note that in the distributed setting construction of coresets can be affected by the initial distribution of data. In fact, for the maximum mathcings problem non-trivially small coresets can’t be used to approximate the maximum matching up to a reasonable error (see our paper with Assadi, Khanna and Li for the exact statement which in fact rules out not just coresets but any small-space representations) if no assumptions about the distribution of the data is made.

However, if the initial distribution of data is uniformly random then the situation changes quite dramatically. As shown in “Randomized Composable Coresets for Matching and Vertex Cover” by Assadi and Khanna (best paper at SPAA’17) for uniformly distributed data coresets of size $O(|V| polylog |V|)$ can be computed locally and then combined to obtain $O(polylog |V|)$-approximation.

## Optimal Lower Bounds for Lp-Samplers, etc

Consider the following problem: a vector $x \in \{0,1\}^n$ (initially consisting of all zeros) is changed by a very long sequence of updates that can flip an arbitrary coordinate of this vector. After seeing this sequence of updates can we retrieve some non-zero entry of this vector without storing all $n$ bits used to represent $x$? Surprisingly, the answer is “yes” and this can be done with only $O(poly(\log n))$ space. If we are required to generate a uniformly random non-zero entry of $x$ then the corresponding problem is called $L_0$-sampling.

$L_0$-sampling turns out to be a remarkably useful primitive in the design of small-space algorithms. Almost all known streaming algorithms for dynamically changing graphs are based on $L_0$-sampling or its relaxation where the uniformity requirement is removed.

While almost optimal upper and lower bounds on the amount of space necessary for $L_0$-sampling have been known since the work of Jowhari, Saglam and Tardos, there were still gaps in terms of the dependence on success probability. If our recovery of a non-zero entry of $x$ has to be successful with probability $1 - \delta$ then the tight bound on space turns out to be $\Theta(\min(n, \log (1/\delta) \log^2 (\frac{n}{\log 1/\delta})))$. This is one of the results of the recent FOCS’17 paper by Kapralov, Nelson, Pahocki, Wang, Woodruff and Yahyazadeh.

## Looking forward to more results in 2018!

Please, let me know if there are any other interesting papers that I missed. Also here is a quick shout out goes to some other papers that were close to making the above list: