The Big Data Theory

Theory Jobs 2024

2024-05-25T00:00:00+00:00

Here is a link to a crowdsourced spreadsheet created to collect information about theory hires this year. Rules for the spreadsheet have been copied from previous years and all edits to the document are anonymized. Please, feel free to contact me directly or post a comment if you have any suggestions about the rules.

You are welcome to add yourself, or people your department has hired.
Separate sheets for faculty, industry and postdocs/visitors.
Hires should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future.

Theory Jobs 2024 was originally published by Grigory Yaroslavtsev at The Big Data Theory on May 25, 2024.

Theory Jobs 2023

2023-05-25T00:00:00+00:00

You are welcome to add yourself, or people your department has hired.
Separate sheets for faculty, industry and postdocs/visitors.
Hires should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future.

Theory Jobs 2023 was originally published by Grigory Yaroslavtsev at The Big Data Theory on May 25, 2023.

Theory Jobs 2022

2022-06-11T00:00:00+00:00

You are welcome to add yourself, or people your department has hired.
Separate sheets for faculty, industry and postdocs/visitors.
Hires should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future.

Theory Jobs 2022 was originally published by Grigory Yaroslavtsev at The Big Data Theory on June 11, 2022.

Theory Jobs 2021

2021-06-12T00:00:00+00:00

While in theory bipartite matching should be easy, it has been observed in practice that real instances of matching theoreticians with jobs are hard. It’s been a particularly unusual year with the entire cycle going virtual. Congrats to matched vertices on both sides!

Here is a link to a crowdsourced spreadsheet created to collect information about theory hires this year. I put in a biased pseudorandom seed, please help populate and share! Rules for the spreadsheet have been copied from previous years and all edits to the document are anonymized. Please, feel free to contact me directly or post a comment if you have any suggestions about the rules.

You are welcome to add yourself, or people your department has hired.
Separate sheets for faculty, industry and postdocs/visitors.
Hires should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future.

Theory Jobs 2021 was originally published by Grigory Yaroslavtsev at The Big Data Theory on June 12, 2021.

Theory Jobs 2020

2020-06-14T00:00:00+00:00

It’s been an unusually challenging year for both sides of the TCS job market with some unexpected obstacles and delays. Apologies for putting up the spreadsheet later than usual and congrats to both sides in each converged process!

Here is a link to a crowdsourced spreadsheet created to collect information about theory jobs this year. I put in a biased pseudorandom seed, please help populate and share! Rules for the spreadsheet have been copied from previous years (with one substantial suggestion regarding senior hires based on one of my friends’ recommendation, see below) and all edits to the document are anonymized. Please, post a comment if you have any suggestions about the rules.

Separate sheets for faculty, industry and postdocs/visitors.
People should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. New: Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future.
You are welcome to add yourself, or people your department has hired.

Theory Jobs 2020 was originally published by Grigory Yaroslavtsev at The Big Data Theory on June 14, 2020.

How I Spent Last Summer FAQ

2019-10-05T00:00:00+00:00

I get a lot of questions about how I spent last summer. Normally I just take off to the Bay Area the day my last Spring class is over and fly back the day before my Fall class begins. However, last summer I decided I’ve been in the US long enough to learn everything it has to offer and it was time to explore life across the pond and spend three months at the Alan Turing Institute in London. Then I had two interns coming over to Bloomington so I spent my first ever summer month here. Since it is that time of year, a quick reminder to apply by Dec 15 if you are interested in doing a Ph.D. and stay tuned for the internship call announcement (probably similar deadline).

Summer Interns in Bloomington

IU has started a Global Talent Attraction Program (GTAP) – fantastic program for international summer interns. The program gives you a $4000 stipend and you spend 2 months here at IU. There were a lot of strong applicants so it took me a while to interview all candidates. In the end, the two interns I got were Jakub Boguta (U. Warsaw, ACM ICPC gold this year, must be tough to be in the lead for 4 hours and not win) and Stanislav Naumov (SPb ITMO, ACM ICPC finalist, who spent summer at Google and just arrived on campus). Also, Farid Arthaud joined us from ENS Paris, Ulm with a short recommendation of being “probably the best third-year CS student in France”. If you think you are the best in your country, have U.S. citizenship and don’t need to get paid, shoot me an email ;) Despite it being hot and humid here in Bloomington during the summer, we had a great time.

We decided to dive into deep learning for image classification and figured out how to get more mileage out of standard pretrained neural nets by using them to produce hierarchical clusterings (with guarantees). If this sounds fun, you can apply for GTAP next year (picture by Farid).

London and the Alan Turing Institute

Overall, this was a great experience as it quickly became clear that my neural net is overfit to the US lifestyle. I think of UK as throwing in some perturbations to your visual and verbal input (some may seem adversarial, but mostly just random) which, as we know, is good for robustness, generalization and what not.

Q: Is grass greener there? A: Yes, of course. Especially, if you live next to the Regent’s Park.
Q: Is it your cup of tea? A: No, I still only function on Redbull, but the afternoon teas are a great experience. Proximity to cutting-edge tech, CS research and startups still matters most to me. However, if you are into math or finance, your mileage will almost certainly vary. Also, London seems perfect for a short-term visit/sabbatical, especially if you want to take a break from the tech hype, write a book, explore Europe, etc.
Q: What’s up with the Alan Turing Institute and DeepMind? A: These two are probably the most happening places in the UK right now in academia and industry respectively. They are within a 5-minute walk from each other in King’s Cross. I was staying right across the road and it was perfect except for no AC. ATI serves as a meeting hub for researchers from all of the top UK schools (Cambridge, Oxford, Warwick, UCL, Edinburgh, etc.). ATI is based inside the British library, which was the largest public building constructed in the UK in the 20th century. ATI has its own space inside the library which is equipped similarly to Google/FB offices. Except no free food, only drinks – would you want to have free British food anyway?

Q: Is Shoreditch the most hip neighborhood? A: I think so, best Sci-Fi graffiti ever.

Q: Did you meet the King? A: Yes, in Heathrow I ran into a 250-pound dude from Atlanta who made it quite clear that’s him by wearing one of these (except in a larger font and in dirty red color).

Q: Is Paris still Paris? A: I think so (my third time). K and I took a 2-hour train down there directly from King’s Cross (St. Pancras station, another reason to stay in King’s Cross). We’ve enjoyed our time greatly, especially in Versailles and ENS Paris, Ulm. The Salvador Dali Museum in Montmartre was another highlight of this trip.

Q: Brexit, Boris Johnson? A: Locals made fun of me for having never heard of Boris Johnson. Is there much to know anyway?

How I Spent Last Summer FAQ was originally published by Grigory Yaroslavtsev at The Big Data Theory on October 05, 2019.

Theory Jobs 2019

2019-05-30T00:00:00+00:00

Apparently, it’s a busy life being an assistant prof so there were no posts here all year. However, while some of us are decompressing after the NeurIPS deadline, here is a link to a crowdsourced spreadsheet created to collect information about theory jobs this year. Congratulations to both job seekers and departments/labs who are done with their searches!

In the past my academic uncle Lance Fortnow set this spreadsheet up (check this link to his post from two years ago which also has links to all the previous years). This year the first entry is Lance himself who is moving back to Chicago to be the Dean of the College of Science at the Illinois Institute of Technology. Did Lance get the idea from his advisor Michael Sipser who is also a Dean of Science but at MIT? In any case, great to see theoretical computer scientists stepping up to be the deans of science, congratulations!

Rules for the spreadsheet have been copied from previous years and all edits to the document are anonymized. Please, post a comment if you have any suggestions about the rules.

Separate sheets for faculty, industry and postdocs/visitors.
People should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors.
You are welcome to add yourself, or people your department has hired.

This document will continue to grow as more jobs settle.

Theory Jobs 2019 was originally published by Grigory Yaroslavtsev at The Big Data Theory on May 30, 2019.

Theory Jobs 2018

2018-05-25T00:00:00+00:00

Here is a link to a crowdsourced spreadsheet created to collect information about theory jobs this year. Previously my academic uncle Lance Fortnow set it up (check this link to his post from last year which also has links to all the previous years), but this year he has kindly agreed to try and pass the baton. Rules for the spreadsheet have been copied from previous years and all edits to the document are anonymized.

Separate sheets for faculty, industry and postdocs/visitors. New: As suggested by Krzysztof Onak a new tab for sabbaticals was added.
People should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors.
You are welcome to add yourself, or people your department has hired.

This document will continue to grow as more jobs settle.

Theory Jobs 2018 was originally published by Grigory Yaroslavtsev at The Big Data Theory on May 25, 2018.

Center for Algorithms and Machine Learning

2018-03-11T00:00:00+00:00

This Friday we had the official kickoff event for the new Center for Algorithms and Machine Learning here at IU. Huge thanks to my wonderful co-director David Crandall whose wisdom and support have been instrumental in forming CAML!

This has been in the works since the day I accepted an offer from IU almost two years ago (some things at universities take longer than you might expect). Also thanks to our board members, both internal (Dirk van Gucht, Richard Shiffrin, Haixu Tang, Stanley Wasserman) and external (John Langford , Edo Liberty, Vahab Mirrokni , Maxim Sviridenko) for their advice and readiness to serve. Quite a few other people were involved behind the scenes, thanks to everyone!

Among other things we have a postdoc position open.

Photo credits: David Crandall and Michael S. Ryoo.

Center for Algorithms and Machine Learning was originally published by Grigory Yaroslavtsev at The Big Data Theory on March 11, 2018.

Postdoc at the Center for Algorithms and Machine Learning

2018-03-03T00:00:00+00:00

The Center for Algorithms and Machine Learning (CAML, http://caml.indiana.edu) at Indiana University Bloomington invites applications for a postdoctoral researcher position starting Fall 2018. Applicants should have a strong background in theoretical foundations of computing and algorithms for large data, including applications to machine learning and data science. Questions about the position can be directed to Professor Grigory Yaroslavtsev (gyarosla@iu.edu).

The position is initially for one year with a possibility of renewal for an additional year based on satisfactory job performance and continued funding. The CAML postdoctoral researcher will be located in the new Luddy Hall building and will receive a competitive salary and a comprehensive set of benefits. Bloomington is a vibrant college town known as the “Gateway to Scenic Southern Indiana.” Our campus, located within an hour from Indianapolis, is renowned for its music scene and cultural diversity.

Applications should be submitted by email to iu.caml.postdoc@gmail.com with your name as the subject (e.g. “Alan Turing”), and should include a CV, a research statement, and the names of at least 3 references. Please ask your references to send letters to the same address using (“Your Name Their Name Recommendation Letter”) as the subject (e.g. “Alan Turing Alonzo Church Recommendation Letter”). Applicants should also complete the formal application through the IU hiring system here: http://indiana.peopleadmin.com/postings/5583

Application deadline for full consideration (including recommendation letters): March 31. Applications will continue to be considered after the deadline until the position is filled.

Postdoc at the Center for Algorithms and Machine Learning was originally published by Grigory Yaroslavtsev at The Big Data Theory on March 03, 2018.

Workshops in 2018

2018-02-23T00:00:00+00:00

I got invited to talk at quite a few workshops this year and am close to reaching the limit of travel I can handle. Just want to help the organizers advertise these events (most of them on sublinear algorithms and complexity). Please, consider attending and help spread the word!

Workshop on Algorithms for Data Summarization at the University of Warwick, UK (March 19-22). Organized by Graham Cormode and Artur Czumaj.
68th Midwest Theory Day(s) at TTI-Chicago (April 12-13). Organized by Madhur Tulsiani, Aravindan Vijayaraghavan and Anindya De among others.
Workshop on Sublinear Algorithms (June 11-13) and 2nd Workshop on Local Algorithms (June 14-15) at MIT. WoLA is organized by Mohsen Ghaffari, Reut Levi, Moti Medina, Andrea Montanari, Elchanan Mossel and Ronitt Rubinfeld.
Workshop on Interactive Complexity at the Simons Institute, Berkeley (October 15-19). Organized by Kasper Green Larsen, Mark Braverman and Michael Saks.

Hope to see some of you there!

Workshops in 2018 was originally published by Grigory Yaroslavtsev at The Big Data Theory on February 23, 2018.

BeyondMR18 Deadline Approaching

2018-01-29T00:00:00+00:00

The deadline for submissions to BeyondMR’18 (5th Algorithms and Systems for MapReduce and Beyond Workshop) is in about 3 weeks. This workshop will be held in conjunction with SIGMOD/PODS. As a PC member I would personally like to stress the “Beyond” part as both theory and systems have by now gone way further than just MapReduce. Please, consider submitting your work – you will get feedback from a healthy mix of researchers and engineers from both academia and industry.

BeyondMR18 Deadline Approaching was originally published by Grigory Yaroslavtsev at The Big Data Theory on January 29, 2018.

What's New in the Big Data Theory 2017

2018-01-27T00:00:00+00:00

This year I will continue the tradition started last year and summarize a few papers on efficient algorithms for big data that caught my attention last year. Same disclaimers as last year apply and this is by no means supposed to be the list of “best” papers in the field which is quite loosely defined anyway (e.g. I will intentionally avoid deep learning and gradient descent methods here as I am not actively working in these areas myself and there are a lot of resources on these topics already). In particular, this year it was even harder to pick clear favorites so it is even more likely that I have missed some excellent work. Below I will assume familiarity with the basics of streaming algorithms and the massively parallel computation model (MPC) discussed in an earlier post.

Before we begin let me quickly plug some of my own work from last year. With my student Adithya Vadapalli we have a new paper “Massively Parallel Algorithms and Hardness of Single-Linkage Clustering under $\ell_p$-distances”. As it turns out, while single-linkage clustering and minimum spanning tree problems are the same for exact computation, for vector data the round complexity of approximating these two problems in the MPC model is quite different. In another paper I introduce a study of approximate binary linear sketching of valuation functions. This is an extension of our recent study of binary linear sketching to the case when the function of interest should only be computed approximately.

New Massively Parallel Algorithms for Matchings

The search for new algorithms for matchings has led to the development of new algorithmic ideas for many decades (motivating the study of the class P of polynomial-time algorithms) and this year is no exception. Two related papers on matchings caught my attention this year:

“Round Compression for Parallel Matching Algorithms” by Czumaj, Lacki, Madry, Mitrovic, Onak and Sankowski.
“Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs” by Assadi, Bateni, Bernstein, Mirrokni and Stein.

Both papers are highly technical but achieve similar results. The first paper gives an $O((\log \log |V|)^2)$-round MPC algorithm for the maximum matching problem that uses $O(|V|)$ memory per machine. The second paper improves the number of rounds down to $O(\log \log |V|)$ using slightly larger memory $O(|V| polylog (|V|))$ per machine. Using a standard reduction mentioned in the latter paper both papers can achieve multiplicative $(1+\epsilon)$-approximation for any constant $\epsilon > 0$. These results should be contrasted with the previous work by Lattanzi, Moseley, Suri and Vassilvitskii who give $1/c$-round algorithms at the expense of using $O(|V|^{1 + c})$ memory per machine for any constant $c > 0$. Overall, this is remarkable progress but likely not the end of the story.

Massively Parallel Methods for Dynamic Programming

Dynamic programming, pioneered by Bellman at RAND, is one of the key techniques in algorithm design. Some would even go as far as saying that there are only two algorithmic techniques and dynamic programming is one of them. However, dynamic programming is notoriously sequential and difficult to use for sublinear time/space computation. Most successful stories of speeding up dynamic programming so far have been problem-specific and often highly non-trivial.

In their paper “Efficient Massively Parallel Methods for Dynamic Programming” (STOC’17) Im, Moseley and Sun suggest a fairly generic approach for designing massively parallel dynamic programming algorithms. Three textbook dynamic programming problems can be handled within their framework:

Longest Increasing Subsequence: multiplicative $(1+\epsilon)$-approximation in $O(1/\epsilon^2)$ rounds of MPC.
Optimal Binary Search Tree: multiplicative $(1+\epsilon)$-approximation in $O(1)$ rounds of MPC.
Weighted Interval Scheduling: multiplicative $(1+\epsilon)$-approximation in $O(\log 1/\epsilon)$ rounds of MPC.

On a technical level this paper identifies two key properties that these problems have in common: monotonicity and decomposability. Monotonicity just requires that the answer to a subproblem should always be at most (for maximization) or at least (for minimization) the answer to the problem itself. Decomposability is more subtle and requires that the problem can be decomposed into a two-level recursive family of subproblems where entries of the top level are called groups and entries of the bottom level are called blocks. It should then be possible to 1) construct a nearly optimal solution for the entire problem by concatenating solutions for subproblems, 2) construct a nearly optimal solution for each group from only a constant number of blocks. While monotonicity holds for many standard dynamic programming problems, decomposability seems much more restrictive so it is interesting to see whether this technique can be extended to some other problems.
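To make the "notoriously sequential" point concrete, here is a minimal sketch of the textbook quadratic dynamic program for Longest Increasing Subsequence. This is the standard sequential algorithm, not the MPC method from the paper; the function name is mine and chosen for illustration only.

```python
def longest_increasing_subsequence(a):
    """Length of the longest strictly increasing subsequence, O(n^2) time."""
    n = len(a)
    # best[i] = length of the longest increasing subsequence ending at a[i].
    # Each entry depends on all earlier entries, which is what makes the
    # computation inherently sequential.
    best = [1] * n
    for i in range(n):
        for j in range(i):
            if a[j] < a[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)


if __name__ == "__main__":
    a = [3, 1, 4, 1, 5, 9, 2, 6]
    print(longest_increasing_subsequence(a))
    # Monotonicity in action: a prefix never has a longer answer than the whole input.
    print(longest_increasing_subsequence(a[:4]) <= longest_increasing_subsequence(a))
```

Monotonicity is easy to see here since dropping elements can only shorten the best subsequence; decomposability is the part that needs the machinery of the paper.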

See Ben Moseley’s presentation at the Midwest Theory Day for more details.

Randomized Composable Coresets for Matching and Vertex Cover

The simplest massively parallel algorithm one can think of can be described as follows: partition the data between $k$ machines, let each machine select a small subset of the data points, collect these locally selected data points on one central machine and compute the solution there. The hardest part here is the design of the local subset selection procedures. Such subsets are called coresets and have received a lot of attention in the study of algorithms for high-dimensional vectors, see e.g. this survey by Agarwal, Har-Peled and Varadarajan.

Note that in the distributed setting construction of coresets can be affected by the initial distribution of data. In fact, for the maximum matching problem non-trivially small coresets can’t be used to approximate the maximum matching up to a reasonable error (see our paper with Assadi, Khanna and Li for the exact statement which in fact rules out not just coresets but any small-space representations) if no assumptions about the distribution of the data are made.

However, if the initial distribution of data is uniformly random then the situation changes quite dramatically. As shown in “Randomized Composable Coresets for Matching and Vertex Cover” by Assadi and Khanna (best paper at SPAA’17) for uniformly distributed data coresets of size $O(|V| polylog |V|)$ can be computed locally and then combined to obtain $O(polylog |V|)$-approximation.
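The partition / summarize / combine pattern above can be sketched in a few lines. This is only an illustration under my own assumptions: each simulated machine sends a greedy maximal matching of its random share of the edges as its summary, which is a stand-in chosen for brevity and not the coreset construction analyzed in the paper. The names `greedy_maximal_matching` and `coreset_matching` are hypothetical.

```python
import random


def greedy_maximal_matching(edges):
    """Scan edges once, keeping every edge whose endpoints are both still free."""
    matched = set()
    matching = []
    for u, v in edges:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


def coreset_matching(edges, k, seed=0):
    """Simulate k machines on a uniformly random partition of the edges."""
    rng = random.Random(seed)
    parts = [[] for _ in range(k)]
    for e in edges:
        parts[rng.randrange(k)].append(e)  # random initial distribution of the data
    # Each machine sends only a small summary (at most |V|/2 edges) of its part.
    summaries = [greedy_maximal_matching(p) for p in parts]
    union = [e for s in summaries for e in s]
    # The central machine solves the problem on the union of the summaries.
    return greedy_maximal_matching(union)


if __name__ == "__main__":
    path = [(i, i + 1) for i in range(9)]
    print(coreset_matching(path, k=3))
```

The point of the sketch is the communication pattern: no machine ever sees more than its own part plus, on the central machine, $k$ summaries of size $O(|V|)$.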

Optimal Lower Bounds for L_p-Samplers, etc

Consider the following problem: a vector $x \in \{0,1\}^n$ (initially consisting of all zeros) is changed by a very long sequence of updates that can flip an arbitrary coordinate of this vector. After seeing this sequence of updates can we retrieve some non-zero entry of this vector without storing all $n$ bits used to represent $x$? Surprisingly, the answer is “yes” and this can be done with only $O(poly(\log n))$ space. If we are required to generate a uniformly random non-zero entry of $x$ then the corresponding problem is called $L_0$-sampling.

$L_0$-sampling turns out to be a remarkably useful primitive in the design of small-space algorithms. Almost all known streaming algorithms for dynamically changing graphs are based on $L_0$-sampling or its relaxation where the uniformity requirement is removed.

While almost optimal upper and lower bounds on the amount of space necessary for $L_0$-sampling have been known since the work of Jowhari, Saglam and Tardos, there were still gaps in terms of the dependence on success probability. If our recovery of a non-zero entry of $x$ has to be successful with probability $1 - \delta$ then the tight bound on space turns out to be $\Theta(\min(n, \log (1/\delta) \log^2 (\frac{n}{\log 1/\delta})))$. This is one of the results of the recent FOCS’17 paper by Kapralov, Nelson, Pachocki, Wang, Woodruff and Yahyazadeh.
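For intuition, here is a toy sketch of how a non-zero coordinate can be recovered from a stream of flips using only a few counters per level. It follows the usual recipe (subsample coordinates at geometrically decreasing rates and run 1-sparse recovery at each level), but everything specific is my assumption for illustration: the class name `L0Sketch`, the use of `blake2b` as a stand-in for the limited-independence hash functions that the analysis requires, and the XOR fingerprints. It implements the relaxed version without the uniformity guarantee and makes no claim to match the space bounds above.

```python
import hashlib


def _hash64(tag, i):
    """Deterministic 64-bit hash of coordinate i (stand-in for a proper hash family)."""
    digest = hashlib.blake2b(f"{tag}:{i}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class L0Sketch:
    def __init__(self, n, seed=0):
        self.n = n
        self.seed = seed
        self.levels = n.bit_length() + 1
        # Per level: XOR of (index + 1) and XOR of fingerprints of the
        # currently non-zero coordinates that survive to that level.
        self.index_xor = [0] * self.levels
        self.print_xor = [0] * self.levels

    def _level(self, i):
        # Number of trailing zero bits of a hash: level >= j with probability 2^-j.
        h = _hash64(f"level-{self.seed}", i)
        trailing_zeros = (h & -h).bit_length() - 1 if h else 64
        return min(trailing_zeros, self.levels - 1)

    def _fingerprint(self, i):
        return _hash64(f"fp-{self.seed}", i)

    def flip(self, i):
        """Process one update that flips coordinate i of x."""
        for j in range(self._level(i) + 1):
            self.index_xor[j] ^= i + 1
            self.print_xor[j] ^= self._fingerprint(i)

    def sample(self):
        """Return some non-zero coordinate of x, or None if recovery fails."""
        for j in range(self.levels - 1, -1, -1):
            cand = self.index_xor[j] - 1
            # If exactly one non-zero coordinate survives at level j, the XOR of
            # indices is that coordinate; the fingerprint check guards against
            # levels holding several survivors.
            if (0 <= cand < self.n and self._level(cand) >= j
                    and self.print_xor[j] == self._fingerprint(cand)):
                return cand
        return None


if __name__ == "__main__":
    sketch = L0Sketch(16)
    for i in (5, 7, 7):
        sketch.flip(i)
    print(sketch.sample())
```

Note that the sketch is linear over $\mathbb F_2$: every stored word is an XOR over the current support of $x$, so flipping a coordinate twice cancels out exactly.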

Looking forward to more results in 2018!

Please, let me know if there are any other interesting papers that I missed. Also, a quick shout-out goes to some other papers that were close to making the above list:

“Affinity Clustering: Hierarchical Clustering at Scale” by Bateni, Behnezhad, Derakhshan, Hajiaghayi, Kiveris, Lattanzi and Mirrokni (NIPS’17).
“LSH Forest: Practical Algorithms Made Theoretical” by Andoni, Razenshteyn and Shekel Nosatzki (SODA’17).
“Almost Optimal Streaming Algorithms for Coverage Problems” by Bateni, Esfandiari and Mirrokni (SPAA’17).
“Compressed Sensing using Generative Models” by Bora, Jalal, Price and Dimakis (ICML’17).
“MST in O(1) Rounds of Congested Clique” by Jurdzinski and Nowicki (SODA’18).

What's New in the Big Data Theory 2017 was originally published by Grigory Yaroslavtsev at The Big Data Theory on January 27, 2018.

Video Recording Screencasts of Talks

2017-08-01T00:00:00+00:00

Here is the setup I came up with for video recording live lectures and talks.

Hardware:

Laptop with a camera: MacBook
Tablet with a pen/stylus: iPad (Pro), MSRP $1K
Microphone: Sennheiser ME 3-EW, MSRP $200
Projector (connected to the laptop)

Software:

Open Broadcasting Software (laptop), free
Microsoft Powerpoint and Office 365 (tablet), free from IU
AirServer (laptop) and AirServer Connect (tablet), $10

Full description and demo on Youtube:

Video Recording Screencasts of Talks was originally published by Grigory Yaroslavtsev at The Big Data Theory on August 01, 2017.

Advice on video recording lectures?

2017-07-31T00:00:00+00:00

Looking for advice: I am thinking of making videos of my class this Fall, the setup I am thinking about is a tablet (just ordered a large iPad Pro) + mic and camera + some software (maybe Google Hangouts live?) to record a screencast of the tablet with my video on the side. Did anyone have experience doing something like this? What hardware and software did you use? If this sounds too clunky to set up on your own, is it worth creating a MOOC and/or having professionals do the recording for you?

Advice on video recording lectures? was originally published by Grigory Yaroslavtsev at The Big Data Theory on July 31, 2017.

Theory Jobs 2017

2017-06-08T00:00:00+00:00

UPD: Lance created his Theory Jobs spreadsheet, I’ve moved all the information there and changed the link below to Lance’s spreadsheet.

Here is a link to a crowdsourced spreadsheet created by Lance Fortnow that collects information about theory jobs this year. Previously Lance set it up, but this year it is getting late in the year so I decided to go ahead and create one myself. In previous years the jobs post was up a few weeks back so I hope I am not jumping the gun here. Rules for the spreadsheet have been copied from Lance's post from last year and all edits to the document are anonymized.

Separate sheets for faculty, industry and postdoc/visitors.
People should be connected to theoretical computer science, broadly defined.
Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors.
You are welcome to add yourself, or people your department has hired.

This document will continue to grow as more jobs settle.

Theory Jobs 2017 was originally published by Grigory Yaroslavtsev at The Big Data Theory on June 08, 2017.

67th Midwest Theory Day

2017-04-30T00:00:00+00:00

67th Midwest Theory ~~Day~~ Weekend took place two weeks ago here at Indiana University, Bloomington organized by Qin Zhang, Yuan Zhou and myself. We were lucky to have a flurry of fantastic speakers from the Midwest and three external headliners: Alex Andoni, Vahab Mirrokni and Ryan O’Donnell. Slides are now posted online.

As organizers we’ve decided to experiment with the format of this MTD making it a 2-day event. Based on feedback we received we believe that this format has worked out perfectly. Because the Midwest is geographically spread out, travel logistics often make the overhead of attending a one-day event too high for many of those interested in coming.

67th Midwest Theory Day was originally published by Grigory Yaroslavtsev at The Big Data Theory on April 30, 2017.

What's New in the Big Data Theory 2016

2016-12-30T00:00:00+00:00

This post will give an overview of papers on theory of algorithms for big data that caught my attention in 2016. The basic rule that I used when making the list was whether I can see these results being included into some of the advanced graduate classes on algorithms in the future. Also, while I obviously can’t include my own results here, among my own 2016 papers my two personal favorites are tight bounds on space complexity of computing approximate matchings in dynamic streams (with S. Assadi, S. Khanna and Y. Li) and the $\mathbb F_2$-sketching paper (with S. Kannan and E. Mossel and some special credit to Swagato Sanyal who subsequently improved the dependence on error in one of our main theorems).

It’s been a great year with several open problems resolved, old algorithms improved and new lines of research started. All papers discussed below are presented in no particular order and their selection is clearly somewhat biased towards my own research interests.

Maximum Weighted Matching in Semi-Streaming

Sweeping both the best paper and the best student paper awards at the upcoming 28th ACM-SIAM Symposium on Discrete Algorithms is a paper on semi-streaming algorithms for maximum weighted matching by graduate students Ami Paz and Gregory Schwartzman. In semi-streaming we are given one pass over the edges of an $n$-vertex graph and only $\tilde O(n)$ bits of space. It is easy to get a 2-approximation to the maximum matching by just maintaining a maximal matching of the graph. However, for weighted graphs a maximal matching no longer guarantees a 2-approximation.
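The easy unweighted baseline fits in a few lines. The sketch below is the standard greedy one-pass algorithm, not the algorithm from the paper; the function name and the `(u, v, weight)` stream format are my own choices for illustration. Weights are carried along only to show where greedy goes wrong.

```python
def streaming_maximal_matching(edge_stream):
    """One pass over (u, v, weight) triples; stores at most n/2 edges."""
    matched = set()
    matching = []
    for u, v, w in edge_stream:
        # Keep an edge iff both endpoints are still unmatched; weights are ignored.
        if u != v and u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching


if __name__ == "__main__":
    # A light edge arriving first blocks the heavy edge that shares a vertex with it.
    stream = [("a", "b", 1), ("b", "c", 100)]
    kept = streaming_maximal_matching(stream)
    print(kept, sum(w for _, _, w in kept))
```

On this two-edge stream the greedy matching has weight 1 while the optimum has weight 100, so the gap for weighted graphs can be made arbitrarily large.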

A long line of work has previously given constant factor approximations for this problem and finally we have a $(2+\epsilon)$-approximation. It is achieved via a careful implementation of the primal-dual algorithm for matchings in the semi-streaming setting. It may seem somewhat surprising that primal-dual hasn’t been applied to this problem before since in the area of approximation algorithms it is a pretty standard way of reducing weighted problems to their unweighted versions, but the exact details of how to implement primal-dual in the streaming setting are quite delicate. I couldn’t find a version of this paper online so the best bet might be to wait for the SODA proceedings.

Now the big open question is whether one can beat the 2-approximation which is open even in the unweighted case.

Shuffles and Circuits

Best paper award at the 28th ACM Symposium on Parallelism in Algorithms and Architectures went to “Shuffles and Circuits”, a paper by Roughgarden, Vassilvitskii and Wang. This paper emphasizes the difference between rounds of MapReduce and depth of a circuit. Because some of the machines can choose to stay silent between the rounds, a round of MapReduce can be more complex than a layer of a circuit as the machines sending input to the next round might depend on the original input data. The paper shows that nevertheless the standard circuit complexity “degree bound” can be applied to MapReduce computation. That is, any Boolean function whose polynomial representation has degree $d$ requires $\Omega(\log_s d)$ rounds of MapReduce using machines with space $s$. This implies an $\Omega(\log_s n)$ lower bound on the number of rounds for computing connectivity of a graph. The authors also make explicit a connection between the MapReduce model and $NC^1$ (see definition here) which implies that improving lower bounds beyond $\log_s n$ for polynomially many machines would imply separating $P$ from $NC^1$.
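A quick sanity check on what the $\log_s n$ bound gives (an illustrative calculation; the choices of $s$ below are assumptions, with constants $0 < \epsilon < 1$ and $c > 0$):

$$s = n^{\epsilon}: \quad \log_s n = \frac{\log n}{\epsilon \log n} = \frac{1}{\epsilon}, \qquad\qquad s = \log^{c} n: \quad \log_s n = \frac{\log n}{c \log \log n}.$$

So with polynomially large space per machine the bound only forces a constant number of rounds, and it becomes super-constant only when the space per machine is very small.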

Beating Counting Sketches for Insertion-Only Streams

Both CountSketch and Count-Min Sketch, which are textbook approximate data structures for storing very large dynamically changing numerical tables in small space, have been improved this year under the assumption that data in the table is only incremented. These improvements are for the most common application of such sketches, “heavy hitters”: the task of approximately recovering the largest entries of the table. For CountSketch see the paper by Braverman, Chestnut, Ivkin, Woodruff from STOC’16 and for Count-Min Sketch the paper by Bhattacharyya, Dey and Woodruff from PODS’16.
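As a reminder of the baseline being improved, below is a minimal sketch of the textbook Count-Min Sketch for increment-only updates. It is not the improved algorithm from either paper. The class name, the choice of hash family and the parameters `width` and `depth` are assumptions made for illustration.

```python
import random
from typing import List


class CountMinSketch:
    """Textbook Count-Min Sketch for insertion-only (non-negative) updates."""

    _PRIME = (1 << 61) - 1  # Mersenne prime used by the illustrative hash family

    def __init__(self, width: int, depth: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.rows: List[List[int]] = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row for h(x) = ((a*x + b) mod p) mod width.
        self.coeffs = [(rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
                       for _ in range(depth)]

    def _bucket(self, row: int, key: int) -> int:
        a, b = self.coeffs[row]
        return ((a * key + b) % self._PRIME) % self.width

    def add(self, key: int, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._bucket(r, key)] += count

    def estimate(self, key: int) -> int:
        # With increments only, every row overestimates, so take the minimum.
        return min(self.rows[r][self._bucket(r, key)] for r in range(self.depth))


cms = CountMinSketch(width=64, depth=4)
cms.add(7, 1000)                  # one heavy key
for light_key in range(100, 200):
    cms.add(light_key)            # one hundred light keys
heavy_estimate = cms.estimate(7)
```

The estimate never undercounts, and it can overcount by at most the total weight of the other keys, so here it lies between 1000 and 1100.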

Optimality of the Johnson-Lindenstrauss Transform

Two papers by Green Larsen and Nelson and by Alon and Klartag have resolved the question of proving optimality of the Johnson-Lindenstrauss transform. Based on projecting onto a random low-dimensional subspace, the JL-transform is the main theoretical tool for dimensionality reduction of high-dimensional vectors. As these papers show, no low-dimensional embedding and furthermore no data structure can achieve better bit complexity than $\Theta(n \log n/\epsilon^2)$ for $(1 \pm \epsilon)$-approximating all pairwise distances between $n$ vectors in Euclidean space (for a certain regime of parameters). This matches the Johnson and Lindenstrauss upper bound and improves an old lower bound of $\Omega\left(\frac{n \log n}{ \epsilon^2 \log 1/\epsilon}\right)$ due to Alon. Even though Alon’s argument is significantly simpler, getting an optimal lower bound is a very nice achievement.
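To illustrate the upper bound side, here is a small pure-Python sketch of a JL-style map. It uses a matrix of i.i.d. Gaussian entries, a common variant swapped in here for the projection onto a random subspace. The dimensions `d` and `k`, the seed and the helper names are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def jl_matrix(k: int, d: int, rng: random.Random) -> List[Vector]:
    """k x d matrix with i.i.d. N(0, 1/k) entries."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix: Sequence[Vector], x: Sequence[float]) -> Vector:
    """Matrix-vector product, mapping a d-dimensional x to k dimensions."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]


def distance(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(2016)
d, k = 1000, 300
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
matrix = jl_matrix(k, d, rng)
images = [project(matrix, x) for x in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [distance(images[i], images[j]) / distance(points[i], points[j])
          for i in range(4) for j in range(i + 1, 4)]
```

With these parameters all six ratios should be close to 1; the distortion shrinks roughly like $1/\sqrt{k}$ as the target dimension grows.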

Fast Algorithm for Edit Distance if It's Small

Edit distance is one of the cornerstone metrics of text similarity in computer science. It can be computed in quadratic time using standard dynamic programming, which is optimal assuming SETH due to the result of Backurs and Indyk. Edit distance also has a number of applications including comparing DNA sequences in computational biology. In these applications it is usually reasonable to assume that edit distance is only interesting if it is not too large. Unfortunately, this doesn’t help speed up the standard dynamic program. A series of papers, including two papers from this year by Chakraborty, Goldenberg and Koucky (STOC’16) and Belazzougui and Zhang, led to the following result: sketches of size $poly(K \log n)$ bits suffice for computing edit distance $\le K$. Such sketches can be applied not just in centralized but also in distributed and streaming settings, making it possible to compress input strings down to a size that (up to logarithmic factors) only depends on $K$.
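For reference, the standard quadratic-time dynamic program mentioned above can be sketched as follows. This is the classical baseline, not the sketching algorithm from the papers, and the function name is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard O(len(a) * len(b)) dynamic program."""
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


kitten_distance = edit_distance("kitten", "sitting")
```

Only two rows of the table are kept at a time, so the space is linear in the length of the second string even though the running time is quadratic.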

Tight Bounds for Set Cover in Streaming

Set Cover is a surprisingly powerful abstraction for a lot of applications that involve providing coverage for some set of terminals. Given a collection of sets $S_1, \dots, S_m \subseteq [n]$ the goal is to find the smallest cardinality subcollection of these sets such that their union is $[n]$, i.e. all of the underlying elements are covered. In approximation algorithms a celebrated greedy algorithm gives an $O(\log n)$-approximation for this problem. In streaming there has been a lot of interest lately in approximating classic combinatorial optimization problems in small space with Set Cover being one of the main examples. For an overview from last year check Piotr Indyk’s talk from the DIMACS Workshop on Big Data and Sublinear Algorithms.
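The greedy algorithm mentioned above can be sketched as follows. This is the offline version with all sets held in memory, so it says nothing about streaming space. The function name and the dictionary representation of the input are assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_set_cover(universe: Set[int], sets: Dict[str, Set[int]]) -> List[str]:
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


example_sets = {
    "S1": {1, 2, 3},
    "S2": {4, 5, 6},
    "S3": {1, 4},
    "S4": {2, 5},
    "S5": {3, 6},
}
cover = greedy_set_cover({1, 2, 3, 4, 5, 6}, example_sets)
```

On this instance greedy picks `S1` and then `S2`, which happens to be optimal; in general it is only guaranteed to be within an $O(\log n)$ factor of the optimum.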

As this STOC’16 paper by Assadi, Khanna and Li shows, savings in space for streaming Set Cover can only be proportional to the loss in approximation. In particular, if we are interested in computing Set Cover which is within a multiplicative factor $\alpha$ of the optimum then: 1) for computing the cover itself space $\tilde \Theta(mn/\alpha)$ is necessary and sufficient, 2) for just estimating the size space $\tilde \Theta(mn/\alpha^2)$ is necessary and sufficient.

Polynomial Lower Bound for Monotonicity Testing

Finally a polynomial lower bound has been shown for adaptive algorithms for testing monotonicity of Boolean functions $f \colon \{0,1\}^n \rightarrow \{0,1\}$. The lower bound implies that any algorithm that can tell whether $f$ is monotone or differs from monotone on a constant fraction of inputs has to query at least $\tilde \Omega(n^{1/4})$ values of $f$. This result is due to Belovs and Blais (STOC’16) and is in contrast with the upper bound of $\tilde O(\sqrt{n})$ by Khot, Minzer and Safra from last year’s FOCS. Probably the biggest result in property testing this year.
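To make the query model concrete, here is a minimal sketch of the classical edge tester, which samples random edges of the hypercube and looks for a violation. It is neither the Khot-Minzer-Safra algorithm nor the construction behind the lower bound. The function name and the tuple encoding of inputs are assumptions.

```python
import random
from typing import Callable, Optional, Tuple

Point = Tuple[int, ...]
BoolFn = Callable[[Point], int]


def edge_tester(f: BoolFn, n: int, queries: int,
                rng: random.Random) -> Optional[Tuple[Point, int]]:
    """Return (x, i) with x_i = 0 and f(x) > f(x with x_i = 1), or None if none is found."""
    for _ in range(queries // 2):        # each sampled edge costs two queries to f
        i = rng.randrange(n)
        low = [rng.randrange(2) for _ in range(n)]
        low[i] = 0
        high = list(low)
        high[i] = 1
        if f(tuple(low)) > f(tuple(high)):
            return tuple(low), i         # a witness that f is not monotone
    return None


rng = random.Random(2016)
majority = lambda x: int(2 * sum(x) > len(x))   # monotone
anti_dictator = lambda x: 1 - x[0]              # violates monotonicity on every edge in direction 0

monotone_result = edge_tester(majority, 5, 400, rng)
violation = edge_tester(anti_dictator, 4, 400, rng)
```

The tester has one-sided error: a monotone function is never rejected, while a far-from-monotone one is caught only with some probability that depends on the number of sampled edges.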

Linear Hashing is Awesome

While “Linear Hashing is Awesome” by Mathias Bæk Tejs Knudsen doesn’t fall into the traditional “sublinear algorithms for big data” category, this paper still has some sublinear flavor because of its focus on very fast query times. Linear hashing is a classic hashing scheme $h(x) = ((ax + b) \bmod p) \bmod m$ where $a,b$ are random. It is very often used in practice and discussed extensively in CLRS. This paper proves that linear hashing ~~is awesome~~ results in expected length of the longest chain of only $O(n^{1/3})$ compared to the previous simple bound of $O(\sqrt{n})$.
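A minimal sketch of the scheme and of the quantity being bounded is given below. The particular prime, the choice of $a \ne 0$, the key set and the helper names are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, Iterable

MERSENNE_PRIME = (1 << 31) - 1   # illustrative choice of p, larger than every key


def make_linear_hash(m: int, rng: random.Random,
                     p: int = MERSENNE_PRIME) -> Callable[[int], int]:
    """Draw random a, b and return h(x) = ((a*x + b) mod p) mod m."""
    a = rng.randrange(1, p)   # assumption: a is drawn from {1, ..., p-1}
    b = rng.randrange(p)

    def h(x: int) -> int:
        return ((a * x + b) % p) % m

    return h


def longest_chain(keys: Iterable[int], h: Callable[[int], int]) -> int:
    """Size of the fullest bucket when the keys are hashed with chaining."""
    return max(Counter(h(x) for x in keys).values())


rng = random.Random(2016)
n = 1000
keys = [7 * i for i in range(n)]
h = make_linear_hash(m=n, rng=rng)
chain = longest_chain(keys, h)
```

Repeating the experiment over many draws of `a` and `b` and averaging `chain` gives an empirical view of the expectation that the paper bounds.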

Finally, this paper also decisively wins my “Best Paper Title 2016” award.

Looking forward to more cool results in 2017!

There have been a lot of great results in 2016. It’s hard to mention all of them in one post, and I certainly might have missed some exciting papers. Here is a quick shout out to some other papers that were close to making the above list:

Tight Bounds for Data-Dependent LSH by Andoni and Razenshteyn from SoCG'16.
Optimal Quantile Estimation in Streams by Karnin, Lang and Liberty from FOCS'16.

Happy 2017!

What's New in the Big Data Theory 2016 was originally published by Grigory Yaroslavtsev at The Big Data Theory on December 30, 2016.

The Binary Sketchman

2016-10-07T00:00:00+00:00

In this post I will talk about some of my recent work with Sampath Kannan and Elchanan Mossel on linear methods for binary data compression. The paper is available here, slides from my talk at Penn are here and another talk at Columbia is coming up on Nov 21.

Given very large data represented in binary format as a string of length $n$, i.e. $x \in \{0,1\}^n$ we are interested in a compression algorithm that can transform $x$ into a much shorter binary string $y \in \{0,1\}^k$. Here $k \ll n$ so that we can achieve some non-trivial savings in space. Moreover, if $x$ changes in the future we would like to be able to update our compressed version of it (without having to store the original $x$).

Clearly compression introduces some loss making it impossible to recover certain properties of the original data from the compressed string. However, if we know in advance which property of $x$ we are interested in then efficient compression often becomes possible. We will model the property of interest as a binary function $f:\{0,1\}^n \rightarrow \{-1,1\}$ which labels all possible $x$’s with two labels. So our goal will be to be able to: 1) perform this binary classification, i.e. compute $f(x)$ using compressed data $y$ only, 2) do this even if $x$ changes over time – updates for us will be bit flips in the coordinates of $x$ specified by the index of the bit that is getting flipped.

Finally, if $x$ is so big that it can’t be stored locally and has to be divided into chunks stored across multiple machines then we will be able to compress the chunks locally and then combine them on a central server into a compressed version of the entire data – one simple round of MapReduce or whatever your favorite distributed framework is.

To make the above discussion less abstract let’s consider a machine learning application – evaluating a linear classifier over binary data. Let’s say we have trained a linear classifier of the form $sign(\sum_{i = 1}^n w_i x_i - \theta)$ where sign is the sign function. Is it possible to compress $x$ in such a way that we can still evaluate our classifier in the scenarios described above? Turns out we can compress the input down to $O(\theta/m \log (\theta/m))$ bits where $m$ is a parameter of the linear classifier known as its margin. Moreover, no compression scheme can do better.

Introducing the Binary Sketchman

While the setting described above may seem quite challenging it can be handled through a framework of linear sketching. In the binary case the interpretation of linear sketching is particularly simple as our binary sketchman is just going to compute $k$ parities of the bits of $x$, say for $k=3$:

\[x_4 \oplus x_2, \quad x_{42}, \quad x_{566} \oplus x_{610} \oplus x_{239} \oplus x_{57}.\]

In matrix form this corresponds to computing $Mx$ where $M$ is a $k \times n$ binary matrix and the operations are performed over $\mathbb F_2$. Note that now our sketch easily satisfies all the requirements above, since as $x$ changes we can just update the corresponding parities. In the distributed case we can compute them locally and then add them up on a central server.
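The update and merge properties can be sketched in a few lines. The class below stores $M$ column by column as bitmasks, so that flipping bit $i$ of $x$ is a single XOR. The class name, the 0-indexed toy parities and the two-machine split are assumptions made for illustration.

```python
from typing import List


class ParitySketch:
    """Linear sketch y = Mx over F_2, with y packed into the bits of one integer."""

    def __init__(self, rows: List[List[int]], n: int) -> None:
        # rows[r] lists the coordinates whose parity the r-th sketch bit tracks.
        self.columns = [0] * n            # columns[i] = bitmask of rows containing i
        for r, coords in enumerate(rows):
            for i in coords:
                self.columns[i] ^= 1 << r
        self.value = 0                    # sketch of the all-zeros string

    def flip(self, i: int) -> None:
        """Update the sketch after bit i of x is flipped."""
        self.value ^= self.columns[i]


# Three parities over n = 8 bits: x_4 + x_2, x_7, and x_0 + x_1 + x_3 + x_5.
rows = [[4, 2], [7], [0, 1, 3, 5]]

central = ParitySketch(rows, n=8)
for i in (2, 7, 5):                       # x now has ones exactly at positions 2, 5, 7
    central.flip(i)

# Distributed version: machine A owns coordinates 0..3, machine B owns 4..7.
machine_a = ParitySketch(rows, n=8)
machine_a.flip(2)
machine_b = ParitySketch(rows, n=8)
machine_b.flip(7)
machine_b.flip(5)
combined = machine_a.value ^ machine_b.value   # addition over F_2 is XOR
```

Both routes give the same three parities, and neither needs to store $x$ itself.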

Unfortunately, the power of a deterministic sketchman who just uses a fixed set of parities is quite limited, and no such sketchman can compress even a simple linear classifier down to fewer than $n$ bits. In fact, even for the OR function $f = x_1 \vee x_2 \vee \dots \vee x_n$ no deterministic sketch can have fewer than $n$ bits. So our binary sketchman will “unleash the power of randomization” in his quest for a perfect sketch. According to Bernhard Haeupler this can be quite dramatic and looks kind of like this:

So our sketchman will instead pick the matrix $M$ randomly while the rest is the same as before. Now the OR function is easy to handle: pick a parity over a random subset of $\{1, \dots, n\}$ where each coordinate is included with probability $1/2$. If $OR(x) = 1$ then this parity catches a non-zero coordinate of $x$ with probability $1/2$ and thus evaluates to $1$ with probability at least $1/4$. If $OR(x) = 0$ then the parity never evaluates to $1$ so we can distinguish the two cases with probability $1 - \delta$ using $O(\log 1/\delta)$ such parities. This illustrates a more general idea – if $f$ is a constant function on all but $m$ different inputs then a sketch of size $O(\log m + \log 1/\delta)$ suffices.
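The randomized sketch for OR described above can be sketched as follows, representing $x$ as an $n$-bit integer. The helper names and the choice of $k = 64$ parities are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def random_parity_masks(n: int, k: int, rng: random.Random) -> List[int]:
    """k random subsets of {0, ..., n-1}; each coordinate is included with probability 1/2."""
    return [rng.getrandbits(n) for _ in range(k)]


def or_sketch(x_bits: int, masks: Sequence[int]) -> List[int]:
    """Parity of x restricted to each random subset."""
    return [bin(x_bits & mask).count("1") % 2 for mask in masks]


def or_from_sketch(sketch: Sequence[int]) -> int:
    """Answer 1 iff some parity is 1; the all-zeros input is never misclassified."""
    return int(any(sketch))


rng = random.Random(2016)
n, k = 1000, 64
masks = random_parity_masks(n, k, rng)

zero_answer = or_from_sketch(or_sketch(0, masks))
sparse_answer = or_from_sketch(or_sketch(1 << 123, masks))   # a single non-zero coordinate
```

The error is one-sided: a zero input always yields 0, and a non-zero input is missed only if all $k$ independent parities evaluate to 0.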

Now for linear thresholds the high-level ideas behind this sketching process are as follows: 1) observe that any linear threshold function takes the same value on all but $n^{O(\theta/m)}$ inputs, 2) apply the same argument as above to obtain a sketch of size $O(\theta/m \log n + \log 1/\delta)$. The only thing missing in the above argument is that we still have dependence on $n$. This can be avoided if we first hash the domain reducing its size down to $n' = poly(\theta/m)$ which replaces $n$ in the above calculations giving us $O(\theta/m \log \theta/m + \log 1/\delta)$. While this compression method is quite simple the remarkable fact is that it can’t be improved. Even for the simplest threshold function that corresponds to a threshold for the Hamming weight of $x$, i.e. $sign(\sum_{i = 1}^n x_i - k)$, any compression mechanism would require $\Omega(k \log k)$ bits as follows from this work by Dasgupta, Kumar and Sivakumar. Note that it isn’t assumed that the protocol is based on linear sketching – it can be an arbitrary scheme.
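To make the arithmetic in the argument above explicit, write $N$ for the number of inputs on which the threshold function deviates from its majority value (a symbol introduced here only for illustration). Then

\[\log N = \log n^{O(\theta/m)} = O\left(\frac{\theta}{m} \log n\right) \quad\Longrightarrow\quad O(\log N + \log 1/\delta) = O\left(\frac{\theta}{m} \log n + \log \frac{1}{\delta}\right).\]

After hashing, $n$ is replaced by $n' = poly(\theta/m)$, so $\log n' = O(\log (\theta/m))$ and the bound becomes $O\left(\frac{\theta}{m} \log \frac{\theta}{m} + \log \frac{1}{\delta}\right)$.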

The Power of Randomized Binary Sketchman

Linear sketching by itself is not a new idea and has been studied extensively in the last two decades. See surveys by Woodruff and McGregor on how it can be applied to problems in numerical linear algebra and graph compression. However, that line of work focuses on linear sketching over large finite fields (used to represent real values with bounded precision). Nevertheless some striking results are known about linear sketching that are applicable in our context as well. In particular, if $x$ is updated through a very long (triply exponential in $n$) stream of adversarial updates then linear sketches over finite fields are optimal for any function $f$ as shown by Li, Nguyen and Woodruff here in STOC’14.

As our paper shows, the same result holds for much shorter random streams of length $\tilde O(n)$ in a simple model where each update flips a coordinate of $x$ chosen uniformly at random. In other words, binary sketching is optimal if at the end of the stream the input $x$ is uniformly distributed. The proof of this fact is quite technical and relies on a notion of approximate Fourier dimension for Boolean functions that we use to characterize binary sketching under the uniform distribution – check the paper for details if you are interested. Whether the same result holds for short (length $\tilde O(n)$, say) adversarial streams is the main question left open.

The Binary Sketchman was originally published by Grigory Yaroslavtsev at The Big Data Theory on October 07, 2016.

Teaching Foundations of Data Science

2016-08-27T00:00:00+00:00

This week I started teaching a graduate class called “Foundations of Data Science” that will be mostly based on an eponymous book by Avrim Blum, John Hopcroft and Ravi Kannan. The book is still a draft and I am using this version. Target audience includes advanced undergraduate and graduate level students. We had some success using this book as a core material for an undergraduate class at Penn this Spring (link to the news article). The draft has been around for a while and in fact I ran a reading group that used it four years back when I was in grad school and the book was called “Computer Science Theory for the Information Age”.

“Data Science” is one of those buzzwords that can mean very different things to different people. In particular, a new graduate Masters program in Data Science here at IU attracts hundreds of students from diverse backgrounds. What I personally really like about the Blum-Hopcroft-Kannan book is that it doesn’t go into any philosophy about the meaning of data science but rather offers a collection of mathematical tools and topics that can be considered foundational for data science as seen from a computer science perspective. It should be noted that just as any “Foundations of Computing” class has little to do with finding bugs in your code, so do this class and book have little to do with data cleaning and other data analysis routines.

Topics

While the jury is still out on what topics should be considered as fundamental for data science I think that the Blum-Hopcroft-Kannan book makes a good first step in this direction.

Let’s look at the table of contents:

Chapter 2 introduces basic properties of the high-dimensional space, focusing on concentration of measure, properties of high-dimensional Gaussians and basic dimension reduction.
Chapter 3 covers the Singular Value Decomposition (SVD) and its applications (principal component analysis, clustering mixture of Gaussians, etc.).
Chapter 4 focuses on random graphs (primarily in the Erdos-Renyi model).
Chapter 5 introduces random walks and Markov chains, including Markov Chain Monte Carlo methods, random walks on graphs and applications such as PageRank.
Chapter 6 covers the very basics of machine learning theory, including learning basic function classes, perceptron algorithm, regularization, kernelization, support vector machines, VC-dimension bounds, boosting, stochastic gradient descent and a bunch of other topics.
Chapter 7 describes a couple of streaming and sampling methods for big data: frequency moments in streaming and matrix sampling.
Chapter 8 is about clustering methods: k-means, k-center, spectral clustering, cut-based clustering, etc.
Chapters 9 through 11 cover a very diverse set of topics that includes hidden Markov processes, graphical models, belief propagation, topic models, voting systems, compressed sensing, optimization methods and wavelets among others.

Discussion

Overall this looks like a good stab at the subject and a big advantage of this book is that unlike some of its competitors it treats its topics with mathematical rigor. The only chapter that I personally don’t really see fit into a “data science” class is Chapter 4. Because of its focus on the Erdos-Renyi model that I haven’t seen being used realistically for graph modeling applications this chapter seems to be mostly of purely mathematical interest.

Selection of some of the smaller topics is a matter of personal taste, especially when it comes to those that are missing. A couple of quick suggestions are to cover new sketching algorithms for high-dimensional linear regression, locality-sensitive hashing and possibly Sparse FFT.

Slides will be posted here and I will write a report on the final selection of topics and my experience at the end of the semester. Stay tuned :)

Teaching Foundations of Data Science was originally published by Grigory Yaroslavtsev at The Big Data Theory on August 27, 2016.
