The Big Data TheoryJekyll2020-06-15T04:23:19+00:00http://grigory.github.io/blog/Grigory Yaroslavtsevhttp://grigory.github.io/blog/grigory@grigory.us<![CDATA[Theory Jobs 2020]]>http://grigory.github.io/blog/theory-jobs-20202020-06-14T00:00:00+00:002020-06-14T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>It’s been an unusually challenging year for both sides of the TCS job market with some unexpected obstacles and delays. Apologies for putting up the spreadsheet later than usual and congrats to both sides in each converged process!</p>
<p><a href="https://docs.google.com/spreadsheets/d/1kzq4xVyU1k5CUTrV0yjIgzqlcv8agZqN_jiVlbYJb9g/edit?usp=sharing">Here is a link</a> to a crowdsourced spreadsheet created to collect information about theory jobs this year.
I put in a biased pseudorandom seed, please help populate and share!
Rules for the spreadsheet have been copied from previous years (with one substantial suggestion regarding senior hires based on one of my friends’ recommendation, see below) and all edits to the document are anonymized. Please, post a comment if you have any suggestions about the rules.</p>
<ul>
<li>Separate sheets for faculty, industry and postdocs/visitors. </li>
<li>People should be connected to theoretical computer science, broadly defined.</li>
<li>Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. <b>New:</b> Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future. </li>
<li>You are welcome to add yourself, or people your department has hired. </li>
</ul>
<p><a href="http://grigory.github.io/blog/theory-jobs-2020/">Theory Jobs 2020</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on June 14, 2020.</p><![CDATA[How I Spent Last Summer FAQ]]>http://grigory.github.io/blog/how-i-spent-last-summer2019-10-05T00:00:00+00:002019-10-05T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<!--
<div align="center"><img alt="CAML" src="http://grigory.github.io/blog/pics/summer.png" width="400"></div>
-->
<p>I get a lot of questions about how I spent last summer. Normally I just take off to the Bay Area the day my last Spring class is over and fly back the day before my Fall class begins.
However, last summer I decided I’ve been in the US long enough to learn everything it has to offer and it was time to explore life across the pond and spend three months at the Alan Turing Institute in London. Then I had two interns coming over to Bloomington so I spent my first ever summer month here.
Since it is that time of year, a quick reminder to <a href="https://cs.indiana.edu/apply/graduate-application.html">apply by Dec 15</a> if you are interested in doing a Ph.D. and stay tuned for the internship call announcement (probably similar deadline).</p>
<h1 id="summer-interns-in-bloomington">Summer Interns in Bloomington</h1>
<p>IU has started a <a href="https://sice.indiana.edu/research/student-research/fellowship.html">Global Talent Attraction Program (GTAP)</a> – a fantastic program for international summer interns. The program gives you a $4000 stipend and you spend 2 months here at IU. There were a lot of strong applicants so it took me a while to interview all candidates. In the end, the two interns I got were <a href="https://codeforces.com/profile/Chameleon2460">Jakub Boguta</a> (U. Warsaw, ACM ICPC gold this year, must be tough to be in the lead for 4 hours and not win) and <a href="https://codeforces.com/profile/josdas">Stanislav Naumov</a> (SPb ITMO, ACM ICPC finalist, who spent the summer at Google and just arrived on campus). Also, <a href="https://www.eleves.ens.fr/home/farthaud/">Farid Arthaud</a> joined us from ENS Paris, Ulm with a short recommendation of being “probably the best third-year CS student in France”. If you think you are the best in your country, have U.S. citizenship and don’t need to get paid, shoot me an email ;) Despite it being hot and humid here in Bloomington during the summer, we had a great time.</p>
<div align="center"><img alt="interns" src="http://grigory.github.io/blog/pics/interns.jpg" width="400" /></div>
<p><br /></p>
<p>We decided to dive into deep learning for image classification and figured out how to get more mileage out of standard pretrained neural nets by using them to produce hierarchical clusterings (with guarantees). If this sounds fun, you can apply for GTAP next year (picture by Farid).</p>
<div align="center"><img alt="hc" src="http://grigory.github.io/blog/pics/hc.png" width="400" /></div>
<h1 id="london-and-the-alan-turing-institute">London and the Alan Turing Institute</h1>
<p>Overall, this was a great experience as it quickly became clear that my neural net is overfit to the US lifestyle. I think of the UK as throwing in some perturbations to your visual and verbal input (some may seem adversarial, but mostly just random) which, as we know, is good for robustness, generalization and whatnot.</p>
<ul>
<li><b>Q</b>: Is grass greener there? <b>A</b>: Yes, of course. Especially, if you live next to the Regent’s Park.</li>
<li><b>Q</b>: Is it your cup of tea? <b>A</b>: No, I still only function on Redbull, but the afternoon teas are a great experience. Proximity to cutting-edge tech, CS research and startups still matters most to me. However, if you are into math or finance, your mileage will almost certainly vary. Also, London seems perfect for a short-term visit/sabbatical, especially if you want to take a break from the tech hype, write a book, explore Europe, etc.</li>
<li><b>Q</b>: What’s up with the <a href="https://www.turing.ac.uk/">Alan Turing Institute</a> and DeepMind? <b>A</b>: These two are probably the most happening places in the UK right now in academia and industry respectively. They are within a 5-minute walk from each other in King’s Cross. I was staying right across the road and it was perfect except for no AC. ATI serves as a meeting hub for researchers from all of the top UK schools (Cambridge, Oxford, Warwick, UCL, Edinburgh, etc.).
ATI is based inside the British Library, which was the largest public building constructed in the UK in the 20th century. ATI has its own space inside the library which is equipped similarly to Google/FB offices. Except no free food, only drinks – would you want to have free British food anyway?</li>
</ul>
<div align="center"><img alt="ati" src="http://grigory.github.io/blog/pics/ati.jpg" width="400" /></div>
<p><br /></p>
<div align="center"><img alt="bl" src="http://grigory.github.io/blog/pics/bl.jpg" width="400" /></div>
<ul>
<li><b>Q</b>: Is Shoreditch the most hip neighborhood? <b>A</b>: I think so, best Sci-Fi graffiti ever.</li>
</ul>
<div align="center"><img alt="murals" src="http://grigory.github.io/blog/pics/mural.jpg" width="400" /></div>
<ul>
<li><b>Q</b>: Did you meet the King? <b>A</b>: Yes, in Heathrow I ran into a 250-pound dude from Atlanta who made it quite clear that’s him by wearing one of these (except in a larger font and in dirty red color).</li>
</ul>
<div align="center"><img alt="king" src="http://grigory.github.io/blog/pics/king.jpg" width="400" /></div>
<ul>
<li><b>Q</b>: <a href="https://www.youtube.com/watch?v=ViHsfeXNgjY">Is Paris still Paris?</a> <b>A</b>: I think so (my third time). K and I took a 2-hour train down there directly from King’s Cross (St. Pancras station, another reason to stay in King’s Cross). We’ve enjoyed our time greatly, especially in Versailles and ENS Paris, Ulm. The Salvador Dali Museum in Montmartre was another highlight of this trip.</li>
</ul>
<div align="center"><img alt="versailles" src="http://grigory.github.io/blog/pics/versailles.jpg" width="400" /></div>
<ul>
<li><b>Q</b>: Brexit, Boris Johnson? <b>A</b>: Locals made fun of me for having never heard of Boris Johnson. <a href="https://www.youtube.com/watch?v=dXyO_MC9g3k">Is there much to know anyway</a>?</li>
</ul>
<p><a href="http://grigory.github.io/blog/how-i-spent-last-summer/">How I Spent Last Summer FAQ</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on October 05, 2019.</p><![CDATA[Theory Jobs 2019]]>http://grigory.github.io/blog/theory-jobs-20192019-05-30T00:00:00+00:002019-05-30T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p><img src="http://grigory.github.io/blog/pics/theory-jobs-2019.png" />
Apparently, it’s a busy life being an assistant prof so there were no posts here all year. However, while some of us are decompressing after the NeurIPS deadline, <a href="https://docs.google.com/spreadsheets/d/1Oegc0quwv2PqoR_pzZlUIrPw4rFsZ4FKoKkUvmLBTHM/edit?usp=sharing">here is a link</a> to a crowdsourced spreadsheet created to collect information about theory jobs this year.
Congratulations to both job seekers and departments/labs who are done with their searches!</p>
<p>In the past my academic uncle Lance Fortnow set this spreadsheet up (check <a href="https://blog.computationalcomplexity.org/2017/06/theory-jobs-2016.html">this link</a> to his post from two years ago which also has links to all the previous years). This year the first entry is Lance himself who is moving back to Chicago to be the Dean of the College of Science at the Illinois Institute of Technology. Did Lance get the idea from his advisor <a href="https://en.wikipedia.org/wiki/Michael_Sipser">Michael Sipser</a> who is also a Dean of Science but at MIT? In any case, great to see theoretical computer scientists stepping up to be the deans of science, congratulations!</p>
<p>Rules about the spreadsheet have been copied from previous years and all edits to the document are anonymized. Please, post a comment if you have any suggestions about the rules.</p>
<ul>
<li>Separate sheets for faculty, industry and postdocs/visitors. </li>
<li>People should be connected to theoretical computer science, broadly defined.</li>
<li>Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. </li>
<li>You are welcome to add yourself, or people your department has hired. </li>
</ul>
<p>This document will continue to grow as more jobs settle.</p>
<iframe width="100%" height="100%" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQH_Z_7vZDsRkiMPowtfImjJ4kuJXG3mA3tzLbveIBmTQG5EBNyJt7eH6XeDQMaGK4edu2hNmPmB0hg/pubhtml?widget=true&headers=false"></iframe>
<p><a href="http://grigory.github.io/blog/theory-jobs-2019/">Theory Jobs 2019</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on May 30, 2019.</p><![CDATA[Theory Jobs 2018]]>http://grigory.github.io/blog/theory-jobs-20182018-05-25T00:00:00+00:002018-05-25T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p><img src="http://grigory.github.io/blog/pics/theory-jobs-2018.png" />
<a href="https://docs.google.com/spreadsheets/d/1P5okKjeNlvkEEFMzX3l8VcL4SHpWBKNFET-5mPTWfDQ/edit#gid=0">Here is a link</a> to a crowdsourced spreadsheet created to collect information about theory jobs this year. Previously my academic uncle Lance Fortnow set it up (check <a href="https://blog.computationalcomplexity.org/2017/06/theory-jobs-2016.html">this link</a> to his post from last year which also has links to all the previous years), but this year he has kindly agreed to try and pass the baton. Rules about the spreadsheet have been copied from previous years and all edits to the document are anonymized.</p>
<ul>
<li>Separate sheets for faculty, industry and postdocs/visitors. <b>New:</b> As suggested by <a href="http://onak.pl">Krzysztof Onak</a> a new tab for sabbaticals was added.</li>
<li>People should be connected to theoretical computer science, broadly defined.</li>
<li>Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. </li>
<li>You are welcome to add yourself, or people your department has hired. </li>
</ul>
<p>This document will continue to grow as more jobs settle.</p>
<iframe width="100%" height="100%" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQjB0ACZDBHFnKc4_OZPqtpMCdp3VbmtIQjlSvZtycgzlIQ0DOUoNcYBgy_fQccMjczZk1iAwhTTLTn/pubhtml?widget=true&headers=false"></iframe>
<p><a href="http://grigory.github.io/blog/theory-jobs-2018/">Theory Jobs 2018</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on May 25, 2018.</p><![CDATA[Center for Algorithms and Machine Learning]]>http://grigory.github.io/blog/caml2018-03-11T00:00:00+00:002018-03-11T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<div align="center"><img alt="CAML" src="http://grigory.github.io/blog/pics/caml.png" width="400" /></div>
<p>This Friday we had the official kickoff event for the new <a href="http://caml.indiana.edu">Center for Algorithms and Machine Learning</a> here at IU.
Huge thanks to my wonderful co-director <a href="https://www.cs.indiana.edu/~djcran/">David Crandall</a> whose wisdom and support have been instrumental in forming CAML!</p>
<p>This has been in the works since the day I accepted an offer from IU almost two years ago (some things at universities take longer than you might expect).
Also thanks to our board members, both internal
(<a href="https://www.cs.indiana.edu/~vgucht/">Dirk van Gucht</a>,
<a href="http://shiffrin.cogs.indiana.edu/">Richard Shiffrin</a>,
<a href="https://www.informatics.indiana.edu/hatang/">Haixu Tang</a>,
<a href="http://mypage.iu.edu/~stanwass/">Stanley Wasserman</a>) and external (<a href="http://hunch.net/~jl/">John Langford</a>,
<a href="https://edoliberty.github.io/">Edo Liberty</a>,
<a href="https://people.csail.mit.edu/mirrokni/Welcome.html">Vahab Mirrokni</a>,
<a href="https://www.linkedin.com/in/maxim-sviridenko-44822991/">Maxim Sviridenko</a>)
for their advice and readiness to serve.
Quite a few other people were involved behind the scenes, thanks to everyone!</p>
<p>Among other things we have a <a href="http://grigory.us/blog/caml-postdoc-18/">postdoc position</a> open.</p>
<p>Photo credits: <a href="https://www.cs.indiana.edu/~djcran/">David Crandall</a> and <a href="http://michaelryoo.com/">Michael S. Ryoo</a>.
<br /></p>
<div align="center"><img alt="CAML" src="http://grigory.github.io/blog/pics/caml-kickoff.jpeg" /></div>
<p><img alt="CAML" src="http://grigory.github.io/blog/pics/pic4.jpg" style="float:left" width="50%" /><img alt="CAML" src="http://grigory.github.io/blog/pics/pic6.jpg" style="float:left" width="50%" />
<img alt="CAML" width="50%" style="float:left" src="http://grigory.github.io/blog/pics/pic3.jpg" /> <img alt="CAML" src="http://grigory.github.io/blog/pics/pic7.jpg" style="float:left" width="50%" />
<img alt="CAML" src="http://grigory.github.io/blog/pics/pic5.jpg" style="float:left" width="50%" /><img alt="CAML" src="http://grigory.github.io/blog/pics/pic2.jpg" style="float:left" width="50%" />
<img alt="CAML" src="http://grigory.github.io/blog/pics/pic8.jpg" style="float:left" width="100%" /></p>
<p><a href="http://grigory.github.io/blog/caml/">Center for Algorithms and Machine Learning</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on March 11, 2018.</p><![CDATA[Postdoc at the Center for Algorithms and Machine Learning]]>http://grigory.github.io/blog/caml-postdoc-182018-03-03T00:00:00+00:002018-03-03T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<div align="center"><img alt="Postdoc@CAML" src="http://grigory.github.io/blog/pics/postdoc-caml.png" /> </div>
<p>The Center for Algorithms and Machine Learning (CAML, <a href="http://caml.indiana.edu">http://caml.indiana.edu</a>) at Indiana University Bloomington invites applications for a postdoctoral researcher position starting Fall 2018. Applicants should have a strong background in theoretical foundations of computing and algorithms for large data, including applications to machine learning and data science. Questions about the position can be directed to Professor Grigory Yaroslavtsev (<a href="mailto:gyarosla@iu.edu">gyarosla@iu.edu</a>).</p>
<p><br /></p>
<p>The position is initially for one year with a possibility of renewal for an additional year based on satisfactory job performance and continued funding. The CAML postdoctoral researcher will be located in <a href="https://www.sice.indiana.edu/about/luddy-hall.html">the new Luddy Hall building</a> and will receive a competitive salary and a comprehensive set of benefits. Bloomington is a vibrant college town <a href="https://en.wikipedia.org/wiki/Bloomington,_Indiana">known as</a> the “Gateway to Scenic Southern Indiana.” Our campus, located within an hour from Indianapolis, is renowned for its music scene and cultural diversity.</p>
<p><br /></p>
<p>Applications should be submitted by email to iu.caml.postdoc@gmail.com with your name as the subject (e.g. “Alan Turing”), and should include a CV, a research statement, and the names of at least 3 references. Please ask your references to send letters to the same address using (“Your Name Their Name Recommendation Letter”) as the subject (e.g. “Alan Turing Alonzo Church Recommendation Letter”). Applicants should also complete the formal application through the IU hiring system here: <a href="http://indiana.peopleadmin.com/postings/5583">http://indiana.peopleadmin.com/postings/5583</a></p>
<p><br /></p>
<p>Application deadline for full consideration (including recommendation letters): <b>March 31</b>. Applications will continue to be considered after the deadline until the position is filled.</p>
<p><a href="http://grigory.github.io/blog/caml-postdoc-18/">Postdoc at the Center for Algorithms and Machine Learning</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on March 03, 2018.</p><![CDATA[Workshops in 2018]]>http://grigory.github.io/blog/workshops-20182018-02-23T00:00:00+00:002018-02-23T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>I got invited to talk at quite a few workshops this year and am close to reaching the limit of travel I can handle. Just want to help the organizers advertise these events (most of them on sublinear algorithms and complexity). Please, consider attending and help spread the word!</p>
<ul>
<li><a href="https://warwick.ac.uk/fac/sci/dcs/research/focs/conf2017">Workshop on Algorithms for Data Summarization</a> at the University of Warwick, UK (March 19-22). Organized by Graham Cormode and Artur Czumaj.</li>
<li>68th Midwest Theory Day(s) at TTI-Chicago (April 12-13). Organized by Madhur Tulsiani, Aravindan Vijayaraghavan and Anindya De among others.</li>
<li>Workshop on Sublinear Algorithms (June 11-13) and 2nd Workshop on Local Algorithms (June 14-15) at MIT. WoLA is organized by Mohsen Ghaffari, Reut Levi, Moti Medina, Andrea Montanari, Elchanan
Mossel and Ronitt Rubinfeld.</li>
<li><a href="https://simons.berkeley.edu/complexity2018-2">Workshop on Interactive Complexity</a> at the Simons Institute, Berkeley (October 15-19). Organized by Kasper Green Larsen, Mark Braverman and Michael Saks.</li>
</ul>
<p>Hope to see some of you there!</p>
<p><a href="http://grigory.github.io/blog/workshops-2018/">Workshops in 2018</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on February 23, 2018.</p><![CDATA[BeyondMR18 Deadline Approaching]]>http://grigory.github.io/blog/beyondmr-20182018-01-29T00:00:00+00:002018-01-29T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<div align="center"><img alt="BeyondMR18" src="http://grigory.github.io/blog/pics/beyondmr18.png" /> </div>
<p><br /></p>
<p>The deadline for submissions to <a href="https://sites.google.com/site/beyondmr2018/home">BeyondMR’18</a> (5th Algorithms and Systems for MapReduce and Beyond Workshop) is in about 3 weeks.
This workshop will be held in conjunction with SIGMOD/PODS.
As a PC member I would personally like to stress the “Beyond” part as both theory and systems have by now gone way further than just MapReduce.
Please, <a href="https://sites.google.com/site/beyondmr2018/submission">consider submitting your work</a> – you will get feedback from a <a href="https://sites.google.com/site/beyondmr2018/program-committee">healthy mix of researchers and engineers from both academia and industry</a>.</p>
<p><a href="http://grigory.github.io/blog/beyondmr-2018/">BeyondMR18 Deadline Approaching</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on January 29, 2018.</p><![CDATA[What's New in the Big Data Theory 2017]]>http://grigory.github.io/blog/whats-new-in-big-data-theory-20172018-01-27T00:00:00+00:002018-01-27T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<div align="center"><img alt="Happy 2018!" src="http://grigory.github.io/blog/pics/o2017.png" /> </div>
<p><br /></p>
<p>This year I will continue the <a href="http://grigory.us/blog/whats-new-in-big-data-theory-2016/">tradition started last year</a> and summarize a few papers on efficient algorithms for big data that caught my attention last year.
Same disclaimers as last year apply and this is by no means supposed to be the list of “best” papers in the field which is quite loosely defined anyway (e.g. I will intentionally avoid deep learning and gradient descent methods here as I am not actively working in these areas myself and there are a lot of resources on these topics already).
In particular, this year it was even harder to pick clear favorites so it is even more likely that I have missed some excellent work.
Below I will assume familiarity with the basics of streaming algorithms and the massively parallel computation model (MPC) discussed in an <a href="http://grigory.us/blog/mapreduce-model/">earlier post</a>.</p>
<p>Before we begin let me quickly plug some of my own work from last year.
With my student Adithya Vadapalli we have a new paper “<a href="https://arxiv.org/pdf/1710.01431.pdf">Massively Parallel Algorithms and Hardness of Single-Linkage Clustering under <script type="math/tex">\ell_p</script>-distances</a>”. As it turns out, while single-linkage clustering and minimum spanning tree problems are the same for exact computation, for vector data the round complexity of approximating these two problems in the MPC model is quite different.
In <a href="http://grigory.us/files/approx-linsketch.pdf">another paper</a> I introduce a study of approximate binary linear sketching of valuation functions.
This is an extension of our <a href="https://eccc.weizmann.ac.il/report/2016/174/">recent study</a> of binary linear sketching to the case when the function of interest should only be computed approximately.</p>
<!--
Lack of clear favorites probably means that we might not see all of these results taught in advanced algorithms classes but rather used for reading groups, etc.
Many of the papers listed below have been discussed in detail in our ``algorithms for big data'' reading group here at IU.
-->
<h2>New Massively Parallel Algorithms for Matchings</h2>
<p>The search for new algorithms for matchings has led to the development of new algorithmic ideas for many decades (motivating the study of the class P of polynomial-time algorithms) and this year is no exception.
Two related papers on matchings caught my attention this year:</p>
<ul>
<li>“<a href="https://arxiv.org/abs/1707.03478">Round Compression for Parallel Matching Algorithms</a>” by Czumaj, Lacki, Madry, Mitrovic, Onak and Sankowski.</li>
<li>“<a href="https://arxiv.org/abs/1711.03076">Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs</a>” by Assadi, Bateni, Bernstein, Mirrokni and Stein.</li>
</ul>
<p>Both papers are highly technical but achieve similar results.
The first paper gives an <script type="math/tex">O((\log \log |V|)^2)</script>-round MPC algorithm for the maximum matching problem that uses <script type="math/tex">O(|V|)</script> memory per machine. The second paper improves the number of rounds down to <script type="math/tex">O(\log \log |V|)</script> using slightly larger memory <script type="math/tex">O(|V| polylog (|V|))</script> per machine.
Using a standard reduction mentioned in the latter paper both papers can achieve multiplicative <script type="math/tex">(1+\epsilon)</script>-approximation for any constant <script type="math/tex">\epsilon > 0</script>.
These results should be contrasted with the <a href="http://theory.stanford.edu/~sergei/papers/spaa11-matchings.pdf">previous work</a> by Lattanzi, Moseley, Suri and Vassilvitskii who give <script type="math/tex">1/c</script>-round algorithms at the expense of using <script type="math/tex">O(|V|^{1 + c})</script> memory per machine for any constant <script type="math/tex">c > 0</script>.
Overall, this is remarkable progress but likely not the end of the story.</p>
<h2>Massively Parallel Methods for Dynamic Programming</h2>
<p>Dynamic programming, <a href="https://www.rand.org/content/dam/rand/pubs/papers/2008/P550.pdf">pioneered by Bellman at RAND</a>, is one of the key techniques in algorithm design. Some would even go as far as saying that there are only two algorithmic techniques and dynamic programming is one of them.
However, dynamic programming is notoriously sequential and difficult to use for sublinear time/space computation.
Most successful stories of speeding up dynamic programming so far have been problem-specific and often highly non-trivial.</p>
<p>In their paper “<a href="http://www.andrew.cmu.edu/user/moseleyb/papers/stoc17-main279.pdf">Efficient Massively Parallel Methods for Dynamic Programming</a>” (STOC’17) Im, Moseley and Sun suggest a fairly generic approach for designing massively parallel dynamic programming algorithms.
Three textbook dynamic programming problems can be handled within their framework:</p>
<ul>
<li>Longest Increasing Subsequence: multiplicative <script type="math/tex">(1+\epsilon)</script>-approximation in <script type="math/tex">O(1/\epsilon^2)</script> rounds of MPC.</li>
<li>Optimal Binary Search Tree: multiplicative <script type="math/tex">(1+\epsilon)</script>-approximation in <script type="math/tex">O(1)</script> rounds of MPC.</li>
<li>Weighted Interval Scheduling: multiplicative <script type="math/tex">(1+\epsilon)</script>-approximation in <script type="math/tex">O(\log 1/\epsilon)</script> rounds of MPC.</li>
</ul>
<p>On a technical level this paper identifies two key properties that these problems have in common: monotonicity and decomposability. Monotonicity just requires that the answer to a subproblem should always be at most (for maximization) or at least (for minimization) the answer to the problem itself.
Decomposability is more subtle and requires that the problem can be decomposed into a two-level recursive family of subproblems where entries of the top level are called groups and entries of the bottom level are called blocks.
It should then be possible to 1) construct a nearly optimal solution for the entire problem by concatenating solutions for subproblems, 2) construct a nearly optimal solution for each group from only a constant number of blocks.
While monotonicity holds for many standard dynamic programming problems, decomposability seems much more restrictive so it is interesting to see whether this technique can be extended to some other problems.</p>
<p>See Ben Moseley’s <a href="http://caml.indiana.edu/slides/ben.pdf">presentation</a> at the Midwest Theory Day for more details.</p>
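<p>To make the monotonicity property concrete, here is a minimal Python sketch for Longest Increasing Subsequence. It uses the standard sequential patience-sorting routine rather than the MPC algorithm from the paper, and the names <code>lis_length</code> and <code>block_answers</code> are made up for this illustration.</p>

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (sequential, O(n log n))."""
    # tails[k] = smallest possible last element of an increasing subsequence of length k + 1
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[pos] = x    # x gives a smaller tail for length pos + 1
    return len(tails)


def block_answers(seq, block_size):
    """LIS length of each contiguous block (hypothetical helper, illustration only)."""
    return [lis_length(seq[i:i + block_size]) for i in range(0, len(seq), block_size)]


seq = [3, 1, 4, 1, 5, 9, 2, 6]
whole = lis_length(seq)
blocks = block_answers(seq, 4)
# Monotonicity: no block can have a longer increasing subsequence than the full sequence.
assert all(b <= whole for b in blocks)
```

<p>Note that simply adding up the block answers can overcount (for <code>[5, 9, 2, 6]</code> split into two blocks of size 2 the sum is 4 while the true answer is 2), which is roughly why decomposability asks for a more careful group/block structure than plain concatenation.</p>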
<h2>Randomized Composable Coresets for Matching and Vertex Cover</h2>
<p>The simplest massively parallel algorithm one can think of can be described as follows: partition the data between <script type="math/tex">k</script> machines, let each machine select a small subset of the data points, collect these locally selected data points on one central machine and compute the solution there.
The hardest part here is the design of the local subset selection procedures.
Such subsets are called coresets and have received a lot of attention in the study of algorithms for high-dimensional vectors, see e.g. <a href="http://sarielhp.org/p/04/survey/survey.pdf">this survey</a> by Agarwal, Har-Peled and Varadarajan.</p>
<p>Note that in the distributed setting construction of coresets can be affected by the initial distribution of data.
In fact, if no assumptions about the distribution of the data are made, then for the maximum matching problem non-trivially small coresets can’t be used to approximate the maximum matching up to a reasonable error (see <a href="http://grigory.us/files/soda16.pdf">our paper</a> with Assadi, Khanna and Li for the exact statement which in fact rules out not just coresets but any small-space representations).</p>
<p>However, if the initial distribution of data is uniformly random then the situation changes quite dramatically.
As shown in “<a href="http://www.seas.upenn.edu/~sassadi/stuff/papers/randomized-coreset_matching-vc.pdf">Randomized Composable Coresets for Matching and Vertex Cover</a>” by Assadi and Khanna (best paper at SPAA’17) for uniformly distributed data coresets of size <script type="math/tex">O(|V| polylog |V|)</script> can be computed locally and then combined to obtain <script type="math/tex">O(polylog |V|)</script>-approximation.</p>
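<p>The partition/select/combine pattern described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a greedy maximal matching stands in for the local selection rule (it is not claimed to be the rule analyzed in the paper), the machines are simulated by lists, and the names <code>greedy_matching</code> and <code>coreset_matching</code> are hypothetical.</p>

```python
import random


def greedy_matching(edges):
    """Greedy maximal matching: scan edges and keep those with both endpoints still free."""
    matched = set()
    matching = []
    for u, v in edges:
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    return matching


def coreset_matching(edges, k, seed=0):
    """Simulate k machines: random edge partition, local selection, central combine."""
    rng = random.Random(seed)
    parts = [[] for _ in range(k)]
    for e in edges:
        parts[rng.randrange(k)].append(e)  # uniformly random initial distribution
    # Each simulated machine sends only its locally selected edges (assumed stand-in for a coreset).
    coreset = [e for part in parts for e in greedy_matching(part)]
    # The central machine solves the problem on the union of the local selections.
    return greedy_matching(coreset)


path = [(i, i + 1) for i in range(10)]
result = coreset_matching(path, k=3, seed=1)
endpoints = [v for e in result for v in e]
assert len(endpoints) == len(set(endpoints))  # the output is a valid matching
```

<p>The sketch makes no approximation claim; it only shows where the random partition enters and what gets communicated to the central machine.</p>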
<h2>Optimal Lower Bounds for L<sub>p</sub>-Samplers, etc </h2>
<p>Consider the following problem: a vector <script type="math/tex">x \in \{0,1\}^n</script> (initially consisting of all zeros) is changed by a very long sequence of updates that can flip an arbitrary coordinate of this vector. After seeing this sequence of updates can we retrieve some non-zero entry of this vector without storing all <script type="math/tex">n</script> bits used to represent <script type="math/tex">x</script>?
Surprisingly, the answer is “yes” and this can be done with only <script type="math/tex">O(poly(\log n))</script> space.
If we are required to generate a uniformly random non-zero entry of <script type="math/tex">x</script> then the corresponding problem is called <script type="math/tex">L_0</script>-sampling.</p>
<p><script type="math/tex">L_0</script>-sampling turns out to be a remarkably useful primitive in the design of small-space algorithms. Almost all known streaming algorithms for dynamically changing graphs are based on <script type="math/tex">L_0</script>-sampling or its relaxation where the uniformity requirement is removed.</p>
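<p>Here is a rough Python sketch of the relaxed version of the problem (return <i>some</i> non-zero coordinate, with no uniformity guarantee). It subsamples coordinates at geometrically decreasing rates and keeps, for each level, the XOR of the surviving indices together with the XOR of their fingerprints, so a level with exactly one surviving non-zero coordinate reveals it. The class name <code>FlipSketch</code>, its methods and the use of BLAKE2 as a stand-in for the hash families used in formal analyses are all assumptions made for this illustration, and the sketch may return <code>None</code> when no level isolates a single coordinate.</p>

```python
import hashlib


def _h(salt, i):
    """Deterministic 64-bit hash of index i (stand-in for a random hash function)."""
    digest = hashlib.blake2b(f"{salt}:{i}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class FlipSketch:
    """Stores O(log n) pairs of words for a vector in {0,1}^n under coordinate flips."""

    def __init__(self, n, seed=0):
        self.n = n
        self.seed = seed
        self.levels = max(1, n.bit_length()) + 1
        self.idx_xor = [0] * self.levels  # XOR of (index + 1) over surviving non-zero coordinates
        self.fp_xor = [0] * self.levels   # XOR of their 64-bit fingerprints

    def flip(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        r = _h(f"level-{self.seed}", i)
        fp = _h(f"fp-{self.seed}", i)
        for j in range(self.levels):
            if r % (1 << j) != 0:  # coordinate i survives level j with probability about 2^-j
                break
            self.idx_xor[j] ^= i + 1
            self.fp_xor[j] ^= fp

    def recover(self):
        """Return a non-zero coordinate if some level isolates exactly one, else None."""
        for j in range(self.levels):
            cand = self.idx_xor[j] - 1
            if 0 <= cand < self.n and self.fp_xor[j] == _h(f"fp-{self.seed}", cand):
                return cand
        return None


sketch = FlipSketch(16, seed=1)
for i in (5, 7, 7):  # coordinate 7 is flipped twice, so only coordinate 5 is non-zero
    sketch.flip(i)
assert sketch.recover() == 5
```

<p>Both updates and the stored state are linear over GF(2), which is what lets the same sketch absorb an arbitrarily long sequence of flips.</p>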
<p>While almost optimal upper and lower bounds on the amount of space necessary for <script type="math/tex">L_0</script>-sampling have been known since <a href="https://arxiv.org/pdf/1012.4889.pdf">the work</a> of Jowhari, Saglam and Tardos, there were still gaps in terms of the dependence on success probability.
If our recovery of a non-zero entry of <script type="math/tex">x</script> has to be successful with probability <script type="math/tex">1 - \delta</script> then the tight bound on space turns out to be <script type="math/tex">\Theta(\min(n, \log (1/\delta) \log^2 (\frac{n}{\log 1/\delta})))</script>.
This is one of the results of the <a href="http://people.seas.harvard.edu/~minilek/publications/papers/sampler_lb_merged.pdf">recent FOCS’17 paper</a> by Kapralov, Nelson, Pachocki, Wang, Woodruff and Yahyazadeh.</p>
<h2>Looking forward to more results in 2018!</h2>
<p>Please, let me know if there are any other interesting papers that I missed.
Also, a quick shout-out goes to some other papers that were close to making the above list:</p>
<ul>
<li>“<a href="https://nips.cc/Conferences/2017/Schedule?showEvent=9453">Affinity Clustering: Hierarchical Clustering at Scale</a>” by Bateni, Behnezhad, Derakhshan, Hajiaghayi, Kiveris, Lattanzi and Mirrokni (NIPS’17).</li>
<li>“<a href=" https://www.ilyaraz.org/static/papers/lshforest.pdf">LSH Forest: Practical Algorithms Made Theoretical</a>” by Andoni, Razenshteyn and Shekel Nosatzki (SODA’17).</li>
<li>“<a href="https://arxiv.org/abs/1610.08096">Almost Optimal Streaming Algorithms for Coverage Problems</a>” by Bateni, Esfandiari and Mirrokni (SPAA’17).</li>
<li>“<a href="http://www.cs.utexas.edu/~ecprice/papers/compressed-generative.pdf">Compressed Sensing using Generative Models</a>” by Bora, Jalal, Price and Dimakis (ICML’17).</li>
<li>“<a href="https://arxiv.org/pdf/1707.08484.pdf ">MST in O(1) Rounds of Congested Clique</a>” by Jurdzinski and Nowicki (SODA’18).</li>
</ul>
<p><a href="http://grigory.github.io/blog/whats-new-in-big-data-theory-2017/">What's New in the Big Data Theory 2017</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on January 27, 2018.</p><![CDATA[Video Recording Screencasts of Talks]]>http://grigory.github.io/blog/video-recording-class-screencast2017-08-01T00:00:00+00:002017-08-01T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>Here is the setup I came up with for video recording live lectures and talks.</p>
<p>Hardware:</p>
<ul>
<li> Laptop with a camera: MacBook</li>
<li> Tablet with a pen/stylus: iPad (Pro), MSRP $1K</li>
<li> Microphone: Sennheiser ME 3-EW, MSRP $200</li>
<li> Projector (connected to the laptop)</li>
</ul>
<p>Software:</p>
<ul>
<li> Open Broadcasting Software (laptop), free</li>
<li> Microsoft Powerpoint and Office 365 (tablet), free from IU</li>
<li> AirServer (laptop) and AirServer Connect (tablet), $10</li>
</ul>
<p>Full description and demo on Youtube:
<br /></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/pYfldndPkWs" frameborder="0" allowfullscreen=""></iframe>
<p><a href="http://grigory.github.io/blog/video-recording-class-screencast/">Video Recording Screencasts of Talks</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on August 01, 2017.</p><![CDATA[Advice on video recording lectures?]]>http://grigory.github.io/blog/video-recording-class-question2017-07-31T00:00:00+00:002017-07-31T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>Looking for advice: I am thinking of making videos of my class this Fall, the setup I am thinking about is a tablet (just ordered a large iPad Pro) + mic and camera + some software (maybe Google Hangouts live?) to record a screencast of the tablet with my video on the side. Did anyone have experience doing something like this? What hardware and software did you use? If this sounds too clunky to set up on your own, is it worth creating a MOOC and/or having professionals do the recording for you?</p>
<p><a href="http://grigory.github.io/blog/video-recording-class-question/">Advice on video recording lectures?</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on July 31, 2017.</p><![CDATA[Theory Jobs 2017]]>http://grigory.github.io/blog/theory-jobs-20172017-06-08T00:00:00+00:002017-06-08T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p><b>UPD</b>: Lance created his Theory Jobs spreadsheet, I’ve moved all the information there and changed the link below to Lance’s spreadsheet.</p>
<p><a href="https://docs.google.com/spreadsheets/d/1xBpgBZXcSxjEAbU7SYCeXOJOJtPeXYCOFArqL28Uho8/edit?usp=sharing">Here is a link</a> to a crowdsourced spreadsheet created by Lance Fortnow that collects information about theory jobs this year. <strike>Previously Lance set it up, but this year it is getting late in the year so I decided to go ahead and create one myself. In previous years the jobs post was up a few weeks back so I hope I am not jumping the gun here. Rules about the spreadsheet have been copied from <a href="http://blog.computationalcomplexity.org/2016/05/theory-jobs-2016.html">Lance's last year post</a> and all edits to the document are anonymized.
</strike></p>
<ul>
<li>Separate sheets for faculty, industry and postdoc/visitors. </li>
<li>People should be connected to theoretical computer science, broadly defined.</li>
<li>Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. </li>
<li>You are welcome to add yourself, or people your department has hired. </li>
</ul>
<p>This document will continue to grow as more jobs settle.</p>
<iframe width="100%" height="100%" src="https://docs.google.com/spreadsheets/d/1_DkI62xQF0CSNavP9AneOQFFccm-EXu_zvO_CWoYBv0/pubhtml?widget=true&headers=false"></iframe>
<p><a href="http://grigory.github.io/blog/theory-jobs-2017/">Theory Jobs 2017</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on June 08, 2017.</p><![CDATA[67th Midwest Theory Day]]>http://grigory.github.io/blog/mtd2017-04-30T00:00:00+00:002017-04-30T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<div align="center"><img alt="IU :)" src="http://grigory.github.io/blog/pics/iu-sample-gates.jpg" /> </div>
<p><br />
<a href="http://caml.indiana.edu/mtd.html">67th Midwest Theory</a> <strike>Day</strike> Weekend took place two weeks ago here at Indiana University, Bloomington organized by <a href="http://homes.soic.indiana.edu/qzhangcs/">Qin Zhang</a>, <a href="http://homes.soic.indiana.edu/yzhoucs/">Yuan Zhou</a> and myself. We were lucky to have a flurry of fantastic speakers from the Midwest and three external headliners: <a href="http://www.mit.edu/~andoni/">Alex Andoni</a>, <a href="https://people.csail.mit.edu/mirrokni/Welcome.html">Vahab Mirrokni</a> and <a href="https://www.cs.cmu.edu/~odonnell/ ">Ryan O’Donnell</a>. Slides are now posted online.</p>
<p>As organizers we’ve decided to experiment with the format of this MTD, making it a 2-day event. Based on the feedback we received we believe that this format has worked out perfectly. For an area as geographically spread out as the Midwest, travel logistics often make the overhead of attending a one-day event too high for many of those interested in coming.</p>
<p><a href="http://grigory.github.io/blog/mtd/">67th Midwest Theory Day</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on April 30, 2017.</p><![CDATA[What's New in the Big Data Theory 2016]]>http://grigory.github.io/blog/whats-new-in-big-data-theory-20162016-12-30T00:00:00+00:002016-12-30T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<div align="center"><img alt="Happy 2017!" src="http://grigory.github.io/blog/pics/o2016.png" /> </div>
<p><br /></p>
<p>This post will give an overview of papers on theory of algorithms for big data that caught my attention in 2016.
The basic rule that I used when making the list was whether I can see these results being included into some of the advanced graduate classes on algorithms in the future.
Also, while I obviously can’t include my own results here, among my own 2016 papers my two personal favorites are <a href="http://grigory.us/files/soda16.pdf">tight bounds on space complexity of computing approximate matchings in dynamic streams</a> (with S. Assadi, S. Khanna and Y. Li) and the <a href="http://eccc.hpi-web.de/report/2016/174/"><script type="math/tex">\mathbb F_2</script>-sketching paper</a> (with S. Kannan and E. Mossel and some special credit to Swagato Sanyal who subsequently improved the dependence on error in one of our main theorems).</p>
<p>It’s been a great year with several open problems resolved, old algorithms improved and new lines of research started.
All papers discussed below are presented in no particular order and their selection is clearly somewhat biased towards my own research interests.</p>
<h2>Maximum Weighted Matching in Semi-Streaming</h2>
<p>Sweeping both the best paper and the best student paper awards at the upcoming 28th ACM-SIAM Symposium on Discrete Algorithms is a paper on semi-streaming algorithms for maximum weighted matching by graduate students Ami Paz and Gregory Schwartzman.
In semi-streaming we are given one pass over the edges of an <script type="math/tex">n</script>-vertex graph and only <script type="math/tex">\tilde O(n)</script> bits of space.
It is easy to get a 2-approximation to the maximum matching by just maintaining the maxim<strong>al</strong> matching of the graph.
However, for weighted graphs maximal matching no longer guarantees a 2-approximation.</p>
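<p>The unweighted baseline mentioned above fits in a few lines. This is a minimal sketch, with the function name <code>greedy_maximal_matching</code> chosen here for illustration: it keeps an edge exactly when both endpoints are still free, so it only ever stores a set of matched vertices.</p>

```python
def greedy_maximal_matching(edge_stream):
    """One pass over the edges; keep an edge iff both endpoints are still free."""
    matched = set()   # vertices already covered by the matching
    matching = []     # edges kept so far
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


# Path 1-2-3-4: the maximum matching has 2 edges, but this arrival order
# leaves the greedy algorithm with only the middle edge.
print(greedy_maximal_matching([(2, 3), (1, 2), (3, 4)]))  # [(2, 3)]
```

<p>The example shows that the factor of 2 can actually be lost: the order in which edges arrive decides which maximal matching we end up with.</p>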
<p>A long line of work has previously given constant factor approximations for this problem and finally we have a <script type="math/tex">2+\epsilon</script>-approximation.
It is achieved via a careful implementation of the primal-dual algorithm for matchings in the semi-streaming setting.
It may seem somewhat surprising that primal-dual hasn’t been applied to this problem before since in the area of approximation algorithms it is a pretty standard way of reducing weighted problems to their unweighted versions, but the exact details of how to implement primal-dual in the streaming setting are quite delicate. I couldn’t find a version of this paper online so the best bet might be to wait for the SODA proceedings.</p>
<p>Now the big open question is whether one can beat the 2-approximation, which is open even in the unweighted case.</p>
<h2>Shuffles and Circuits</h2>
<p>Best paper award at the 28th ACM Symposium on Parallelism in Algorithms and Architectures went to “<a href="http://theory.stanford.edu/~sergei/papers/spaa16-mrshuffle.pdf">Shuffles and Circuits</a>”, a paper by Roughgarden, Vassilvitskii and Wang.
This paper emphasizes the difference between rounds of MapReduce and depth of a circuit.
Because some of the machines can choose to stay silent between the rounds, a round of MapReduce can be more complex than a layer of a circuit: which machines send input to the next round might depend on the original input data.
The paper shows that nevertheless the standard circuit complexity “degree bound” can be applied to MapReduce computation.
I.e. any Boolean function whose polynomial representation has degree <script type="math/tex">d</script> requires <script type="math/tex">\Omega(\log_s d)</script> rounds of MapReduce using machines with space <script type="math/tex">s</script>.
This implies an <script type="math/tex">\Omega(\log_s n)</script> lower bound on the number of rounds for computing connectivity of a graph.
The authors also make explicit a connection between the MapReduce model and <script type="math/tex">NC^1</script> (see definition <a href="https://en.wikipedia.org/wiki/NC_(complexity) ">here</a>) which implies that improving lower bounds beyond <script type="math/tex">\log_s n</script> for polynomially many machines would imply separating <script type="math/tex">P</script> from <script type="math/tex">NC^1</script>.</p>
<h2>Beating Counting Sketches for Insertion-Only Streams</h2>
<p>Both <a href="http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarCF.pdf ">CountSketch</a> and <a href="https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch">Count-Min Sketch</a>, which are textbook approximate data structures for storing very large dynamically changing numerical tables in small space, have been improved this year under the assumption that data in the table is only incremented.
These improvements are for the most common application of such sketches, “heavy hitters”: the task of approximately recovering the largest entries of the table.
For CountSketch see <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/bciw16.pdf">the paper</a> by Braverman, Chestnut, Ivkin and Woodruff from STOC’16 and for Count-Min Sketch <a href="https://arxiv.org/abs/1603.00213">the paper</a> by Bhattacharyya, Dey and Woodruff from PODS’16.</p>
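<p>For reference, here is a minimal sketch of the textbook Count-Min Sketch in the insertion-only setting (not the improved algorithms from the papers above). The class name, the integer keys and the choice of the Mersenne prime 2^61 - 1 for the hash functions are assumptions made for illustration.</p>

```python
import random


class CountMinSketch:
    """Textbook Count-Min Sketch for non-negative integer keys."""

    PRIME = (1 << 61) - 1  # Mersenne prime used by the hash functions

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One hash function (a, b) per row: key -> ((a*key + b) mod p) mod width.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, key):
        a, b = self.hashes[row]
        return ((a * key + b) % self.PRIME) % self.width

    def add(self, key, count=1):
        """Increment the counter of `key` in every row."""
        for row in range(len(self.table)):
            self.table[row][self._bucket(row, key)] += count

    def estimate(self, key):
        """With increments only, every row overestimates, so take the minimum."""
        return min(self.table[row][self._bucket(row, key)]
                   for row in range(len(self.table)))


cms = CountMinSketch(width=64, depth=4)
for key in range(1000):
    cms.add(key)
cms.add(42, 500)
print(cms.estimate(42))  # at least 501: never an underestimate
```

<p>The heavy key 42 stands out because collisions with light keys add only a small amount to each of its counters.</p>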
<h2>Optimality of the Johnson-Lindenstrauss Transform</h2>
<p>Two papers by <a href="https://arxiv.org/pdf/1609.02094v1.pdf ">Green Larsen and Nelson</a> and by <a href="http://www.cs.tau.ac.il/~nogaa/PDFS/compression3.pdf">Alon and Klartag</a> have resolved the question of proving optimality of the Johnson-Lindenstrauss transform.
Based on projecting onto a random low-dimensional subspace, the JL transform is the main theoretical tool for dimensionality reduction of high-dimensional vectors.
As these papers show, no low-dimensional embedding, and furthermore no data structure, can achieve better bit complexity than <script type="math/tex">\Theta(n \log n/\epsilon^2)</script> for <script type="math/tex">(1 \pm \epsilon)</script>-approximating all pairwise distances between <script type="math/tex">n</script> vectors in Euclidean space (for a certain regime of parameters).
This matches the Johnson and Lindenstrauss upper bound and improves an old lower bound of <script type="math/tex">\Omega\left(\frac{n \log n}{ \epsilon^2 \log 1/\epsilon}\right)</script> due to Alon.
Even though Alon’s argument is significantly simpler, getting an optimal lower bound is a very nice achievement.</p>
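<p>To see the upper bound in action, here is a minimal pure-Python sketch. It uses a matrix of i.i.d. Gaussian entries scaled by 1/sqrt(k), a common variant that I am swapping in for the projection onto a random subspace; the names <code>jl_project</code> and <code>dist</code> and all the dimensions are illustrative assumptions.</p>

```python
import math
import random


def jl_project(vectors, k, seed=0):
    """Map d-dimensional vectors to k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * v[j] for j in range(d)) for row in matrix]
            for v in vectors]


def dist(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(1)
points = [[rng.random() for _ in range(500)] for _ in range(4)]
low = jl_project(points, k=200)
# Ratio of projected to original distance for one pair; it should be close to 1.
print(dist(low[0], low[1]) / dist(points[0], points[1]))
```

<p>The same matrix has to be applied to every vector, which is why it is generated once from a fixed seed.</p>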
<h2>Fast Algorithm for Edit Distance if It's Small</h2>
<p><a href="https://en.wikipedia.org/wiki/Edit_distance ">Edit distance</a> is one of the cornerstone metrics of text similarity in computer science. It can be computed in quadratic time using standard dynamic programming, which is optimal assuming SETH due to the <a href="https://arxiv.org/abs/1412.0348 ">result of Backurs and Indyk</a>.
Edit distance also has a number of applications including comparing DNAs in computational biology.
In these applications it is usually reasonable to assume that edit distance is only interesting if it is not too large.
Unfortunately, this doesn’t help speed up the standard dynamic program.
A series of papers, including two papers from this year by <a href="http://iuuk.mff.cuni.cz/~koucky/papers/editDistance.pdf ">Chakraborty, Goldenberg and Koucky</a> (STOC’16) and
<a href="http://homes.soic.indiana.edu/qzhangcs/papers/focs16-ED.pdf ">Belazzougui and Zhang</a> lead to the following result: sketches of size <script type="math/tex">poly(K \log n)</script> bits suffice for computing edit distance <script type="math/tex">\le K</script>. Such sketches can be applied not just in centralized but also in distributed and streaming settings making it possible to compress input strings down to size that (up to logarithmic factors) only depends on <script type="math/tex">K</script>.</p>
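<p>For comparison, the standard quadratic dynamic program mentioned above looks roughly like this (a minimal sketch of the classic algorithm, not of the sketching results; the function name is my own). It keeps only two rows of the table at a time.</p>

```python
def edit_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic program with two rows of memory."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        cur = [i]                           # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```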
<h2>Tight Bounds for Set Cover in Streaming</h2>
<p>Set Cover is a surprisingly powerful abstraction for a lot of applications that involve providing coverage for some set of terminals.
Given a collection of sets <script type="math/tex">S_1, \dots, S_m \subseteq [n]</script> the goal is to find the smallest cardinality subcollection of these sets such that their union is <script type="math/tex">[n]</script>, i.e. all of the underlying elements are covered.
In approximation algorithms a celebrated greedy algorithm gives an <script type="math/tex">O(\log n)</script>-approximation for this problem.
In streaming there has been a lot of interest lately in approximating classic combinatorial optimization problems in small space with Set Cover being one of the main examples.
For an overview from last year check Piotr Indyk’s <a href="https://www.youtube.com/embed/_4mM1UGI9Dg?list=PLqxsGMRlY6u659-OgCvs3xTLYZztJpEcW ">talk</a> from the <a href="http://grigory.us/mpc-workshop-dimacs.html ">DIMACS Workshop on Big Data and Sublinear Algorithms</a>.</p>
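<p>As a point of reference, the offline greedy algorithm mentioned above can be sketched as follows (a minimal version that holds all sets in memory, which is exactly what a streaming algorithm cannot afford; the function name is illustrative).</p>

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []                              # indices of the selected sets
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


sets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6}]
print(greedy_set_cover(range(1, 7), sets))  # [0, 3]
```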
<p>As <a href="http://www.seas.upenn.edu/~sassadi/stuff/papers/tbfsscotscp-conf.pdf ">this STOC’16 paper</a> by Assadi, Khanna and Li shows, savings in space for streaming Set Cover can only be proportional to the loss in approximation. In particular, if we are interested in computing a Set Cover which is within a multiplicative factor <script type="math/tex">\alpha</script> of the optimum then:
1) for computing the cover itself space <script type="math/tex">\tilde \Theta(mn/\alpha)</script> is necessary and sufficient,
2) for just estimating the size space <script type="math/tex">\tilde \Theta(mn/\alpha^2)</script> is necessary and sufficient.</p>
<h2>Polynomial Lower Bound for Monotonicity Testing</h2>
<p>Finally a polynomial lower bound has been shown for adaptive algorithms for testing monotonicity of Boolean functions <script type="math/tex">f \colon \{0,1\}^n \rightarrow \{0,1\}</script>.
The lower bound implies that any algorithm that can tell whether <script type="math/tex">f</script> is monotone or differs from monotone on a constant fraction of inputs has to query at least <script type="math/tex">\tilde \Omega(n^{1/4})</script> values of <script type="math/tex">f</script>.
This result is due to <a href="https://arxiv.org/abs/1511.05053 ">Belovs and Blais</a> (STOC’16) and is in contrast with the upper bound of <script type="math/tex">\tilde O(\sqrt{n})</script> by Khot, Minzer and Safra from last year’s FOCS.
Probably the biggest result in property testing this year.</p>
<h2>Linear Hashing is Awesome</h2>
<p>While “<a href="http://ieee-focs.org/FOCS-2016-Papers/3933a345.pdf ">Linear Hashing is Awesome</a>” by Mathias Bæk Tejs Knudsen doesn’t fall into the traditional “sublinear algorithms for big data” category, this paper still has some sublinear flavor because of its focus on very fast query times.
Linear hashing is a classic hashing scheme
<script type="math/tex">h(x) = ((ax + b) \mod p) \mod m</script>
where <script type="math/tex">a,b</script> are random. It is very often used in practice and discussed extensively in CLRS.
This paper proves that linear hashing <strike>is awesome</strike> results in expected length of the longest chain of only <script type="math/tex">O(n^{1/3})</script> compared to the previous simple bound of <script type="math/tex">O(\sqrt{n})</script>.</p>
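<p>The scheme is easy to play with. The sketch below draws random <code>a, b</code>, hashes a set of keys into <code>m</code> buckets and reports the longest chain; the function name, the prime 2^31 - 1 and the particular key set are assumptions chosen for illustration, and the keys are assumed to be smaller than the prime.</p>

```python
import random
from collections import Counter


def longest_chain(keys, m, p, rng):
    """Hash keys with h(x) = ((a*x + b) mod p) mod m and return the max bucket size."""
    a = rng.randrange(1, p)   # random multiplier, non-zero
    b = rng.randrange(p)      # random offset
    chains = Counter(((a * x + b) % p) % m for x in keys)
    return max(chains.values())


P = (1 << 31) - 1             # a prime larger than every key
keys = range(0, 7000, 7)      # n = 1000 structured keys
print(longest_chain(keys, m=1000, p=P, rng=random.Random(0)))
```

<p>Repeating this over many random choices of <code>a, b</code> and adversarially structured key sets gives a feel for the quantity that the theorem bounds in expectation.</p>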
<p>Finally, this paper also decisively wins my “Best Paper Title 2016” award.</p>
<h2>Looking forward to more cool results in 2017!</h2>
<p>There have been a lot of great results in 2016 and it’s hard to mention all of them in one post, so I certainly might have missed some exciting papers. Here is a quick shout-out to some other papers that were close to making the above list:</p>
<ul>
<li><a href="https://arxiv.org/abs/1507.04299 ">Tight Bounds for Data-Dependent LSH</a> by Andoni and Razenshteyn from SoCG'16.</li>
<li><a href="http://arxiv.org/abs/1603.05346 ">Optimal Quantile Estimation in Streams</a> by Karnin, Lang and Liberty from FOCS'16.
</li>
</ul>
<p>Happy 2017!</p>
<p><a href="http://grigory.github.io/blog/whats-new-in-big-data-theory-2016/">What's New in the Big Data Theory 2016</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on December 30, 2016.</p><![CDATA[The Binary Sketchman]]>http://grigory.github.io/blog/the-binary-sketchman2016-10-07T00:00:00+00:002016-10-07T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>In this post I will talk about some of my recent work with <a href="http://www.cis.upenn.edu/~kannan/">Sampath Kannan</a> and <a href="https://stat.mit.edu/people/elchanan-mossel/">Elchanan Mossel</a> on linear methods for binary data compression. The paper is <a href="http://eccc.hpi-web.de/report/2016/174/">available here</a>, slides from my talk at Penn are <a href="http://grigory.us/files/talks/penn16.pdf">here</a> and another talk at Columbia is <a href="http://www.cs.columbia.edu/theory/f16-theoryread.html#Grigory">coming up on Nov 21</a>.</p>
<p>Given very large data represented in binary format as a string of length <script type="math/tex">n</script>, i.e. <script type="math/tex">x \in \{0,1\}^n</script>,
we are interested in a compression algorithm that can transform <script type="math/tex">x</script> into a much shorter binary string <script type="math/tex">y \in \{0,1\}^k</script>.
Here <script type="math/tex">k \ll n</script> so that we can achieve some non-trivial savings in space.
Moreover, if <script type="math/tex">x</script> changes in the future we would like to be able to update our compressed version of it (without having to store the original <script type="math/tex">x</script>).</p>
<p>Clearly compression introduces some loss making it impossible to recover certain properties of the original data from the compressed string.
However, if we know in advance which property of <script type="math/tex">x</script> we are interested in then efficient compression often becomes possible.
We will model the property of interest as a binary function <script type="math/tex">f:\{0,1\}^n \rightarrow \{-1,1\}</script> which labels all possible <script type="math/tex">x</script>’s with two labels.
So our goal will be to be able to: 1) perform this binary classification, i.e. compute <script type="math/tex">f(x)</script> using compressed data <script type="math/tex">y</script> only, 2) do this even if <script type="math/tex">x</script> changes over time – updates for us will be bit flips in the coordinates of <script type="math/tex">x</script> specified by the index of the bit that is getting flipped.</p>
<p>Finally, if <script type="math/tex">x</script> is so big that it can’t be stored locally and has to be divided into chunks stored across multiple machines then we will be able to compress the chunks locally and then combine them on a central server into a compressed version of the entire data – one simple round of MapReduce or whatever your favorite distributed framework is.</p>
<p>To make the above discussion less abstract let’s consider a machine learning application – evaluating a linear classifier over binary data.
Let’s say we have trained a linear classifier of the form <script type="math/tex">sign(\sum_{i = 1}^n w_i x_i - \theta)</script> where sign is the sign function.
Is it possible to compress <script type="math/tex">x</script> in such a way that we can still evaluate our classifier in the scenarios described above?
Turns out we can compress the input down to <script type="math/tex">O(\theta/m \log (\theta/m))</script> bits where <script type="math/tex">m</script> is a parameter of the linear classifier known as its margin. Moreover, no compression scheme can do better.</p>
<h1 id="introducing-the-binary-sketchman">Introducing the Binary Sketchman</h1>
<div align="center"><img alt="The Binary Sketchman" src="http://grigory.github.io/blog/pics/binary-sketchman-final.png" /> </div>
<p><br /></p>
<p>While the setting described above may seem quite challenging it can be handled through a framework of linear sketching.
In the binary case the interpretation of linear sketching is particularly simple as our binary sketchman is just going to compute <script type="math/tex">k</script> parities of the bits of <script type="math/tex">x</script>, say for <script type="math/tex">k=3</script>:</p>
<script type="math/tex; mode=display">x_4 \oplus x_2, \quad x_{42}, \quad x_{566} \oplus x_{610} \oplus x_{239} \oplus x_{57}.</script>
<p>In a matrix form this corresponds to computing <script type="math/tex">Mx</script> where <script type="math/tex">M</script> is a <script type="math/tex">k \times n</script> binary matrix and the operations are performed over <script type="math/tex">\mathbb F_2</script>.
Note that now our sketch easily satisfies all the requirements above since as <script type="math/tex">x</script> changes we can just update the corresponding parities. In the distributed case we can compute them locally and then add them up on a central server.</p>
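<p>Here is a minimal sketch of such a deterministic sketchman in Python, using the three parities from the example above. The class name <code>ParitySketch</code> and its methods are hypothetical; the point is only that a bit flip touches the parities containing that coordinate, and that sketches of different chunks combine by XOR.</p>

```python
class ParitySketch:
    """Linear sketch over F_2: each sketch bit is the parity of a fixed index set."""

    def __init__(self, rows):
        self.rows = [frozenset(r) for r in rows]  # the rows of the matrix M
        self.bits = [0] * len(self.rows)          # current value of Mx

    def flip(self, index):
        """Update the sketch after bit `index` of x is flipped."""
        for i, row in enumerate(self.rows):
            if index in row:
                self.bits[i] ^= 1

    def merge(self, other):
        """Add (over F_2) the sketch of another chunk built with the same rows."""
        if self.rows != other.rows:
            raise ValueError("sketches use different parities")
        self.bits = [a ^ b for a, b in zip(self.bits, other.bits)]


rows = [{4, 2}, {42}, {566, 610, 239, 57}]
machine_a, machine_b = ParitySketch(rows), ParitySketch(rows)
for i in (4, 42):
    machine_a.flip(i)
for i in (2, 57):
    machine_b.flip(i)
machine_a.merge(machine_b)
print(machine_a.bits)  # [0, 1, 1]
```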
<p>Unfortunately the power of a deterministic sketchman who just uses a fixed set of parities is quite limited and no such sketchman can compress even a simple linear classifier down to less than <script type="math/tex">n</script> bits.
In fact, even for the OR function <script type="math/tex">f = x_1 \vee x_2 \vee \dots \vee x_n</script> no deterministic sketch can have less than <script type="math/tex">n</script> bits.
So our binary sketchman will “<a href="http://www.cs.cmu.edu/~haeupler/15859F14/">unleash the power of randomization</a>” in his quest for a perfect sketch.
According to <a href="http://www.cs.cmu.edu/~haeupler/">Bernhard Haeupler</a> this can be quite dramatic and looks kind of like this:</p>
<div align="center"><img width="300px" alt="The power of randomness unleashed" src="http://www.cs.cmu.edu/~haeupler/15859F14/images/posternoinf.jpg" /> </div>
<p><br />
So our sketchman will instead pick the matrix <script type="math/tex">M</script> randomly while the rest is the same as before.
Now the OR function is easy to handle: pick a parity over a random subset of <script type="math/tex">\{1, \dots, n\}</script> where each coordinate is included with probability <script type="math/tex">1/2</script>.
If <script type="math/tex">OR(x) = 1</script> then this parity catches a non-zero coordinate of <script type="math/tex">x</script> with probability <script type="math/tex">1/2</script> and thus evaluates to <script type="math/tex">1</script> with probability at least <script type="math/tex">1/4</script>.
If <script type="math/tex">OR(x) = 0</script> then the parity never evaluates to <script type="math/tex">1</script> so we can distinguish the two cases with probability <script type="math/tex">1 - \delta</script> using <script type="math/tex">O(\log 1/\delta)</script> such parities.
This illustrates a more general idea – if <script type="math/tex">f</script> is a constant function on all but <script type="math/tex">m</script> different inputs then a sketch of size <script type="math/tex">O(\log m + \log 1/\delta)</script> suffices.</p>
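<p>The randomized sketch for OR can be simulated directly. In this minimal sketch the vector is given by its set of non-zero coordinates, and the function names, the seed (which plays the role of shared randomness) and the parameters are illustrative assumptions. The answer 0 is never wrong for the all-zeros input, and a non-zero input is missed only if every parity happens to come out 0.</p>

```python
import random


def or_sketch(support, n, num_parities, seed=0):
    """Parities of x over random subsets; `support` is the set of non-zero coordinates."""
    rng = random.Random(seed)                # shared randomness fixes the matrix M
    bits = []
    for _ in range(num_parities):
        subset = {i for i in range(n) if rng.random() < 0.5}
        bits.append(len(subset & support) % 2)
    return bits


def decide_or(bits):
    """Declare OR(x) = 1 iff at least one parity evaluates to 1."""
    return int(any(bits))


print(decide_or(or_sketch(set(), n=100, num_parities=40)))   # 0
print(decide_or(or_sketch({3, 17}, n=100, num_parities=40)))
```

<p>A real implementation would derive each subset from a short seed or hash function instead of materializing it, and would update the parities one bit flip at a time.</p>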
<p>Now for linear thresholds the high-level ideas behind this sketching process are as follows:
1) observe that any linear threshold function takes the same value on all but <script type="math/tex">n^{O(\theta/m)}</script> inputs,
2) apply the same argument as above to obtain a sketch of size <script type="math/tex">O(\theta/m \log n + \log 1/\delta)</script>.
The only thing missing in the above argument is that we still have dependence on <script type="math/tex">n</script>.
This can be avoided if we first hash the domain reducing its size down to <script type="math/tex">n' = poly(\theta/m)</script> which replaces <script type="math/tex">n</script> in the above calculations giving us <script type="math/tex">O(\theta/m \log \theta/m + \log 1/\delta)</script>.
While this compression method is quite simple the remarkable fact is that it can’t be improved.
Even for the simplest threshold function that corresponds to a threshold for the Hamming weight of <script type="math/tex">x</script>, i.e. <script type="math/tex">sign(\sum_{i = 1}^n x_i - k)</script>, any compression mechanism would require <script type="math/tex">\Omega(k \log k)</script> bits as follows from <a href="http://link.springer.com/chapter/10.1007/978-3-642-32512-0_44">this work</a> by Dasgupta, Kumar and Sivakumar.
Note that it isn’t assumed that the protocol is based on linear sketching – it can be an arbitrary scheme.</p>
<h1 id="the-power-of-randomized-binary-sketchman">The Power of Randomized Binary Sketchman</h1>
<p>Linear sketching by itself is not a new idea and has been studied extensively in the last two decades.
See surveys by <a href="http://researcher.watson.ibm.com/researcher/view.php?person=us-dpwoodru">Woodruff</a> and <a href="http://people.cs.umass.edu/~mcgregor/">McGregor</a> on how it can be applied to problems in <a href="http://researcher.ibm.com/files/us-dpwoodru/wNow3.pdf ">numerical linear algebra</a> and <a href="http://link.springer.com/referenceworkentry/10.1007/978-3-642-27848-8_796-1">graph compression</a>.
However, this work focuses on linear sketching over large finite fields (used to represent real values with bounded precision).
Nevertheless some striking results are known about linear sketching that are applicable in our context as well.
In particular, if <script type="math/tex">x</script> is updated through a very long (triply exponential in <script type="math/tex">n</script>) stream of adversarial updates then linear sketches over finite fields are optimal for any function <script type="math/tex">f</script> as shown by Li, Nguyen and Woodruff <a href="https://pdfs.semanticscholar.org/bf89/98d76741f3ee7b4ba1f82524353e7083c3b5.pdf ">here</a> in STOC’14.</p>
<p>As our paper shows, the same result holds for much shorter random streams of length <script type="math/tex">\tilde O(n)</script> in a simple model where each update flips a coordinate of <script type="math/tex">x</script> chosen uniformly at random.
In other words, binary sketching is optimal if at the end of the stream the input <script type="math/tex">x</script> is uniformly distributed.
The proof of this fact is quite technical and relies on a notion of <i>approximate Fourier dimension</i> for Boolean functions that we use to characterize binary sketching under the uniform distribution – check the paper for details if you are interested.
Whether the same result holds for short (length <script type="math/tex">\tilde O(n)</script>, say) adversarial streams is the main question left open.</p>
<p><a href="http://grigory.github.io/blog/the-binary-sketchman/">The Binary Sketchman</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on October 07, 2016.</p><![CDATA[Teaching Foundations of Data Science]]>http://grigory.github.io/blog/foundations-of-data-science-class2016-08-27T00:00:00+00:002016-08-27T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>This week I started teaching a graduate class called “<a href="http://grigory.us/data-science-class.html">Foundations of Data Science</a>” that will be mostly based on an eponymous book by <a href="https://en.wikipedia.org/wiki/Avrim_Blum ">Avrim Blum</a>, <a href="https://en.wikipedia.org/wiki/John_Hopcroft ">John Hopcroft</a> and <a href=" https://en.wikipedia.org/wiki/Ravindran_Kannan ">Ravi Kannan</a>.
The book is still a draft and I am using <a href="http://grigory.us/files/bhk-book.pdf">this version</a>.
Target audience includes advanced undergraduate and graduate level students.
We had some success using this book as a core material for an undergraduate class at Penn this Spring (<a href="http://www.thedp.com/article/2016/02/cis-399-students">link to the news article</a>).
The draft has been around for a while and in fact I ran a reading group that used it four years back when I was in grad school and the book was called
“Computer Science Theory for the Information Age”.</p>
<div align="center"><img width="200px" alt="Keep calm and dig foundations of Data Science" src="http://grigory.github.io/blog/pics/b609-poster-homepage.png" /> </div>
<p>“Data Science” is one of those buzzwords that can mean very different things to different people.
In particular, a new graduate <a href="http://www.soic.indiana.edu/graduate/degrees/data-science/index.html">Masters program in Data Science here at IU</a> attracts hundreds of students from diverse backgrounds.
What I personally really like about the Blum-Hopcroft-Kannan book is that it doesn’t go into any philosophy about the meaning of data science but rather offers a collection of mathematical tools and topics that can be considered as foundational for data science as seen from computer science perspective.
It should be noted that just as any “Foundations of Computing” class has little to do with finding bugs in your code, so do this class and book have little to do with data cleaning and other routine data analysis tasks.</p>
<h1>Topics</h1>
<p>While the jury is still out on what topics should be considered as fundamental for data science I think that the Blum-Hopcroft-Kannan book makes a good first step in this direction.</p>
<p>Let’s look at the table of contents:</p>
<ul>
<li>Chapter 2 introduces basic properties of the high-dimensional space, focusing on concentration of measure, properties of high-dimensional Gaussians and basic dimension reduction. </li>
<li>Chapter 3 covers the Singular Value Decomposition (SVD) and its applications (principal component analysis, clustering mixture of Gaussians, etc.).</li>
<li>Chapter 4 focuses on random graphs (primarily in the Erdos-Renyi model).</li>
<li>Chapter 5 introduces random walks and Markov chains, including Markov Chain Monte Carlo methods, random walks on graphs and applications such as PageRank.</li>
<li>Chapter 6 covers the very basics of machine learning theory, including learning basic function classes, perceptron algorithm, regularization, kernelization, support vector machines, VC-dimension bounds, boosting, stochastic gradient descent and a bunch of other topics. </li>
<li>Chapter 7 describes a couple of streaming and sampling methods for big data: frequency moments in streaming and matrix sampling.</li>
<li>Chapter 8 is about clustering methods: k-means, k-center, spectral clustering, cut-based clustering, etc.</li>
<li>Chapters 9 through 11 cover a very diverse set of topics that includes hidden Markov processes, graphical models, belief propagation, topic models, voting systems, compressed sensing, optimization methods and wavelets among others.</li>
</ul>
<h1>Discussion</h1>
<p>Overall this looks like a good stab at the subject and a big advantage of this book is that unlike some of its competitors it treats its topics with mathematical rigor.
The only chapter that I personally don’t see fitting into a “data science” class is Chapter 4. Because it focuses on the Erdos-Renyi model, which I haven’t seen used realistically in graph modeling applications, this chapter seems to be mostly of mathematical interest.</p>
<p>Selection of some of the smaller topics is a matter of personal taste, especially when it comes to those that are missing.
A few quick suggestions: cover new sketching algorithms for <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/wNow.pdf">high-dimensional linear regression</a>, <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality-sensitive hashing</a> and possibly <a href="http://groups.csail.mit.edu/netmit/sFFT/index.html">Sparse FFT</a>.</p>
<p>Slides will be posted <a href="http://grigory.us/data-science-class.html#lectures">here</a> and I will write a report on the final selection of topics and my experience at the end of the semester. Stay tuned :)</p>
<p><a href="http://grigory.github.io/blog/foundations-of-data-science-class/">Teaching Foundations of Data Science</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on August 27, 2016.</p><![CDATA[ESA'16 Deadline Approaching]]>http://grigory.github.io/blog/esa-20162016-04-18T00:00:00+00:002016-04-18T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>The deadline for submissions to <a href="http://conferences.au.dk/algo16/esa/">ESA’16</a> (24th European Symposium on Algorithms) is in 3 days.
As a PC member I would like to encourage you to submit your work and also plug the event and its location.</p>
<div align="center"><img alt="" src="http://grigory.github.io/blog/pics/esa16.png" /> </div>
<p><br /></p>
<p>This time the conference is a part of a broader symposium <a href="http://conferences.au.dk/algo16/home/">ALGO’16</a> which will take place in Aarhus, Denmark on August 22-26.
In the spirit of colocation <a href="http://grigory.github.io/blog/stoc-focs-proposal-colocate.html">previously advocated on this blog</a> this symposium brings together several conferences and workshops.
Most relevant to this blog are <a href="">ALGOCLOUD</a> (a new workshop on algorithms for cloud computing) and <a href="http://conferences.au.dk/algo16/massive/">MASSIVE</a> (a workshop on algorithms for massive data). A nice feature of MASSIVE is that it doesn’t have published proceedings. This means that contributions to the workshop can also be published in other conferences.</p>
<div align="center"><img alt="" src="http://grigory.github.io/blog/pics/algo16.png" /> </div>
<p><br /></p>
<p>Aarhus is definitely one of the most vibrant and forward-thinking centers for research in algorithms and theoretical computer science at large in Europe.
I was very lucky to visit the <a href="http://ctic.au.dk/">Center for the Theory of Interactive Computation</a> (CTIC) about 3 years ago.
This Sino-Danish center is a great example of a collaboration between Tsinghua University (the leading computer science institution in China) and its Western partners.</p>
<p>I really enjoyed spending a week at CTIC hosted by <a href="https://www.cs.swarthmore.edu/~brody/">Joshua</a> and <a href="http://web.mit.edu/matulef/www/">Kevin</a>.
Coincidentally a friendly soccer game between CTIC and MADALGO took place during my visit and I got drafted to play against algorithms folks.
MADALGO is another joint center (with MIT and MPI) and these guys clearly knew a better algorithm for soccer than we did.</p>
<p>MADALGO team:</p>
<div align="center"><img alt="" src="http://grigory.github.io/blog/pics/madalgo.jpg" /> </div>
<p><br /></p>
<p>CTIC team:</p>
<div align="center"><img alt="" src="http://grigory.github.io/blog/pics/ctic.jpg" /> </div>
<p><a href="http://grigory.github.io/blog/esa-2016/">ESA'16 Deadline Approaching</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on April 18, 2016.</p><![CDATA[The Simple Economics of Algorithms for Big Data]]>http://grigory.github.io/blog/the-simple-economics-of-algorithms-for-big-data2016-01-20T00:00:00+00:002016-01-20T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>
In this blog post I want to suggest a simple reason why you should study your algorithms <b>really</b> well if you want to design algorithms that deal with big data.
This reason comes from <b>the way billing for cloud services works</b>.
</p>
<p>
Maybe you remember taking that algorithms class and thinking: “Who really cares if that algorithm uses a bit more time? Can't we just wait a little longer?”.
Or “Ok, we can save some space here, but if it all fits into my RAM anyway then why bother?”.
These are both great reasons not to care too much about efficiency of your algorithms if your data is small, fits into RAM and the running times aren't significant enough to matter anyway.
So you would go on to program your favorite video game and not care about that professor talking about all that big-Oh nonsense.
And in the short run you would be right. While you are developing a prototype of your favorite video game you shouldn't care.
When I was working at a startup I learned the hard way that <a href="http://c2.com/cgi/wiki?PrematureOptimization ">premature optimization is the root of all evil</a>.
</p>
<div align="center"><img alt="abstruse-goose-video-games" src="http://grigory.github.io/blog/pics/abstruse-goose-video-games.png" /> </div>
<p><br /></p>
<p>
However, once your video game becomes successful and you get to deal with big data that has to be stored and processed in the cloud this reasoning starts to fall short.
Let's say you developed <a href="https://en.wikipedia.org/wiki/Candy_Crush_Saga">Candy Crush Saga</a> (<a href="http://www.standard.co.uk/business/business-news/candy-crush-saga-owner-king-digital-entertainment-valued-at-7bn-9216058.html">valued at $7bn in 2014</a>) and now you are interested in doing some data analytics about your >10 million active users.
You are now considering outsourcing your data storage and computation to the cloud.
Here is where you might want to learn why the design of space and time-efficient algorithms matters for the bottom line of your future business.
<h1>100x more efficient algorithms = 100x smaller bills</h1>
So that time and space your professor was talking about – what does it have to do with your spending on the cloud services?
The answer is surprisingly simple – <b>if you need 100x more time and space then your bill increases 100 times</b>.
Below I used the pricing calculator that comes with Google Compute Engine to see how the cost scales if I want to use 100/1000/10000 identical machines for a year.
<div align="center"><img alt="cloud-pricings" src="http://grigory.github.io/blog/pics/cloud-pricings.png" /> </div>
<br />
<p>
I was surprised to find this out myself since I expected some economies of scale to kick in. In fact, sometimes they do, but the effect is usually negligible. Say, you can get an X% discount but that doesn't help much against linear scaling.
</p>
</p>
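<p>To make the linear scaling concrete, here is a minimal sketch in Python. The hourly rate and the discount below are made-up numbers for illustration only, not actual Google Compute Engine prices, and <code>yearly_cost</code> is a hypothetical helper. The point is just that a flat percentage discount changes the constant but not the linear growth.</p>

```python
HOURS_PER_YEAR = 24 * 365


def yearly_cost(machines, hourly_rate, discount=0.0):
    """Cost of running `machines` identical machines for one year.

    `hourly_rate` is the price per machine-hour and `discount` is a flat
    fraction taken off the total (both are illustrative assumptions).
    """
    return machines * hourly_rate * HOURS_PER_YEAR * (1.0 - discount)


if __name__ == "__main__":
    rate = 0.05      # hypothetical price per machine-hour, in dollars
    discount = 0.2   # hypothetical volume discount
    for n in (100, 1000, 10000):
        print(n, "machines:", round(yearly_cost(n, rate, discount)), "dollars per year")
```

<p>With or without the discount, going from 100 to 10000 machines multiplies the yearly cost by 100.</p>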
<p><a href="http://grigory.github.io/blog/the-simple-economics-of-algorithms-for-big-data/">The Simple Economics of Algorithms for Big Data</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on January 20, 2016.</p><![CDATA[Teaching algorithms for Big Data]]>http://grigory.github.io/blog/teaching-algorithms-for-big-data2015-12-24T00:00:00+00:002015-12-24T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<!--<h1>Teaching “algorithms for Big Data”</h1>
-->
<p>“algorithms for Big Data” (the name can vary slightly) is a new graduate class that has been introduced by many top computer science programs in recent years.
In this post I would like to share my experience teaching this class at the University of Pennsylvania this semester. Here is the <a href="http://grigory.us/big-data-class.html">homepage</a>.</p>
<div align="center"><img alt="Keep calm and crunch data on o(N)" src="http://grigory.github.io/blog/pics/class-logo-large.png" /> </div>
<p><br /></p>
<p>First off, let me get the most frequently asked question out of the way and say that by “big data” in this class I mean data that doesn’t fit into local RAM,
since if the data fits into RAM then algorithms from the standard algorithms curricula will do the job.
At the moment a terabyte of data is already tricky to fit into RAM so this is where we will draw the line.
In particular, this is so that the <a href="http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html">arguments about beating algorithms for big data using your laptop</a> don’t apply.</p>
<p>Second, I tried to focus as much as possible on algorithms that are known to work in practice and have implementations.
Because this is a theory class we didn’t do programming but I made sure to give links to publicly available implementations whenever possible.
As is always the case, the best algorithms to teach are never exactly the same as the best implementations.
Even the most vanilla problem of sorting an array in RAM is handled in C++ STL via a combination of QuickSort, InsertionSort and HeapSort.
Picking the right level of abstraction is always a delicate decision to make when teaching algorithms and I am pretty happy with the set of choices made in this offering.</p>
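<p>For readers curious what such a hybrid looks like, here is a rough sketch in Python of the general idea (often called introsort): quicksort, with a recursion depth limit that falls back to heapsort, and insertion sort for small ranges. This is not the actual STL code; the cutoff of 16, the depth limit and all function names are assumptions made for illustration.</p>

```python
import heapq
from math import log2

SMALL = 16  # assumed cutoff below which insertion sort takes over


def _insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; fast on short ranges."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def _heap_sort(a, lo, hi):
    """Sort a[lo:hi] via a binary heap; O(n log n) in the worst case."""
    heap = a[lo:hi]
    heapq.heapify(heap)
    for i in range(lo, hi):
        a[i] = heapq.heappop(heap)


def _partition(a, lo, hi):
    """Lomuto partition of a[lo:hi] around a median-of-three pivot."""
    last = hi - 1
    mid = (lo + last) // 2
    m = sorted((lo, mid, last), key=lambda i: a[i])[1]
    a[m], a[last] = a[last], a[m]
    pivot = a[last]
    i = lo
    for j in range(lo, last):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[last] = a[last], a[i]
    return i


def _introsort(a, lo, hi, depth):
    while hi - lo > SMALL:
        if depth == 0:
            # Quicksort is degenerating: switch to heapsort.
            _heap_sort(a, lo, hi)
            return
        depth -= 1
        p = _partition(a, lo, hi)
        _introsort(a, p + 1, hi, depth)  # recurse on the right part
        hi = p                           # loop on the left part
    _insertion_sort(a, lo, hi)


def introsort(a):
    """Sort the list `a` in place and return it."""
    if len(a) > 1:
        _introsort(a, 0, len(a), 2 * int(log2(len(a))))
    return a
```

<p>None of the three ingredients is hard to teach on its own, but the way they are combined is an engineering decision rather than a textbook algorithm.</p>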
<p>Finally, “algorithms for Big Data” isn’t an entirely new phenomenon as a class since it builds on its predecessors
typically called “Sublinear Algorithms”, “Streaming Algorithms”, etc.
Here is a <a href="http://grigory.us/big-data-class.html#sketch">list of closely related classes offered at some other schools</a>.
In fact, my version of this class consisted of <a href="http://grigory.us/big-data-class.html#lectures">four modules</a>:</p>
<ul>
<li><b>Part 1: Streaming Algorithms.</b> It is very convenient to start with this topic since techniques developed in streaming turn out to be useful later. In fact, I could just as well call this part “linear sketching” since every streaming algorithm that I taught in this part was a linear sketch. I find single-pass streaming algorithms to be the best motivated, and for so-called dynamic streams, which can contain both insertions and deletions, linear sketches are known to be almost optimal under fairly mild conditions.
Moreover, linear sketches are the baseline solution in the more advanced massively parallel computational models studied later.
</li>
<li><b>Part 2: Selected Topics.</b> This part became very eclectic, containing selected topics in numerical linear algebra, convex optimization and compressed sensing.
In fact, some of the algorithms in this part aren't even “algorithms for Big Data” according to the RAM size based definition.
However, I considered these topics to be too important to skip in a “big data” class.
For example, right after we covered gradient descent methods for convex optimization Google released <a href="https://www.tensorflow.org/">TensorFlow</a>.
This state of the art machine learning library allows one to choose any of its <a href="https://www.tensorflow.org/versions/master/api_docs/python/train.html#optimizers">5 available versions</a> of gradient descent for optimizing learned models. These days, when you can run into some <a href="https://aws.amazon.com/machine-learning/pricing/">pretty steep pricing</a> for outsourcing your machine learning to the cloud, I think it is increasingly important to know what is under the hood of free publicly available frameworks.
</li>
<li><b>Part 3: Massively Parallel Computation.</b> I am clearly biased here, but this is my favorite. Unlike, say, streaming, where many results are already tight, we are still quite far from understanding the full computational power of MapReduce-like systems. I think the potential impact of such algorithms is also likely to be the highest. Because of time constraints I only touched the tip of the iceberg in this class. This part will be expanded in the future.</li>
<li><b>Part 4: Sublinear Time Algorithms.</b> I always liked clever sublinear time algorithms, but for many years believed that they were not quite “big data” since they operate under the assumption of random access to the data. Well, this year I had to change my mind after Google launched its <a href="https://code.google.com/codejam/distributed_index.html">Distributed Code Jam</a>.
I have to admit that I have no idea how this works on the systems level but apparently it is possible to implement reasonably fast random access to large data.
The problems that I have seen being used for Distributed Code Jam allow one to use 100 nodes each having small RAM. The goal is to process a large dataset available via random access.
</li>
</ul>
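<p>To illustrate what “linear sketch” means in Part 1, here is a minimal Python sketch of one well-known example, the Count-Min sketch (chosen here for brevity, not necessarily what was covered in class). The class name, parameters and hashing scheme are assumptions made for this illustration. The sketch is a linear function of the frequency vector, so deletions are just negative updates and sketches of two streams can be added together.</p>

```python
import random


class CountMinSketch:
    """Count-Min sketch over integer items (illustrative, not tuned)."""

    PRIME = (1 << 61) - 1  # modulus for the hash functions

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One hash function x -> ((a*x + b) mod PRIME) mod width per row.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, delta=1):
        """Process an insertion (delta > 0) or a deletion (delta < 0)."""
        for row in range(len(self.table)):
            self.table[row][self._bucket(row, item)] += delta

    def estimate(self, item):
        """Upper bound on the frequency of `item`, assuming all true
        frequencies are non-negative at query time."""
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(len(self.table)))

    def merge(self, other):
        """Add another sketch built with the same hash functions."""
        assert self.width == other.width and self.hashes == other.hashes
        for mine, theirs in zip(self.table, other.table):
            for j, value in enumerate(theirs):
                mine[j] += value
```

<p>Because merging is just entrywise addition, each machine can sketch its own part of the data and the sketches can be summed afterwards, which is why linear sketches serve as a baseline in the massively parallel models of Part 3.</p>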
<p>Overall parts 1 and 4 are by now fairly standard. Part 2 has some new content from <a href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/journal.pdf">David Woodruff’s great new survey</a>. Some algorithms from it are also available in IBM’s <a href="https://github.com/xdata-skylark/libskylark">Skylark library for fast computational linear algebra and machine learning</a>.
Part 3 is what sets this class apart from most similar classes.</p>
<h1>Mental Notes</h1>
<p>Here is a quick summary of things I was happy with in this offering + potential changes in the future.</p>
<ul>
<li><b>Research insights.</b> One of the main reasons why I love teaching is that it often leads to research insights, especially when it comes to simple connections I have been missing. For example, I didn't previously realize that one can use <a href="http://grigory.us/files/publications/BRY14-Lp-Testing.pdf">L<sub>p</sub>-testing</a> as a tool for testing assumptions about convexity and Lipschitzness used in the analysis of the convergence rate of gradient descent methods. </li>
<li><b>Project.</b> Overall I am very happy with the students' projects.
Some students implemented algorithms, some wrote surveys and some started new research projects.
Most unexpected to me were the projects done by non-theory students connecting their areas of expertise with the topics discussed in the class. E.g. surveys of streaming techniques used in natural language processing and bioinformatics were really fun to read.</li>
<li><b>Cross-list the class for other departments.</b> It was a serious blunder on my part to not cross-list this class for other departments, especially Statistics and Applied Math.
Given how much interest there is from other fields, this is probably the easiest mistake to fix and the most impactful one.
Somehow some students from other departments learned about the class anyway and expressed their interest, often too late.</li>
<li><b>New content.</b> Because of time constraints I couldn't fit in some of the topics I really wanted to cover.
These include coresets (there has been a resurgence of interest in coresets for massively parallel computing, but I didn't have time to cover it), nearest neighbor search (somehow I couldn't find a good source to teach from, suggestions are very welcome), the HyperLogLog algorithm (same reason), more algorithms for massively parallel computing (no time), more sublinear time algorithms (no time).
In the next version of this class I will make sure to cover at least some of these.
</li>
<li><b>Better structure.</b> Overall I am pretty happy with the structure of the class but there is definitely room for improvement. A priority will be to better incorporate selected topics discussed in Part 2 into the overall structure of the class. In particular, convex optimization came a little out of the blue even though I am really glad I included it.</li>
<li><b>Slides and equipment.</b> I really like teaching with slides that contain only some of the material and use the blackboard to fill in the missing details and pictures.
On one hand, slides are a backbone that the students can later use to catch up on the parts they missed. On the other hand, the risk of rushing through the slides too fast is minimized since the details are discussed on the board. Also a lot of time is saved on drawing pictures. I initially used a Microsoft Surface Pro 2 to fill in the gaps on the tablet instead of the board but later gave up on this idea because of technical difficulties. Having a larger tablet would help too. I still think that the tablet can work but requires a better setup. Next time I will try to use the tablet again and post the final slides online.
</li>
<li><b>Assign homework and get a TA.</b> Michael Kearns and I managed to teach “Computational Learning Theory” without a TA last semester so I decided against getting one for my class as well. This was fine except that having a TA for grading homework would have helped a lot.</li>
<li><b>Make lecture notes and maybe videos.</b> With fairly detailed slides I didn't consider lecture notes necessary. Next time it would be nice to have some since some of my fellow faculty friends asked for them. I think I will stick with the tested “a single scribe per lecture” approach although I heard that in France students sometimes collaboratively work on the same file during the lecture and the result comes out nice. When I had to scribe lectures I just LaTeXed them on the fly so I don't see why you can't do this collaboratively.
As for videos, Jelani had <a href="http://people.seas.harvard.edu/~minilek/cs229r/fall15/lec.html">videos</a> from his class this time and they look pretty good. </li>
<li><b>Consider MOOCing.</b> Given that the area is in high demand doing a MOOC in the future is definitely an option. It would be nice to stabilize the content first so that the startup cost of setting up a MOOC could be amortized by running it multiple times.</li>
</ul>
<h1>Thanks</h1>
<p>I am very grateful to my friends and colleagues, whose discussions helped me a lot while developing this class.
Thanks to Alex Andoni, Ken Clarkson, Sampath Kannan, Andrew McGregor, Jelani Nelson, Eric Price, Sofya Raskhodnikova, Ronitt Rubinfeld and David Woodruff (this is an incomplete list, sorry if I forgot to mention you). Special thanks to all the students who took the class and <a href="http://www.seas.upenn.edu/~sassadi/">Sepehr Assadi</a> who gave a guest lecture on our <a href="http://arxiv.org/pdf/1505.01467.pdf">joint paper about linear sketches of approximate matchings</a>.</p>
<p><a href="http://grigory.github.io/blog/teaching-algorithms-for-big-data/">Teaching algorithms for Big Data</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on December 24, 2015.</p><![CDATA[Slides and Videos from DIMACS]]>http://grigory.github.io/blog/dimacs-materials2015-10-29T00:00:00+00:002015-10-29T00:00:00+00:00Grigory Yaroslavtsevhttp://grigory.github.io/bloggrigory@grigory.us<p>Slides and videos from the DIMACS workshop “Big Data through the Lens of Sublinear Algorithms”
are now available (<a href="http://grigory.us/mpc-workshop-dimacs.html">link</a>).
In case you missed it, this was a great opportunity to catch up on the latest and hottest results in the field.
We were lucky to have a healthy mix of speakers from both academia and industry (represented by researchers from Microsoft, IBM, Google and Yahoo!). I was particularly excited to see talks on both traditional models for sublinear computation (streaming, property testing, etc.) and more recent ones (here my own favorites are MapReduce and other modern distributed models).</p>
<p>All keynotes, tutorials and regular talks were great. Among regular talks let me highlight two that were in some ways outliers:</p>
<ul>
<li>Vahab Mirrokni talked about problems and frameworks for large-scale data mining at Google Research NYC (<a href="https://www.youtube.com/watch?v=w7zc1OpN9gk&feature=youtu.be&list=PLqxsGMRlY6u659-OgCvs3xTLYZztJpEcW">video</a>). I really wish this could be a longer talk.</li>
<li>
Jelani Nelson from Harvard gave a quick tutorial on chaining (<a href="https://www.youtube.com/watch?v=6gfrr5VEbtc&feature=youtu.be&list=PLqxsGMRlY6u659-OgCvs3xTLYZztJpEcW">video</a>). From this tutorial you can also learn about applications of chaining to instance-dependent Johnson-Lindenstrauss dimensionality reduction using Gaussian mean width which I didn't know and found really cool. Jelani is organizing a workshop on related topics at Harvard that will take place on Jun 22–23 (after STOC). </li>
</ul>
<p>Kicking off 2016 is another <a href="http://www.cs.jhu.edu/~vova/sublinear2016/program.html">sublinear algorithms workshop</a> at Johns Hopkins University (Jan 7–9, right before SODA in Arlington, VA).</p>
<p><a href="http://grigory.github.io/blog/dimacs-materials/">Slides and Videos from DIMACS</a> was originally published by Grigory Yaroslavtsev at <a href="http://grigory.github.io/blog">The Big Data Theory</a> on October 29, 2015.</p>