CIS 700: algorithms for Big Data


  • What? This class will give you a biased sample of techniques for scalable data anslysis. Target audience are students interested in algorithms, statistics, machine learning, data mining and related areas.
  • Who? Grigory Yaroslavtsev.
  • When? Fall 2015, MW 10:30 – 12:00
  • Where? Towne 307.
  • Need permission? Please, send me an e-mail with relevant coursework you have taken. In most cases permission will be granted.
  • Office hours? By appointment.
  • Prerequisites? There are no formal prerequisites, but you should be familiar with the basics of algorithm design and analysis, discrete mathematics, probability and have general mathematical maturity.
  • What's next? Consider starting a research project using techniques you learned in this class. You can also try some open problems:


Increasingly large datasets are being collected by governments, private companies and research institutions. This motivates increased interest in the design and analysis of algorithms for rigorous analysis of such data. In this class we will consider algorithms for scenarios when the size of the data is too large to fit into the main memory of a single machine. Two main paradigms of computation that we will focus on are massively parallel computation (applicable to frameworks such as Yahoo!'s Hadoop, Google's MapReduce, Apache Spark and Microsoft's Dryad) and streaming algorithms (Apache Storm and Spark Streaming).

This class covers a rapidly evolving area. There is no textbook. Some related classes at other universities:

* Usually there is an abstract here.


This class will focus on fundamental principles of algorithm design for big data processing. Most of these principles are very robust to the choice of a specific platform, system or computational model. Thus, the lines drawn between different parts of this class are sometimes blurry and only serve as an approximate guideline.


Deadline: December 18, 2015. Details about the project: [Slides (pptx, pdf)].