Congratulations to Michael Stonebraker on winning the ACM Turing Award last week! Michael is recognized for his fundamental contributions to the concepts and practices underlying modern database systems. It is somewhat unfortunate though that the RDBMS community and the MapReduce crowd ended up being split apart after the 2010 CACM articles MapReduce: A Flexible Data Processing Tool (by Jeffrey Dean and Sanjay Ghemawat) and MapReduce and Parallel DBMSs: Friends or Foes (by Michael Stonebraker et al.).
Stonebraker’s criticism of MapReduce/Hadoop started back in 2008 with the post MapReduce: A major step backwards. It has changed only slightly over the last 7 years (see e.g. this). A good example is Michael’s talk at XLDB12, which I found fun and educational. Moreover, I tend to agree with most of what Michael says except when it comes to Hadoop (e.g. “Hadoop is right at the top of the Gartner group hype cycle”, M.S. 2012), because over the 7 years of his criticism Hadoop has become the most successful open source platform for general-purpose massively parallel computing. It probably won’t be surprising to see Dean and Ghemawat winning the Turing Award in the future for making massively parallel computing commonplace.
I believe that both in theory and in practice MapReduce and RDBMSs are apples and oranges in the big data universe. Both are crunching hundreds of petabytes of data these days. However, in my experience some people still think that there is a way to directly compare the two and determine a single winner. I heard about Stonebraker’s criticism of Hadoop so many times and in places so diverse (e.g. from one of my running club buddies on the Penn track as well as during my visit to Princeton in a conversation with one of the professors I highly respect) that I decided to write a blog post about it. I will try to “bite the bull by the horns” (expression courtesy of Ken Clarkson) and summarize the advantages of each paradigm from my experience both in practice and in theory.
Advantages of MapReduce
The MapReduce paradigm has emerged as a universal tool for a specific type of parallel computing. I would compare it with a magic hammer that in theory lets you do almost anything you might want. While a “Swiss army knife” RDBMS solution would certainly be more efficient for the specific tasks it has been designed for, the magic hammer of MapReduce works for almost any problem that can be parallelized.
- Universality. A big advantage of MapReduce is its universality. In particular, it offers a software engineer efficient low-level access to the data. This makes it possible to handle completely unstructured, messy data. For example, many graph algorithms can be easily implemented in MapReduce, while general-purpose databases don't play well with graphs. This is a well-known issue and also the reason why specialized graph databases such as Neo4j exist. While learning how to use MapReduce takes some time and programming experience, in my experience good software engineers can pick it up fairly quickly. This is why for the best companies such as Google, Facebook, etc. the learning curve and the cost of skillful engineers don't seem to be an issue.
- Customization. During the 10 years of its existence the base Hadoop layer has been extended by many different frameworks that run on top of it. Examples of such free frameworks are Spark (which greatly improved raw Hadoop efficiency), Apache Storm (streaming support) and others. In particular, most of Stonebraker's criticism regarding the inefficiency of Hadoop is no longer applicable because of these improvements. Once the inefficient higher-level Hadoop layers are replaced, only HDFS remains untouched. According to Stonebraker himself: "I don't have any problem with HDFS, it is a parallel file system <...> by all means go ahead and use it". Companies such as Google and Facebook run their own custom versions of Hadoop/MapReduce, and while most of the details are secret we routinely hear in the news about petabytes of user data being crunched daily in such systems.
- Support of your favorite programming language. With Hadoop Streaming one can use any programming language.
- Open source. Apache Hadoop is an easy-to-learn open source implementation of the MapReduce framework.
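To make the universality point concrete, here is a minimal sketch in plain Python (no Hadoop installation assumed, all names are my own illustrative choices) of a MapReduce-style job that computes vertex degrees of an undirected graph. It mimics the MapReduce contract — the mapper emits key-value pairs, a shuffle groups them by key, and the reducer aggregates — whereas in real Hadoop Streaming the mapper and reducer would be separate scripts reading tab-separated lines on stdin and writing them to stdout.

```python
import itertools

def mapper(lines):
    # For each undirected edge "u<TAB>v", emit both endpoints with count 1.
    for line in lines:
        u, v = line.strip().split("\t")
        yield (u, "1")
        yield (v, "1")

def reducer(pairs):
    # Input arrives grouped by key (vertex); sum the counts to get the degree.
    for vertex, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (vertex, sum(int(c) for _, c in group))

def run_locally(edge_lines):
    # Simulate Hadoop's shuffle phase by sorting mapper output by key.
    shuffled = sorted(mapper(edge_lines))
    return dict(reducer(shuffled))

if __name__ == "__main__":
    triangle = ["a\tb", "a\tc", "b\tc"]
    print(run_locally(triangle))  # {'a': 2, 'b': 2, 'c': 2}
```

Note that nothing here depends on the data being relational: the same mapper/reducer pattern extends to messy logs, text, or other graph computations, which is exactly the flexibility argument above.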
Advantages of Parallel RDBMSs
Database management system technology has been perfected over more than 40 years, becoming a “Swiss army knife”-type solution for big data management.
- Efficient processing of typical queries on relational and some other types of data. For relational data the efficiency of parallel RDBMSs is outstanding. I am not aware of successful attempts to beat the performance of RDBMSs on their home turf using general-purpose frameworks for massively parallel computing (e.g. by using Hive on Hadoop, discussed below). Moreover, specialized database systems also exist for other types of structured data such as graphs (e.g. Neo4j), sparse arrays, etc. Just like with a Swiss Army knife, if a certain application can be handled directly by an RDBMS then it is probably handled pretty well, and most common use cases are well covered.
- Simplicity. While this is clearly subjective and might change over time, currently the learning curve for MapReduce users seems to be much steeper than for those who use an RDBMS. Simplicity also means that it costs less to employ data analysts who can work with RDBMSs.
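To illustrate the simplicity point, here is a hedged sketch of the declarative style: computing vertex degrees of a small graph with a single SQL aggregation. I use Python's built-in sqlite3 purely as a single-node stand-in for illustration; a parallel RDBMS would accept essentially the same SQL while handling distribution for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (u TEXT, v TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("a", "c"), ("b", "c")])

# Degree of each vertex: list both endpoints of every edge, then GROUP BY.
rows = conn.execute("""
    SELECT x, COUNT(*) FROM
      (SELECT u AS x FROM edges UNION ALL SELECT v AS x FROM edges)
    GROUP BY x
""").fetchall()
print(dict(rows))  # {'a': 2, 'b': 2, 'c': 2}
```

The entire computation is one query: no explicit mappers, reducers, or shuffle logic, which is a fair summary of why the learning curve for analysts is gentler here.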
As a theorist I am very excited about the fact that the performance of MapReduce-style systems can be systematically analyzed using a rigorous theoretical framework. See my earlier blog post for the details of the formal theoretical model for MapReduce.
It is very exciting to see MapReduce-style algorithms making it into advanced algorithms classes focused on dealing with big data at many top schools. Some examples that I am aware of are:
- “Dealing with Massive Data” by Sergei Vassilvitskii at Columbia.
- “Algorithms for Big Data” by Jelani Nelson at Harvard.
- “Algorithms for Modern Data Models” by Ashish Goel at Stanford.
- “Models of Computation for Massive Data” by Jeff Phillips at the University of Utah.
Moreover, there are clean, hard open problems raised by the MapReduce model whose resolution would have strong implications for the rest of theoretical computer science, including such fundamental parts as circuit complexity, communication complexity and approximation algorithms, as well as more modern areas such as streaming algorithms. For example, a notoriously hard question (see details here) is: "Can sparse undirected graph connectivity be solved in o(log |V|) rounds of MapReduce? Hint: Probably, no." Resolving this kind of open question would not only surprise practitioners but might also win you a best paper award at one of the top theory conferences (and most likely not because of MapReduce itself but because of the other deep consequences such a result would have). On the other hand, I am unaware of open questions in databases with the same level of appeal to the theoretical community.
The flagship theory conference STOC 2015, together with the 27th ACM Symposium on Parallelism in Algorithms and Architectures (colocated at FCRC 2015), will host a one-day workshop, "Algorithmic Frontiers of Modern Massively Parallel Computation", focused on the theoretical foundations of MapReduce-style systems and directions for future research, which I am co-organizing together with Ashish Goel and Sergei Vassilvitskii. I will post the details later, so stay tuned if you are interested.
While there is room for apples-and-oranges hybrids, none of them seem to have been successful so far. It seems to be common sense that SQL-on-Hadoop, just like an apple-on-orange, is not a great idea in terms of performance. The limited success of attempts such as Hive on Hadoop seems to confirm this so far. Using low-level programming languages such as C++ with RDBMSs is also possible (see e.g. SQLAPI). However, the advantages of RDBMSs described above most likely vanish if you do so.