Last week, the Open Data Platform (ODP) was announced and written about in the technology and business press. While Altiscale has received strong support for its participation in the ODP, we have also received many questions about why the ODP is necessary and the exact role that it would serve. In this blog post, we want to explain more deeply the problem that the ODP is out to solve. In a subsequent post we’ll discuss why we think that a new organization complementary to the Apache Software Foundation (ASF) is the best way to solve this problem.
“Hadoop” is not a single thing. It’s an ecosystem.
To better understand the challenge that the ODP is addressing, it's helpful to look at the following diagram from Hortonworks, which illustrates the contents of its distribution of the Hadoop ecosystem, the Hortonworks Data Platform (HDP).
For now, focus on the labels along the horizontal axis. Here you’ll notice the various components that make up this Hadoop distribution. The distribution includes core Hadoop itself (labeled “Hadoop & YARN”), plus many other components like Hive, Spark, and Kafka – an ecosystem of twenty components meant to interoperate with one another. Most consumers of “Hadoop” use a large cross-section of this ecosystem.
Within the ASF, each of these components is managed as what the ASF calls a "project." A project consists of a global community of developers working together. Some are paid by distribution providers like Cloudera and Hortonworks, others by consumers of Hadoop like Facebook and Yahoo!, and still others are independent volunteers looking to do some interesting programming. The ASF does an amazing job of coordinating the work of this diverse community of developers. However, it's important to note that each project at the ASF is governed independently, releasing updates on its own schedule and against its own set of priorities.
This ecosystem needs to be released.
Since each ASF project is governed independently, somebody needs to take responsibility for releasing the ecosystem as a whole. That is, they take responsibility for ensuring that all of these components will work well together. Historically, that has been the role of distribution providers like Cloudera and Hortonworks.
Looking back at the vertical axis of the diagram, you see three different versions of Hortonworks’ distribution of Hadoop: HDP 2.0, 2.1, and 2.2. Inside the chart itself, you see which version of a component is included in which version of the distribution. For example, HDP 2.0 includes version 2.2.0 of Hadoop and version 0.12.0 of Hive, while HDP 2.2 includes version 2.6.0 of Hadoop and version 0.14.0 of Hive.
Note that the component version numbers from one release of HDP to another don't change uniformly. For example, from HDP 2.0 to 2.1, Apache Hadoop increased by two minor version numbers (from 2.2.0 to 2.4.0), while Apache Sqoop didn't change at all. Note also that new components are introduced regularly. For example, Tez was introduced in HDP 2.1, and Spark in HDP 2.2.
In short, Hortonworks is making critical decisions about which basket of projects, at which level of maturity, to include into each release of the HDP platform. The other distribution providers, like Cloudera, Pivotal, and IBM, are facing the same decision points but are making different decisions. Whereas Hortonworks used Hadoop versions 2.4.0 and 2.6.0 in its most recent releases, Cloudera used 2.3.0 and 2.5.0.
So what is the problem?
Let's say you're the internal applications team of an enterprise, or an independent software vendor, and you're writing an application that runs on top of just two of these components: core Hadoop and Hive. You need to test that your solution works reliably, and then get it certified against the Hadoop distributions that your customers are using.
To certify on Hortonworks, you need to test against at least two combinations: Hadoop 2.2.0 with Hive 0.12.0, and Hadoop 2.4.0 with Hive 0.13.0. For Cloudera's CDH 5.1, the combination is Hadoop 2.3.0 and Hive 0.12.0, while for CDH 5.2 it is Hadoop 2.5.0 and Hive 0.13.1. For Pivotal and IBM you'd be looking at slightly different combinations again.
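To make the scale of this testing concrete, here is a minimal sketch in Python of the certification matrix such a team faces. The version numbers come from the releases cited above; the data structure itself is purely illustrative (real distributions bundle roughly twenty components, not two):

```python
# (Hadoop, Hive) version pairs shipped by each distribution release
# mentioned above. Illustrative sketch only.
releases = {
    "HDP 2.0": ("2.2.0", "0.12.0"),
    "HDP 2.1": ("2.4.0", "0.13.0"),
    "CDH 5.1": ("2.3.0", "0.12.0"),
    "CDH 5.2": ("2.5.0", "0.13.1"),
}

# Every release ships a distinct pair, so even an application that
# touches only two components must be certified against four
# separate combinations -- and the matrix grows with every new
# release and every additional component the application uses.
combinations = set(releases.values())
print(len(combinations))  # 4
```

Add Pivotal's and IBM's releases, and a third or fourth component like HBase or Spark, and the number of combinations to certify multiplies quickly.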
This proliferation of baskets creates significant drag when it comes to building reliable applications on top of Hadoop. All of this shifting complexity, multiplied across many, many customers, makes it harder for customers to assess which basket of Hadoop they need, and harder for application developers to create solutions that work broadly.
Let’s work together for stability and innovation.
This proliferation of baskets doesn’t serve customers well, since it slows down solution selection and innovation. The Open Data Platform seeks to address this problem. The ODP is a group of leading organizations that have come together to define a coordinated release of multiple components in a single, well-defined, and agreed-upon core platform. It will provide a reliable, tested base of core functionality upon which applications can thrive and spread. And since it will be providing core functionality only, it will leave plenty of room open for an expanding universe of feature creativity and variety.
Application developers like SAS are excited about the ODP because it will provide the reliability, interoperability, and stability that they need to go further, faster. Hadoop technology providers like Pivotal are enthusiastic because it will allow them to invest more deeply in their differentiated components (for example, Pivotal's HAWQ query processor). The result will be even greater adoption of Hadoop, particularly in the enterprise, and even greater application depth and breadth.
