
The Open Data Platform: Uniting for an Enterprise-Class Hadoop Ecosystem

February 17th, 2015 | Big Data, Hadoop, Hadoop as a Service

This morning, a coalition of fourteen leading technology organizations announced the creation of the Open Data Platform (ODP), an industry association dedicated to accelerating the adoption of enterprise-class big data applications based on the Apache Hadoop ecosystem of solutions. We at Altiscale are proud to be part of this initiative.

One of my roles at Yahoo! was to champion and defend the company’s participation in the Apache Software Foundation (ASF) against in-house critics. There were Yahoo! business leaders who were concerned that we were enabling our competition, lawyers who were concerned that we were putting intellectual property at risk, and even engineers who felt that the “Open Source tax” – their term for the overhead that comes with the ASF governance model – was slowing down development. Fortunately, Yahoo! leadership was receptive to my arguments that the benefits of developing Hadoop, not just as Open Source, but specifically at the ASF, outweighed these costs.

In short, I have always believed that the ASF is the right home for Hadoop. The ASF provides critical governance and leadership for the development of new Hadoop capabilities. Due to the sustained efforts of the ASF, Hadoop is on its way to becoming an integral part of standard IT infrastructure, and many product and service companies have committed themselves to leveraging the Hadoop ecosystem to create value for their own customers.

However, Hadoop has always been dependent upon sophisticated test suites, testing infrastructure, and testing expertise that go beyond what can be done internally at the ASF. Even today, for example, the testing efforts of Yahoo! continue to be vital to the reliability of Hadoop. Yahoo! tests Hadoop using data sets and use cases it could never release publicly. The same is true for testing done by Hadoop vendors and other large Hadoop users. To date, this testing has been done in an uncoordinated way across different companies. This fragmentation of effort is inefficient, slowing the progress of Hadoop.

In addition, Hadoop today consists of a growing ecosystem of complementary projects that go well beyond the core Hadoop project itself. This ecosystem includes projects like Hive, Ambari, and Spark. The ASF focuses on the releases of individual projects: each project in the ecosystem has its own release schedule, independent of the others. There is no such thing as an ecosystem-wide release, yet most enterprises wish to leverage the capabilities of multiple components at the same time. As a result, each Hadoop technology vendor releases a different “basket” of component versions. For both internal development teams at enterprises and third-party solution vendors developing applications on top of Hadoop, the proliferation of “baskets” is a major problem, as the sketch below makes concrete.
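To see why the “basket” problem hurts, consider this small, purely illustrative sketch in Python. The component names are real Apache projects, but the version numbers are invented for illustration and do not correspond to any actual vendor distribution:

    # Hypothetical component-version "baskets" shipped by two Hadoop
    # distributions; the version numbers are invented for illustration.
    basket_a = {"hadoop": "2.4.0", "hive": "0.13.0", "ambari": "1.6.0", "spark": "1.0.2"}
    basket_b = {"hadoop": "2.6.0", "hive": "0.14.0", "ambari": "1.7.0", "spark": "1.2.0"}

    # Every component whose versions differ is a combination that an
    # application vendor must separately test and certify against.
    mismatches = {name: (basket_a[name], basket_b[name])
                  for name in basket_a if basket_a[name] != basket_b[name]}
    print(mismatches)  # four mismatches -> four extra certification targets

With N vendors each shipping their own basket, a solution vendor faces N distinct certification targets instead of one, and the testing burden multiplies accordingly.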

This is where the Open Data Platform (ODP) comes in. The ODP will define, test, and certify a standard “ODP Core” of compatible versions of select ASF projects. This core will provide a proven base against which the broader network of Hadoop-related companies can certify solutions. The ODP will also coordinate the testing efforts of its members around this core, consolidating what is now a very fragmented effort. In addition, the ODP will support community development and outreach activities, and will publish business-focused and technical papers that help clarify and accelerate the rollout of modern data architectures that leverage Hadoop.
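In spirit, certifying against a common core reduces to checking a distribution against one agreed-upon set of component versions. The following is a minimal sketch only, assuming a hypothetical core spec; the actual ODP Core contents, versions, and certification process are defined by the association:

    # A minimal sketch, assuming a hypothetical ODP Core spec; the real
    # core contents and versions are defined by the ODP, not shown here.
    ODP_CORE = {"hadoop": "2.6.0", "ambari": "1.7.0"}

    def conforms_to_core(distribution):
        """True if the distribution ships exactly the core component versions."""
        return all(distribution.get(name) == version
                   for name, version in ODP_CORE.items())

    vendor_distro = {"hadoop": "2.6.0", "ambari": "1.7.0", "hive": "0.14.0"}
    print(conforms_to_core(vendor_distro))  # True: one common base to certify against

The point of the design is that components outside the core can still vary by vendor, but every distribution shares the same proven base, so solutions certified against the core run anywhere the core does.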

We believe that this effort will have a dramatic impact in reducing R&D costs, improving interoperability, reducing customer confusion, and bringing the benefits of Hadoop to a broader range of customers than ever before. We’re excited to be part of this initiative.