Apache Yetus: Faster, More Reliable Software Development

December 17th, 2015 | Analytics, Big Data, Hadoop

What makes good software? There is no shortage of books, papers, and opinions discussing the topic, and they often agree that two key characteristics are correctness and maintainability. To help teams develop correct and maintainable software, the Apache Software Foundation (ASF) has released the first version of a new, top-level project—Apache Yetus—that’s generating quite a bit of excitement within various Apache communities.

What is Apache Yetus?

So what, exactly, is Apache Yetus? As described by the ASF, Apache Yetus is “a collection of libraries and tools that enable contributions and release processes for software projects.” Apache Yetus provides:

  • a system for checking contributions to ensure that they meet community-accepted requirements.
  • the ability to document a well-defined, supported interface for downstream projects.
  • tools that help release managers to generate release documentation based on information provided by community issue trackers and source repositories.
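To make the last item more concrete, here is a minimal Python sketch of the general idea: take issue records (as might be exported from an issue tracker) and group them into plain-text release notes. It is an illustration only, not Apache Yetus's own tooling; the function name build_changelog, the record fields, and the EXAMPLE-n issue keys are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_changelog(version: str, issues: List[Dict[str, str]]) -> str:
    """Group issue records by type and render them as plain-text release notes.

    Each issue is assumed (hypothetically) to be a dict with the keys
    "key", "type", and "summary".
    """
    by_type = defaultdict(list)
    for issue in issues:
        by_type[issue["type"]].append(issue)

    lines = [f"Release {version}"]
    for issue_type in sorted(by_type):
        lines.append("")
        lines.append(f"{issue_type}:")
        # Plain string sort on the key; a real tool would likely sort numerically.
        for issue in sorted(by_type[issue_type], key=lambda i: i["key"]):
            lines.append(f"  {issue['key']}: {issue['summary']}")
    return "\n".join(lines)


# Made-up records standing in for data pulled from an issue tracker.
SAMPLE_ISSUES = [
    {"key": "EXAMPLE-2", "type": "Bug", "summary": "Fix crash"},
    {"key": "EXAMPLE-1", "type": "Improvement", "summary": "Faster startup"},
]

if __name__ == "__main__":
    print(build_changelog("0.1.0", SAMPLE_ISSUES))
```

The point of the sketch is that the release manager does not write the notes by hand: they are derived from data the community already maintains.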

The Status Quo: Inconsistent Patch Management

But doesn’t the ASF already have a way to manage contribution and release processes? What about Apache Hadoop’s automated patch testing capabilities?

To understand how Apache Yetus came to be and why we need it, let’s take a look at the history of Hadoop’s patch testing capabilities and how they’re being used today. In 2008, Hadoop committers (the humans involved with actually adding code to the code base) were grappling with challenges (of the good kind!) resulting from an increasingly large and extremely excited contributor community. To help maintain consistency over a large and disconnected set of committers, automated patch testing was added to Hadoop’s development process.

This automated patch testing (now included as part of Apache Yetus) works as follows: when a patch is uploaded to the bug tracking system, an automated process downloads the patch, performs some static analysis, and runs the unit tests. The results are posted back to the bug tracker, and alerts notify interested parties about the state of the patch. This lets committers concentrate on the ideas represented in the patch rather than on details such as whether it contains tabs. And because contributors are alerted right away, they can iterate on the patch independently.
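The flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Apache Yetus itself is implemented: the names CheckResult, check_tabs, run_checks, and format_report are hypothetical, the patch is a hard-coded string rather than a download from a bug tracker, and the only check is the tab example from the text. A real pipeline would also build the code and run the unit tests.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def added_lines(patch_text: str) -> List[str]:
    """Return the lines a unified diff adds, without the leading '+'.

    Simplification: any line starting with '+++' is treated as a file header.
    """
    return [
        line[1:]
        for line in patch_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def check_tabs(patch_text: str) -> CheckResult:
    """Hypothetical static check: fail if any added line contains a tab."""
    offenders = [line for line in added_lines(patch_text) if "\t" in line]
    detail = f"{len(offenders)} added line(s) contain tabs"
    return CheckResult("tabs", not offenders, detail)


def run_checks(
    patch_text: str, checks: List[Callable[[str], CheckResult]]
) -> List[CheckResult]:
    """Run every check against the patch and collect the results."""
    return [check(patch_text) for check in checks]


def format_report(results: List[CheckResult]) -> str:
    """Render the results as the text of a comment for the bug tracker."""
    overall = "+1 overall" if all(r.passed for r in results) else "-1 overall"
    rows = [f"{'+1' if r.passed else '-1'} {r.name}: {r.detail}" for r in results]
    return "\n".join([overall] + rows)


# A made-up patch whose single added line is indented with a tab.
SAMPLE_PATCH = "\n".join([
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -1,2 +1,3 @@",
    " class Foo {",
    "+\tint x;",
    " }",
])

if __name__ == "__main__":
    print(format_report(run_checks(SAMPLE_PATCH, [check_tabs])))
```

Because each check is just a function from patch text to a result, new checks can be added to the list without touching the reporting code, which is the property that lets contributors get fast, uniform feedback.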

The end result is a huge time savings, so much so that other projects within the Apache community have forked the code and are running their own versions of it, sometimes with changes and bug fixes.

Great! Well, sort of. Those enhancements and bug fixes have rarely propagated to the other forks of the code. As a result, one Apache project’s patch tester may behave radically differently from another project’s patch tester. This is not so good and has, unfortunately, been the status quo for a very long time.

A Better Approach: Apache Yetus

This year, Altiscale and several of our ODPi affiliates organized the first-ever Hadoop Bug Bash with the goal of reducing the number of open JIRA issues with patches attached. Various behind-the-scenes developments helped in this effort, including a massive rewrite of the patch testing facility used in Hadoop. This rewrite resulted in faster patch testing times, significantly more test coverage, and greater patch reliability.

These improvements didn’t go unnoticed by the other Apache projects. So, the obvious question became: Why not attempt to re-merge all of these great ideas, code fixes, etc., into a single code base that all ASF projects could use? Test coverage is always a challenge, and because ASF projects share infrastructure, reducing testing time would benefit the entire ASF. Plus, consistent patch testing across projects could only help to increase contributions. In fact, why not share this work in a much more visible way so that non-ASF projects could benefit as well?

Why not, indeed? And so, after many long hours of work and with the addition of other pieces from Apache Hadoop, including a changelog and release notes generator, a bash documentation generator, and interface annotations for Java, the very first release of Apache Yetus is here.

More on Apache Yetus…

The Apache Yetus project addresses much more than the patch testing we’ve discussed here. Stay tuned for Part 2 of this blog series, where we’ll delve into additional aspects of the Apache Yetus feature set.