About a month ago, we had a post highlighting some of the work our team has been doing around the Docker open source project, an ambitious program designed to enable applications to seamlessly run on any platform. The focus of that post, and within the Hadoop development community, has been on the Hadoop side of the equation: How do we change YARN to accommodate Docker? However, a good portion of our work has involved contributing changes to Docker so it will fit better into YARN. In this post we’d like to highlight that side of the equation, by describing our work to bring user namespaces into Docker.

Why go to all of this effort? Docker has the potential to provide the next generation of virtualization at an efficiency unmatched by today’s hypervisor model. The integration of Hadoop YARN with Docker will allow multiple clusters to utilize the same hardware resources. Our work with Docker to provide a rational security model is key to the progress of Hadoop in these environments. We’re not the only ones who think this way. As I write, our CEO Raymie Stata is on stage making a cameo appearance during the keynote by Hortonworks founder Arun Murthy at this year’s Hadoop Summit.

Linux Containers are built on a collection of kernel mechanisms that create isolated views of the resources managed by the host operating system. The best known of these are “control groups” (cgroups), which carve out isolated portions of host CPU, memory and IO resources for the container. Other such mechanisms include network namespaces, which define private network devices for the container’s use, and PID namespaces, which isolate the processes running in different containers so that they cannot “see” or directly interact with each other.
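To make the idea of namespaces a little more concrete: on a Linux host, the namespaces a process belongs to are visible as symbolic links under `/proc/<pid>/ns`, and two processes share a namespace when the corresponding links point to the same target. The snippet below is a minimal sketch of how one might inspect this. The helper names `namespace_ids` and `share_namespace` are ours, purely for illustration, and the code simply returns an empty result on systems without `/proc`.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_name: link_target} for a process.

    Reads the symlinks under /proc/<pid>/ns. Returns an empty dict if the
    directory is missing or unreadable (for example, on a non-Linux host).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids
    for name in sorted(names):
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Skip entries we are not permitted to read.
            continue
    return ids


def share_namespace(pid_a, pid_b, kind):
    """True if both processes are in the same namespace of the given kind."""
    a = namespace_ids(pid_a).get(kind)
    b = namespace_ids(pid_b).get(kind)
    return a is not None and a == b


if __name__ == "__main__":
    for name, target in namespace_ids().items():
        print(name, "->", target)
```

Comparing the output for a process on the host with one inside a container would typically show different targets for the namespaces that the container isolates.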

Linux Containers—and the underlying kernel mechanisms on which they are built—are very general: Each resource managed by Linux is isolated by a separate mechanism.  While this is generally powerful, it’s also complicated.  One of the defining principles of the Docker project is to hide that complexity by choosing a standard configuration for each mechanism that is appropriate for most applications.  We believe that this simplification of Linux Containers best explains the enthusiastic response Docker has received—and is what makes Docker such a good technology to integrate into YARN.

Most of the isolation mechanisms in Linux have been in place since 2007. However, isolation of user IDs (UIDs) requires a namespace with a complex interface to accurately represent how containerized artifacts that carry UIDs (such as users, files and processes) are treated on the host. It took the Linux community a long time to wrestle this complex problem to the ground, and the last bits of UID namespace isolation made it into the kernel only a couple of months ago, years after the containers effort got underway. As a result, Docker supports isolation of all resources except UIDs.

UID namespaces are critical to restricting access to the resources of a host.  In the absence of support for UID namespaces, processes in a Docker container running with root privileges can compromise the security of the host or other containers on the host.  We at Altiscale view this lack of support for UID namespaces as a critical limitation of Docker in the context of YARN.  So we’ve decided to help integrate UID isolation into Docker (https://github.com/dotcloud/docker/pull/4572)—for the benefit of both the Docker and the YARN communities.

This integration has been challenging. As mentioned above, the UID namespace is one of the most complex of Linux’s isolation mechanisms. While it is very powerful, this complexity is at odds with Docker’s philosophy of simplicity. In line with Docker’s spirit of simplicity over generality, our initial patch to Docker only supports root-user remapping, with virtual UID 0 in the container automatically mapped to a well-known user on the host. All other UIDs on the host are mapped one-to-one within the container, so that any files or directories mounted into the container from either the host or other containers appear with consistent UIDs. For security, the host’s root is left unmapped.
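The remapping scheme can be sketched as a small set of ranges in the `inside-ID outside-ID length` form that the kernel accepts in `/proc/<pid>/uid_map`. This is only an illustration under stated assumptions, not the code from our patch: the function names, the example host UID of 1000 and the upper bound of 65536 are all hypothetical. The sketch also assumes that the container UID numerically equal to the well-known host UID is left unmapped, because that host UID is already taken by the container’s root.

```python
def build_uid_map(root_host_uid, uid_limit=65536):
    """Return (container_start, host_start, length) ranges for root remapping.

    Container UID 0 maps to root_host_uid (the "well-known" unprivileged
    host user). Every other host UID maps to itself, except host UID 0,
    which is deliberately left unmapped.
    """
    if not 0 < root_host_uid < uid_limit:
        raise ValueError("root_host_uid must be a non-root UID below uid_limit")
    ranges = [(0, root_host_uid, 1)]  # container root -> well-known host user
    if root_host_uid > 1:
        # Identity mapping for UIDs 1 .. root_host_uid - 1.
        ranges.append((1, 1, root_host_uid - 1))
    if root_host_uid + 1 < uid_limit:
        # Identity mapping for UIDs above the well-known user.
        start = root_host_uid + 1
        ranges.append((start, start, uid_limit - start))
    return ranges


def to_host_uid(container_uid, ranges):
    """Translate a container UID to a host UID, or None if it is unmapped."""
    for container_start, host_start, length in ranges:
        if container_start <= container_uid < container_start + length:
            return host_start + (container_uid - container_start)
    return None


def format_uid_map(ranges):
    """Render the ranges one per line, as in a uid_map file."""
    return "\n".join(f"{c} {h} {n}" for c, h, n in ranges)


if __name__ == "__main__":
    print(format_uid_map(build_uid_map(1000)))
```

With this layout, no container UID ever translates to host UID 0, which is the property the last sentence of the paragraph above relies on.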

UIDs must also be addressed in other areas of Docker. A powerful feature of Docker is the ability to distribute pre-defined container images from the Docker registry, a central repository of images. An image is a collection of files, and files need to be owned by users. What should we do about files owned by root (UID 0)? Our solution is that an image in the Docker registry is always stored in its identity mapping (UID x mapped to UID x), but when it is downloaded for the first time, the files with UID 0 are translated to the “well-known (unprivileged) user” to which a Docker container’s root user is mapped on that host. Similarly, the image is reverse-translated to its identity mapping before being pushed to the registry.
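The translation on pull and the reverse translation on push can be illustrated with a toy model in which an image is just a dictionary from file path to owning UID. This is a sketch under assumptions rather than the logic in our patch: the function names and the example UID 1000 are hypothetical, and real image layers carry far more metadata than an owner. The round trip is only lossless if no file in the registry copy is already owned by the well-known UID, which we assume here.

```python
def translate_on_pull(file_uids, root_host_uid):
    """Registry (identity) ownership -> host ownership.

    Files owned by UID 0 in the image become owned by the well-known
    unprivileged host user; all other owners are left unchanged.
    """
    return {
        path: (root_host_uid if uid == 0 else uid)
        for path, uid in file_uids.items()
    }


def translate_on_push(file_uids, root_host_uid):
    """Host ownership -> registry (identity) ownership.

    Files owned by the well-known host user go back to UID 0, so the
    image stored in the registry does not depend on any one host.
    """
    return {
        path: (0 if uid == root_host_uid else uid)
        for path, uid in file_uids.items()
    }


if __name__ == "__main__":
    image = {"/bin/sh": 0, "/etc/passwd": 0, "/var/www/index.html": 33}
    on_host = translate_on_pull(image, root_host_uid=1000)
    print(on_host)
    print(translate_on_push(on_host, root_host_uid=1000) == image)
```

Because the registry copy always uses the identity mapping, two hosts that choose different well-known users can still share the same image.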

Stepping back, the larger point here is that off-the-shelf Docker is a good fit, but not a perfect fit, for YARN.  A major step towards perfection is integration of UID namespace isolation into Docker, which is underway.  But Docker needs several other features to be a perfect fit for YARN.  We are actively working through the design of those features, and have published a wiki page to track the progress to date.  Lots of work remains to be done, so we encourage all interested parties to jump in!