Here at Altiscale, we have a diverse set of customers, from media companies to financial services firms to manufacturers. This in turn places a diverse set of requirements on the Altiscale Data Cloud, and we develop new solutions to address them. Because of our dedication to the Hadoop community, we contribute these innovations back to open source whenever we can. Some of our other contributions include DockerContainerExecutor (at one time the most-watched JIRA), the shell-script rewrite, GraphiteSink, KafkaSink, etc.
The old UI (left) and the newer WebHDFS UI (right)
To understand why this UI was a significant advance, we should look at how the old UI worked.
In the old UI, a client would contact the NameNode HTTP port and request to view a directory. The HTTP server would then create a brand-new DFSClient object, contact itself (no kidding), and render a JSP page, which was finally sent back to the client. The UI itself was also pretty dated, and it had at least one known XSS vulnerability (hat tip to Derek Dagit for discovering it).
In the new UI, the client requests to view a directory. A WebHDFS server implementation on the NameNode simply returns JSON data to the client. It is then the client’s responsibility to render the data in the user’s web browser. Here we see the advantages of using a well-defined REST service instead of a custom protocol. We also found the UI significantly more responsive and intuitive.
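The JSON returned by a WebHDFS LISTSTATUS call has a well-documented shape, which is what makes the client-side rendering simple. Here is a minimal Python sketch of what the browser-side script effectively does (the directory entries are made up; the `{"FileStatuses": {"FileStatus": [...]}}` structure matches the WebHDFS REST API):

```python
import json

# Abbreviated sample of a WebHDFS LISTSTATUS response body.
sample = json.loads("""
{
  "FileStatuses": {
    "FileStatus": [
      {"pathSuffix": "app-logs", "type": "DIRECTORY", "permission": "777",
       "owner": "yarn", "group": "hadoop", "replication": 0},
      {"pathSuffix": "data.txt", "type": "FILE", "permission": "644",
       "owner": "alice", "group": "hadoop", "replication": 3}
    ]
  }
}
""")

# The browser-side script just walks this structure to render the table.
entries = sample["FileStatuses"]["FileStatus"]
names = [e["pathSuffix"] for e in entries]
```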
The newer UI was well received, and we had been receiving requests to improve it even further. Although it was good for observing the state of HDFS, users still could not modify files or otherwise interact with HDFS. Some of the most commonly requested features included:
- Creating directories (mkdir)
- Changing permissions (chmod)
- Changing ownership (chown)
- Setting replication (setrep)
- Deleting files / directories (rm)
- Moving files / directories (mv)
- Pagination and sorting capabilities
We filed an umbrella JIRA for these improvements at HDFS-7588. We were able to work with the community and implement all these features in open source Hadoop.
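Under the hood, each of these commands maps onto a WebHDFS REST operation. A hedged Python sketch of the URLs involved (the host and paths are invented for illustration; the operation names MKDIRS, SETPERMISSION, SETOWNER, SETREPLICATION, DELETE, and RENAME come from the WebHDFS REST API):

```python
from urllib.parse import urlencode

def webhdfs_url(namenode, path, op, **params):
    """Build a WebHDFS request URL for the given operation and parameters."""
    query = urlencode(dict(params, op=op))
    return "http://%s/webhdfs/v1%s?%s" % (namenode, path, query)

nn = "namenode.example.com:50070"  # hypothetical NameNode HTTP address

mkdir_url  = webhdfs_url(nn, "/user/alice/new", "MKDIRS")                       # mkdir
chmod_url  = webhdfs_url(nn, "/user/alice/f", "SETPERMISSION", permission="755") # chmod
chown_url  = webhdfs_url(nn, "/user/alice/f", "SETOWNER",
                         owner="alice", group="hadoop")                          # chown
setrep_url = webhdfs_url(nn, "/user/alice/f", "SETREPLICATION", replication=2)   # setrep
rm_url     = webhdfs_url(nn, "/user/alice/old", "DELETE", recursive="true")      # rm
mv_url     = webhdfs_url(nn, "/user/alice/a", "RENAME",
                         destination="/user/alice/b")                            # mv
```

The browser-side scripts issue these requests with XMLHttpRequest and re-render the listing from the JSON that comes back.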
Here are some screenshots:
Changing permissions (left) and changing ownership (right)
Changing replication (left) and creating directories (right)
The obvious next step was to support file uploading. This turned out to be slightly more complicated than expected.
That tiny little thing called same-origin policy
The WebHDFS protocol has the following steps for accessing a file, when using a client like curl:
- The client first sends a request to the NameNode to read/write the file.
- The NameNode knows which DataNodes hold the first block of the file, and redirects the client to one of them with an HTTP 307 response.
- The client then sends exactly the same request to the datanode.
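The steps above can be sketched in Python (the hosts, ports, and paths are illustrative, not from a real cluster; a real client would use an HTTP library to actually issue the requests):

```python
def open_url(namenode, path):
    """Step 1: the client asks the NameNode to read the file."""
    return "http://%s/webhdfs/v1%s?op=OPEN" % (namenode, path)

def follow_redirect(status, location_header):
    """Steps 2-3: on a 307, the client re-sends exactly the same request
    to the DataNode address carried in the Location header."""
    if status == 307:
        return location_header
    raise ValueError("expected a 307 redirect from the NameNode")

url = open_url("namenode.example.com:50070", "/user/alice/data.txt")

# Suppose the NameNode answered with a redirect to a DataNode:
next_url = follow_redirect(
    307,
    "http://datanode1.example.com:50075/webhdfs/v1/user/alice/data.txt"
    "?op=OPEN&offset=0")
```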
The astute reader will realize that this does not work in web browsers, because they enforce the same-origin policy. A browser will not simply follow the redirect. Instead, it first sends a pre-flighted HTTP OPTIONS request (which carries the page’s origin). Only if the server implements CORS and replies with a 200 listing the allowed methods and so on does the browser send the original request. This protects the user of the web browser from leaking sensitive information.
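For reference, the pre-flight exchange looks roughly like this sketch (the server logic and the allowed-method set are invented for illustration; only the header names are the standard CORS ones):

```python
# Hypothetical server-side handling of a CORS pre-flight OPTIONS request.
ALLOWED_METHODS = {"GET", "PUT", "POST", "DELETE"}

def preflight_response(origin, requested_method):
    """Answer an OPTIONS pre-flight: a 200 plus the allowed methods if the
    server opts in to CORS; otherwise the browser never sends the real
    request."""
    if requested_method in ALLOWED_METHODS:
        return (200, {
            "Access-Control-Allow-Origin": origin,
            "Access-Control-Allow-Methods": ", ".join(sorted(ALLOWED_METHODS)),
        })
    return (403, {})

# The browser asks whether it may send a cross-origin PUT:
status, headers = preflight_response("http://namenode.example.com:50070", "PUT")
```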
However, in our case the problem is further complicated because our original request was an HTTP PUT issued via XMLHttpRequest. As it turns out, XMLHttpRequest is a living standard, and web browsers still differ in how they handle this case. In our testing, Mozilla Firefox (v35.0) did not send the pre-flighted request for an AJAX PUT; Google Chrome (v37.0) did. Here’s an illuminating discussion about the issue.
Our only recourse was to change the WebHDFS protocol itself: when an additional parameter (noredirect) is set on the request, the NameNode returns a 200 OK response (instead of a 307 redirect) and puts the DataNode location in the response body. Script on the browser side then creates a new XMLHttpRequest to that DataNode. This works well.
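The modified flow can be sketched as follows (hosts and paths are invented; the `{"Location": ...}` response body matches how the parameter, spelled `noredirect` in upstream Hadoop, behaves in WebHDFS):

```python
import json

# Step 1: the browser script sends the CREATE request with noredirect set,
# so the NameNode answers 200 instead of a 307 redirect.
create_url = ("http://namenode.example.com:50070/webhdfs/v1/user/alice/up.txt"
              "?op=CREATE&noredirect=true")

# Suppose the NameNode replied 200 OK with this JSON body:
namenode_reply = json.loads(
    '{"Location": "http://datanode1.example.com:50075/webhdfs/v1'
    '/user/alice/up.txt?op=CREATE"}')

# Step 2: the script issues a second XMLHttpRequest, PUTting the file
# bytes to the DataNode location from the body.
datanode_url = namenode_reply["Location"]
```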
We were able to contribute the features needed to make the HDFS web browser fully usable. HDFS users who prefer not to drop down to the command line can now easily and intuitively get their work done via this interface. For now, these features are slated to ship with Hadoop 3 (although at Altiscale we chose to backport them to our Hadoop 2.7 clusters). We’d love to hear from you if you find these features useful.
The idea was initially implemented by Travis Thompson and Howard Weingram as part of an internal hack day. From the Apache community, Haohui Mai and Allen Wittenauer were instrumental in getting this integrated. Nina Stawski and Dragana Mijalkovic helped with the front-end changes and UI design.