At Altiscale, we’re constantly working to enhance the customer experience on our data cloud, while also staying committed to Hadoop’s open-source DNA. We smooth out the rough edges of Hadoop and Spark whenever we can to reduce friction and improve time to value. Recently, we’ve done some “smoothing” in the area of HDFS usage management. We’ve developed and are now providing HdfsUtils, a package of tools that helps customers to more quickly and easily manage their HDFS usage.
When using a cloud service like Altiscale’s, HDFS usage is typically part of the cost structure. This makes management of that usage particularly important. Altiscale’s HDFS usage alerting feature lets our customers know when they’re running up against their plan limits, but they still must manage their HDFS usage by archiving and deleting data as needed.
Data retention and deletion are much-discussed topics in the Hadoop space, and several very complex data management tools exist that promise the world. However, most of our customers only need the ability to easily find and delete old and large files, a relatively simple but important task. Hadoop provides a perfectly functional "delete," but finding the files on HDFS that qualify for deletion typically requires significant manual work using a combination of "hadoop fs -ls -R", "hadoop fs -count", and some scripting. A "deep clean" using this method can easily cost hours of effort.
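To make the pain concrete, here is one sketch of what that manual scripting typically looks like. The 500 MB threshold, the /user/joe path, and the output file name are illustrative assumptions, and the awk field position relies on the standard "hadoop fs -ls" column layout (field 5 is the size in bytes):

```shell
# Manual "deep clean" sketch: recursively list /user/joe, keep entries whose
# size column (field 5 in "hadoop fs -ls -R" output) exceeds 500 MB, and save
# them, largest first, for later review and deletion.
hadoop fs -ls -R /user/joe \
  | awk '$5 > 500 * 1024 * 1024 { print $5, $NF }' \
  | sort -rn \
  > large_files.txt
```

And this still says nothing about filtering by access or modification time, which needs yet more scripting on top.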
In response to customer feedback, Altiscale decided to make this cleanup easier by developing HdfsUtils, a package that includes two handy tools—hdls and hdfind. The fundamental goal of HdfsUtils is to provide users with a “find” command similar to the Unix “find.” To do so, we selected and implemented in hdfind the semantics of the most useful options offered by Unix “find.” Below are some examples of how to use the hdfind tool:
1. List all files and directories in your HDFS under the /user/joe directory that are 500 MB in size or larger. Descend to, at most, 4 levels of directories under the /user/joe directory:
$ hdfind /user/joe -minsize 500M -maxdepth 4
2. List all files and directories in your HDFS under the /user/joe directory that are 500 MB in size or larger and were last accessed at least 40 days ago. Descend to, at most, 4 levels of directories under the /user/joe directory:
$ hdfind /user/joe -minsize 500M -atime +40d -maxdepth 4
3. List all files and directories in your HDFS under the /user/joe directory that are 500 MB in size or larger and have been modified in the last 40 days. Descend to, at most, 4 levels of directories under the /user/joe directory. Also print the output in ls format with the sizes listed in human-readable form (as with ls -h):
$ hdfind /user/joe -minsize 500M -mtime -40d -maxdepth 4 -ls -h
Customers can run these sophisticated find options and save the output to a file for inspection before eventual deletion, or pipe the output to a "hadoop fs -rm" command to delete the matching files. As for the hdls tool, think of it as a more advanced version of "hadoop fs -ls" with some nice additions. You can run "hdfind -help" or "hdls -help" to see all the available options and how to use them.
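A review-then-delete workflow along these lines might look like the following sketch. It assumes hdfind's default output is one path per line, as with Unix find; the file name and the batch size passed to xargs are illustrative choices, not part of HdfsUtils:

```shell
# Find deletion candidates and save them for review before anything is removed.
hdfind /user/joe -minsize 500M -atime +40d > candidates.txt

# ... inspect candidates.txt by hand, removing any paths you want to keep ...

# After review, delete the files in batches of 100 paths per "hadoop fs -rm".
xargs -n 100 hadoop fs -rm < candidates.txt
```

Keeping the find and delete steps separate means a typo in the filter options costs you a re-run, not your data.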
Both hdfind and hdls are implemented on top of the WebHDFS REST API that is already part of the Altiscale Hadoop cluster. You may see better performance on some of these HDFS operations with HdfsUtils as compared to the typical "hadoop fs" calls, because the tools don't spin up a Java VM for each command invocation.
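For the curious, the REST calls behind these tools follow the standard WebHDFS URL scheme. The sketch below builds the LISTSTATUS request that a directory listing maps to; the NameNode hostname is a placeholder, and 50070 is assumed as the default WebHDFS port on Hadoop 2.x:

```shell
# Construct the WebHDFS LISTSTATUS URL for a directory listing
# (namenode.example.com is a placeholder for your cluster's NameNode).
NN="http://namenode.example.com:50070"
DIR="/user/joe"
URL="${NN}/webhdfs/v1${DIR}?op=LISTSTATUS"
echo "$URL"
# Against a reachable cluster, fetching this URL returns a JSON FileStatuses
# array with the size and timestamp fields that hdfind filters on:
#   curl -s "$URL"
```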
Altiscale plans to continue enhancing the HdfsUtils package and welcomes contributions to the code base. We've open sourced the project at https://github.com/Altiscale/HdfsUtils, and it's implemented as a Ruby gem. We hope this is another step in making Hadoop more user-friendly and consistent with the tools and patterns already familiar to enterprise-class organizations.