What is Hadoop? There are several ways to answer this question when somebody new to the Big Data space throws it at you. Some folks with a delightful sense of humor might answer it this way: “Hadoop is an expensive and complicated platform for counting words.” You have probably noticed that word-count is the most popular Hadoop example for getting started with the platform and is often the only example found on most online forums. The word-count example captures the essence of Hadoop and the MapReduce paradigm while also being intuitive, simple, and easy to implement.
What about counting lines, not words?
That said, I was recently pulled into a conversation with a customer who needed to quickly count the lines in a large dataset. The customer was looking at a large number of uncompressed text files dumped into an HDFS directory. Two smart engineers from the customer’s team started brainstorming how to accomplish this. One engineer suggested that they write Java code where the mapper counts the lines in individual files and sends the counts to the reducer, which then totals them. The other engineer suggested that this should be a Python script. It slowly dawned on them that it would take quite some work to go from doing word-count in Hadoop to doing line-count. Part of me was curious whether they were beginning to wonder if Hadoop was indeed a one-trick pony.
A simple answer using Unix commands
I then suggested they run a Hadoop streaming job that accomplishes this task with simple Unix commands, gluing together our good old “cat” and “wc” via the Hadoop streaming framework. Here’s the command that got the job done:
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar -input data_in -output data_out -mapper "/bin/cat" -reducer "/usr/bin/wc -l"
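This command read all the files in the data_in directory. The mappers ran cat, which passed every line through unchanged, and the reducer ran wc -l, which counted the lines it received. The total was stored in the data_out/part-00000 file, the name a streaming job gives to the output of its first reducer. This assumes the job runs with a single reducer, which is the default unless the cluster configuration says otherwise; with several reducers, each part file would hold only a partial count. The customer’s engineers were surprised by the simplicity of this approach and were happy to kick it off within a matter of seconds.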
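Before submitting a job like this, it can help to rehearse the same data flow on a local machine. The sketch below imitates the streaming job with an ordinary Unix pipe: cat plays the mapper, sort stands in for the shuffle between map and reduce, and wc -l plays the reducer. The helper name count_lines_local and the sample files are made up for illustration and are not part of Hadoop.

```shell
# Local dry run of the streaming data flow: mapper | sort | reducer.
# count_lines_local is a hypothetical helper, not part of Hadoop.
count_lines_local() {
    # cat "$1"/*  feeds every file in the directory, much as -input does
    # cat         is the mapper: it passes each line through unchanged
    # sort        stands in for the shuffle between map and reduce
    # wc -l       is the reducer: it counts every line it receives
    cat "$1"/* | cat | sort | wc -l
}

# Hypothetical sample data: two small files with 3 and 2 lines.
sample_dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$sample_dir/file1.txt"
printf 'd\ne\n' > "$sample_dir/file2.txt"

total=$(count_lines_local "$sample_dir")
# $((...)) strips the padding some wc implementations add around the number.
echo "total lines: $((total))"   # prints "total lines: 5"

rm -rf "$sample_dir"
```

If the local pipe gives the number you expect, the same two commands can go into the -mapper and -reducer options. Once the real job finishes, the result can be read back with hadoop fs -cat data_out/part-00000.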
The lesson: Hadoop streaming may be the easy ally you need
This line-count approach is simple, and it shows how the Hadoop streaming framework lets you run basic Unix commands as MapReduce tasks, just as you would on a local machine but in a distributed fashion. If basic scripting can accomplish a task for you locally, consider Hadoop streaming as a way to implement it in Hadoop before re-inventing the wheel with custom code. Streaming can be a friendly ally in the Hadoop world for many tasks.
Think you have a better way of doing line-count with Hadoop? Share it with us!