Linux System Admins often use cron to schedule recurrent Hadoop jobs. Examples of such jobs might include processing data that has come in during the day to make it ready for analysis the following day, or running a background workflow at times when the cluster is not busy. However, we recommend using Oozie instead of cron for managing workflows in Hadoop. This is because Oozie is specially designed to support Hadoop workloads and offers useful features that cron does not.
For instance, when attempting to run a cron job on the Altiscale Workbench, the job may fail because cron does not automatically pick up the Hadoop environment variables defined on the workbench. This is cron’s default behavior. So, if you haven’t set all the Hadoop environment variables, a distcp job (run via cron) trying to copy from an s3 location to /data on HDFS may fail because it doesn’t have permission to write to that directory.
If you’re nonetheless determined to continue using cron for Hadoop workflow management, you can solve this problem by:
- putting source
/etc/profilein crontab or your cron script prior to job execution, or
/etc/profile.d/hadoop-env.shin your crontab or cron script.
However, we recommend that you save yourself the time and trouble and use the Oozie coordinator instead.
This blog describes in detail additional benefits of using Oozie. Here are the most important:
- Oozie has built-in Hadoop actions, making Hadoop workflow development, maintenance, and troubleshooting easy.
- Oozie is well-integrated with Hadoop security and Kerberos.
- With the Oozie UI it’s easier to drill down to specific errors in the data nodes. Other systems require significantly more work to correlate YARN jobs with the workflow actions.
- Oozie has been proven to scale in some of the world’s largest clusters.
- Oozie coordinator supports data dependency, so it allows triggering actions whenever files arrive in HDFS.
Oozie coordinator’s support for data dependency can be just as important as triggering jobs based on time for many use cases. For instance, imagine your previous day’s data is expected to arrive at 1 a.m. and cron is setup to process it at 2 a.m. However, once or twice a month, your upstream data is delayed. Regardless, cron will still run at 2 a.m. and either fail, or even worse, succeed with partial or no data. This is where the Oozie coordinator’s data dependency comes in handy. You can schedule the coordinator job to kick-off at 2 a.m., but define it to wait for the data to be available before actually running the job. Here are some other interesting use cases for the Oozie coordinator.
Last but not least, if users still want to use cron-like syntax because of its familiarity, Oozie cron syntax allows specification of the coordinator frequency, just as in cron. So you can feel like you’re using cron, but with all the benefits of Oozie.
With thanks to the fabulous Ranjana Rajendran, who contributed to this article.