Blog

A big blog for Big Data.
From Counting Words to Counting Lines with Hadoop

December 8th, 2016 | Hadoop

What is Hadoop? There are several ways to answer this question when somebody new to the Big Data space throws it at you. Some folks with a delightful sense of humor might answer it this way: “Hadoop is an expensive and complicated platform for counting words.” You have probably noticed that word-count is the most popular Hadoop example for getting started with the platform, and often the only example found in online forums. The word-count example captures the essence of Hadoop and the MapReduce paradigm while also being intuitive, simple, and easy to implement.
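For reference, here is the canonical word-count job, essentially as it appears in the Apache Hadoop MapReduce tutorial (the input and output paths are whatever you pass on the command line; nothing here is Altiscale-specific):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in every input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Moving from counting words to counting lines, as the title suggests, is a small twist on the same pattern: instead of emitting one (word, 1) pair per token, the mapper emits a single fixed key once per input record, and the reducer sums the counts exactly as before.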

Scheduling Jobs Using Cron or Oozie

March 8th, 2016 | Analytics, Big Data, Hadoop

Linux system administrators often use cron to schedule recurring Hadoop jobs. Examples include processing data that arrived during the day so it is ready for analysis the following morning, or running a background workflow when the cluster is not busy. However, we recommend using Oozie instead of cron for managing workflows in Hadoop, because Oozie is designed specifically for Hadoop workloads and offers useful features that cron does not.
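To make the comparison concrete, here is a minimal sketch (the job name and paths are hypothetical, not from the post). With cron, a nightly job is a one-line crontab entry; the Oozie equivalent is a coordinator application that triggers a workflow on the same cadence:

```xml
<!-- Hypothetical Oozie coordinator replacing a crontab entry such as:
     0 1 * * * /home/analyst/bin/nightly-etl.sh
     It runs the workflow stored in HDFS once a day at 01:00 UTC. -->
<coordinator-app name="nightly-etl" frequency="${coord:days(1)}"
                 start="2016-03-08T01:00Z" end="2017-03-08T01:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS path to the directory containing workflow.xml -->
      <app-path>hdfs:///user/analyst/apps/nightly-etl</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Unlike the crontab entry, a coordinator can also be extended with datasets and input events, so the workflow starts when its input data actually lands in HDFS rather than at a fixed wall-clock time.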

Altiscale’s HdfsUtils Equals Easier HDFS Management

December 16th, 2015 | Analytics, Big Data, Hadoop

At Altiscale, we’re constantly working to enhance the customer experience on our data cloud while staying committed to Hadoop’s open-source DNA. We smooth out the rough edges of Hadoop and Spark whenever we can to reduce friction and improve time to value. Recently, we’ve done some “smoothing” in the area of HDFS usage management: we’ve developed and are now providing HdfsUtils, a package of tools that helps customers manage their HDFS usage more quickly and easily.

Taming a Hive Query Gone Wild

September 2nd, 2014 | Big Data, Hadoop, HIVE, Uncategorized

One of my colleagues is fond of reminding people that Hadoop is perfectly happy to let you do bad things at scale. The point is that, given the massive amounts of data and computational capacity available in a Hadoop environment, bad code or careless mistakes can be far more costly than in traditional data environments.
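As a flavor of what a query gone wild can look like, consider a hypothetical example (the table names are invented, and this is not the query from the post):

```sql
-- Intended query: match each click to the user who made it.
SELECT u.user_id, c.url
FROM users u
JOIN clicks c ON (u.user_id = c.user_id);

-- The same query with the ON clause accidentally dropped. Hive treats
-- this as a cross product: with a million users and a billion clicks,
-- that is on the order of 10^15 intermediate rows. A traditional
-- database would choke quickly; a large Hadoop cluster will happily
-- grind away at it. A bad thing, at scale.
SELECT u.user_id, c.url
FROM users u
JOIN clicks c;
```

Guardrails such as hive.mapred.mode=strict exist precisely to reject unbounded cartesian products like this before they reach the cluster.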