Altiscale is now part of SAP.

A big blog for Big Data.

March 8th, 2016 | Analytics, Big Data, Hadoop

Scheduling Jobs Using Cron or Oozie

Linux system administrators often use cron to schedule recurring Hadoop jobs. Examples of such jobs include processing data that arrived during the day so it is ready for analysis the following morning, or running a background workflow at times when the cluster is not busy. However, we recommend using Oozie instead of cron for managing workflows in Hadoop, because Oozie is designed specifically for Hadoop workloads and offers useful features that cron does not.
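To make the comparison concrete (paths, schedule, and host name below are hypothetical): a cron entry fires at a fixed wall-clock time regardless of cluster state, while an Oozie coordinator can retry failures, track job status, and gate execution on input data availability.

```shell
# cron: crontab entry that blindly fires at 02:00 every night
0 2 * * * /opt/etl/bin/daily-etl.sh >> /var/log/etl/daily.log 2>&1

# Oozie: submit a coordinator app; job.properties points at a
# coordinator.xml that defines the frequency, input datasets, and workflow
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
```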

March 3rd, 2016 | Analytics, Big Data, Hadoop

Tips and Tricks for Running Spark on Hadoop, Part 4: Memory Settings

In Part 3 of this blog series, we discussed how to configure RDDs to optimize Spark performance. We outlined various storage persistence options for RDDs and how to size memory-only RDDs. In this blog, Part 4 of the series, we’ll discuss the memory requirements of other Spark application components. Although Spark does a number of things very well, it will not, unfortunately, intelligently configure memory settings on your behalf. So, we’ll outline how to determine how much memory is available for your RDDs and data so that you can adjust the command line parameters and configuration when you launch your Spark jobs.
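As a rough illustration of the knobs involved (the sizes, class, and jar name are hypothetical, and exact parameter names vary by Spark version), a Spark 1.x job on YARN might be launched as:

```shell
# --executor-memory sets the JVM heap per executor; YARN additionally
# reserves spark.yarn.executor.memoryOverhead MB of off-heap headroom
# per container. In Spark 1.x, spark.storage.memoryFraction caps the
# share of the heap available for cached RDDs (0.6 by default).
spark-submit \
  --master yarn-cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --driver-memory 2g \
  --conf spark.yarn.executor.memoryOverhead=512 \
  --conf spark.storage.memoryFraction=0.6 \
  --class com.example.DailyAggregation daily-aggregation.jar
```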

March 1st, 2016 | Analytics, Big Data, Hadoop

How to Identify and Resolve Hadoop NodeGroup Performance Problems Part 1.1: Performance on Hardware Clusters with No Virtualization

In this blog series, we’ll discuss the performance of Hadoop NodeGroup on both hardware and virtualized clusters. At Altiscale, we performed the work we’ll describe as a precursor to our launch of a new initiative that will employ NodeGroup to increase the performance, scalability, and customizability of container-aware Hadoop. This blog, Part 1.1 of the series, introduces and discusses the configuration of rack-aware replica placement with the NodeGroup implementation, and evaluates the performance of the DFSIO benchmark when NodeGroup is enabled.
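For readers who want to reproduce such measurements, DFSIO ships with Hadoop as TestDFSIO in the MapReduce jobclient tests jar (the jar path below varies by distribution and version):

```shell
# Write 16 files of 1 GB each, then read them back; results are appended
# to TestDFSIO_results.log in the local working directory.
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO \
  -write -nrFiles 16 -fileSize 1024
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO \
  -read -nrFiles 16 -fileSize 1024
# Remove the benchmark's HDFS output when done
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -clean
```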

January 26th, 2016 | Analytics, Big Data, Hadoop

Best Practices for Dynamic Partitioning in Hive

With its proven ability to speed performance, partitioning is a must-have feature in the tool set of any Big Data query engine. Hive is no exception; it has had partition support since its early versions. Although this blog will touch on static partitioning, it will primarily focus on when and how to best employ dynamic partitioning—a method we believe is often underutilized as an effective means for partitioning data and improving performance.
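As a minimal sketch of what dynamic partitioning looks like in practice (table and column names are hypothetical): Hive derives the target partition from the trailing select column, rather than from a value hard-coded in the statement as static partitioning would require.

```sql
-- Dynamic partitioning is off by default; "nonstrict" mode also allows
-- every partition column to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each row is routed to the partition named by its computed dt value,
-- and new partitions are created on the fly.
INSERT OVERWRITE TABLE page_views PARTITION (dt)
SELECT user_id, url, event_time, to_date(event_time) AS dt
FROM raw_events;
```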

January 5th, 2016 | Analytics, Big Data, Hadoop, Security and Compliance

Achieving Regulatory Compliance When Employing Cloud Service Providers

Achieving regulatory compliance can be complicated—especially when using a service provider like Altiscale or Amazon AWS. You may wonder: Is your organization responsible for every aspect of its regulatory compliance when using a service provider? And how does the service provider fit in? Are they responsible for anything and, if so, what? How can your organization make sure your service provider is doing what it’s supposed to?

December 17th, 2015 | Analytics, Big Data, Hadoop

Apache Yetus: Faster, More Reliable Software Development

What makes good software? There is no shortage of books, papers, and opinions discussing the topic, and they often agree that two key characteristics are correctness and maintainability. To help teams develop correct and maintainable software, the Apache Software Foundation (ASF) has released the first version of a new, top-level project—Apache Yetus—that’s generating quite a bit of excitement within various Apache communities.

December 16th, 2015 | Analytics, Big Data, Hadoop

Altiscale’s HdfsUtils Equals Easier HDFS Management

At Altiscale, we’re constantly working to enhance the customer experience on our data cloud, while also staying committed to Hadoop’s open-source DNA. We smooth out the rough edges of Hadoop and Spark whenever we can to reduce friction and improve time to value. Recently, we’ve done some “smoothing” in the area of HDFS usage management. We’ve developed and are now providing HdfsUtils, a package of tools that helps customers to more quickly and easily manage their HDFS usage.
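HdfsUtils itself is Altiscale-specific, but the stock HDFS commands that this kind of usage tooling builds on look like the following (paths are hypothetical):

```shell
hdfs dfs -du -s -h /user/alice     # total space consumed under a directory
hdfs dfs -count -q /user/alice     # name/space quotas and remaining headroom
hdfs dfsadmin -report              # per-datanode and cluster-wide capacity
```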

October 20th, 2015 | Analytics, Big Data, Business Intelligence, Hadoop

Hitting a Big Data Wall? How to Improve Hadoop ROI

With the adoption of any new technology, businesses are tasked with measuring success and demonstrating value to the organization — Big Data is no different. As interest in Big Data grows, the value of the technology platform that delivers it must be accurately calculated and demonstrated to the rest of the organization. Hadoop is increasingly the platform of choice for Big Data, yet there is a common misconception that predicting and deriving ROI from Hadoop deployments is too challenging. According to a recent Gartner research report, 24 percent of organizations are not measuring the ROI of Big Data at all.

January 7th, 2015 | Analytics, Big Data, Data Science, Hadoop

Practical Tips for Making Hadoop a Productive Environment for Data Scientists

In attempting to work with Hadoop-based data, data scientists face two bad options: use Hadoop indirectly, engaging in a slow and error-prone back-and-forth with “data engineers” who translate their needs into Hadoop programs, or use Hadoop directly, via unfamiliar and unproductive command-line tools that are difficult to master.