Altiscale is now part of SAP.


A big blog for Big Data.

By | December 8th, 2016 | Hadoop

From Counting Words to Counting Lines with Hadoop

What is Hadoop? There are several ways to answer this question when somebody new to the Big Data space throws it at you. Some folks with a delightful sense of humor might answer it this way: “Hadoop is an expensive and complicated platform for counting words.” You have probably noticed that word-count is the most popular example for getting started with Hadoop, and it is often the only example you will find on online forums. The word-count example captures the essence of Hadoop and the MapReduce paradigm while also being intuitive, simple, and easy to implement.
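To make the word-count idea concrete, here is a minimal in-memory sketch of the map and reduce steps in plain Python. It does not use Hadoop itself; the function names (`map_words`, `map_lines`, `run_mapreduce`) and the sample input are hypothetical and exist only for illustration. It also shows how small the change is when moving from counting words to counting lines: only the mapper differs.

```python
from collections import defaultdict


def map_words(line):
    """Mapper for word count: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)


def map_lines(line):
    """Mapper for line count: emit a single constant key once per line."""
    yield ("lines", 1)


def run_mapreduce(lines, mapper):
    """Simulate map, shuffle, and reduce in memory (not a real Hadoop job)."""
    groups = defaultdict(list)
    # Map phase plus shuffle: group every emitted value under its key.
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample input, standing in for lines of a file in HDFS.
sample = ["big data is big", "hadoop counts words"]
word_counts = run_mapreduce(sample, map_words)
line_counts = run_mapreduce(sample, map_lines)
```

The reducer is identical in both cases; swapping the mapper is what changes the question being answered.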

By | June 9th, 2016 | Hadoop

How to Identify and Resolve Hadoop NodeGroup Performance Problems Part 2.1: NodeGroup Performance on Containerized Clusters

In the earlier parts of this series, we showed how to achieve similar performance across hardware clusters with NodeGroup, and then started to experiment with NodeGroup on Docker containerized clusters. This blog post builds on the previous two posts in the Hadoop NodeGroup performance series to discuss how to identify and resolve NodeGroup performance problems on containerized clusters.

By | April 27th, 2016 | Hadoop, NodeGroup, Performance

Part 1.2: Investigation, Analysis, and Resolution of NodeGroup performance issues on Bare Metal Hardware clusters.

As we saw in Part 1.1 of our blog series, “How to Identify and Resolve Hadoop NodeGroup Performance Problems on Hardware clusters with no virtualization,” once we started to use the NodeGroup implementation, we observed a performance degradation on the DFSIO benchmark. In this blog, Part 1.2 of the series, we’re going to explain the steps we’ve taken to identify and resolve this problem.

By | March 8th, 2016 | Analytics, Big Data, Hadoop

Scheduling Jobs Using Cron or Oozie

Linux system administrators often use cron to schedule recurring Hadoop jobs. Examples of such jobs include processing data that has come in during the day to make it ready for analysis the following day, or running a background workflow at times when the cluster is not busy. However, we recommend using Oozie instead of cron for managing workflows in Hadoop, because Oozie is designed specifically to support Hadoop workloads and offers useful features that cron does not.

By | March 3rd, 2016 | Analytics, Big Data, Hadoop

Tips and Tricks for Running Spark on Hadoop, Part 4: Memory Settings

In Part 3 of this blog series, we discussed how to configure RDDs to optimize Spark performance. We outlined various storage persistence options for RDDs and how to size memory-only RDDs. In this blog, Part 4 of the series, we’ll discuss the memory requirements of other Spark application components. Although Spark does a number of things very well, it will not, unfortunately, intelligently configure memory settings on your behalf. So, we’ll outline how to determine how much memory is available for your RDDs and data so that you can adjust the command line parameters and configuration when you launch your Spark jobs.
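As a rough illustration of the kind of arithmetic involved, the sketch below estimates the YARN container size for one executor and the share of its heap left for cached RDDs. The function names are hypothetical, and the default values (a 10% overhead with a 384 MB floor, 300 MB reserved, and memory and storage fractions of 0.75 and 0.5) are assumptions that differ between Spark versions. Check the values of `spark.executor.memory`, the executor memory overhead setting, `spark.memory.fraction`, and `spark.memory.storageFraction` for your own release before relying on any of these numbers.

```python
def yarn_container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate the container size YARN must grant for one executor.

    The overhead fraction and minimum are assumed defaults, not universal values.
    """
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead_mb


def storage_memory_mb(executor_memory_mb, reserved_mb=300,
                      memory_fraction=0.75, storage_fraction=0.5):
    """Estimate the heap available for cached RDDs under unified memory management.

    All three tuning values are assumptions; pass the ones your Spark version uses.
    """
    usable_mb = max(0, executor_memory_mb - reserved_mb)
    return usable_mb * memory_fraction * storage_fraction


# Hypothetical example: an executor launched with 4 GB of heap.
container = yarn_container_mb(4096)
storage = storage_memory_mb(4096)
```

Under these assumed defaults, noticeably less than half of the requested executor heap ends up available for cached data, which is why sizing memory-only RDDs against the raw executor memory figure tends to overestimate what fits.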

By | March 1st, 2016 | Analytics, Big Data, Hadoop

How to Identify and Resolve Hadoop NodeGroup Performance Problems Part 1.1: Performance on Hardware Clusters with No Virtualization

In this blog series, we’ll discuss the performance of Hadoop NodeGroup for both hardware and virtualized clusters. We, at Altiscale, performed the work we’ll describe as a precursor to our launch of a new initiative that will employ NodeGroup to increase the performance, scalability, and customizability of container-aware Hadoop. This blog, Part 1.1 of the series, introduces and discusses the configuration of rack-aware replica placement with the NodeGroup implementation, and evaluates the performance of the DFSIO benchmark when NodeGroup is enabled.

By | January 26th, 2016 | Analytics, Big Data, Hadoop

Best Practices for Dynamic Partitioning in Hive

With its proven ability to speed performance, partitioning is a must-have feature in the tool set of any Big Data query engine. Hive is no exception; it has had partition support since its early versions. Although this blog will touch on static partitioning, it will primarily focus on when and how to best employ dynamic partitioning—a method we believe is often underutilized as an effective means for partitioning data and improving performance.

By | January 5th, 2016 | Analytics, Big Data, Hadoop, Security and Compliance

Achieving Regulatory Compliance When Employing Cloud Service Providers

Achieving regulatory compliance can be complicated, especially when using a service provider like Altiscale or Amazon Web Services (AWS). You may wonder: Is your organization responsible for every aspect of its regulatory compliance when using a service provider? How does the service provider fit in? Is the provider responsible for anything and, if so, what? And how can your organization make sure the provider is doing what it’s supposed to?

By | December 17th, 2015 | Analytics, Big Data, Hadoop

Apache Yetus: Faster, More Reliable Software Development

What makes good software? There is no shortage of books, papers, and opinions discussing the topic, and they often agree that two key characteristics are correctness and maintainability. To help teams develop correct and maintainable software, the Apache Software Foundation (ASF) has released the first version of a new, top-level project—Apache Yetus—that’s generating quite a bit of excitement within various Apache communities.