A big blog for Big Data.

March 3rd, 2016 | Analytics, Big Data, Hadoop

Tips and Tricks for Running Spark on Hadoop, Part 4: Memory Settings

In Part 3 of this blog series, we discussed how to configure RDDs to optimize Spark performance, outlining the storage persistence options for RDDs and how to size memory-only RDDs. In this post, Part 4 of the series, we’ll discuss the memory requirements of the other components of a Spark application. Although Spark does a number of things very well, it unfortunately does not configure memory settings intelligently on your behalf. So we’ll outline how to determine how much memory is available for your RDDs and data, so that you can adjust the command-line parameters and configuration when you launch your Spark jobs.
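As a rough illustration of the kind of arithmetic involved, here is a minimal Python sketch that estimates how much of an executor's heap is initially set aside for cached RDD blocks. The function name `storage_memory_mb` is hypothetical, and the default values (300 MB reserved, a 0.75 memory fraction, a 0.5 storage fraction) are assumptions based on the unified memory manager defaults in Spark 1.6; other Spark versions use different defaults, and the full post may work from different figures, so check the configuration documentation for your version.

```python
def storage_memory_mb(executor_heap_mb,
                      reserved_mb=300,        # assumed: heap reserved by Spark itself
                      memory_fraction=0.75,   # assumed: spark.memory.fraction (Spark 1.6 default)
                      storage_fraction=0.5):  # assumed: spark.memory.storageFraction default
    """Estimate the heap (in MB) initially set aside for cached RDD blocks in one executor."""
    # Heap left after the fixed reservation; never negative.
    usable_mb = max(executor_heap_mb - reserved_mb, 0)
    # Portion shared by execution and storage.
    unified_mb = usable_mb * memory_fraction
    # Portion of the shared region protected for storage.
    return unified_mb * storage_fraction


if __name__ == "__main__":
    # Example: an executor launched with a 4 GB heap (e.g. --executor-memory 4g).
    print(storage_memory_mb(4096))
```

Under these assumed defaults, well under half of the requested executor heap is set aside for cached data, which is why the sizing needs to be worked out rather than guessed.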

March 1st, 2016 | Analytics, Big Data, Hadoop

How to Identify and Resolve Hadoop NodeGroup Performance Problems Part 1.1: Performance on Hardware Clusters with No Virtualization

In this blog series, we’ll discuss the performance of Hadoop NodeGroup on both hardware and virtualized clusters. At Altiscale, we performed the work described here as a precursor to launching a new initiative that will employ NodeGroup to increase the performance, scalability, and customizability of container-aware Hadoop. This post, Part 1.1 of the series, introduces the configuration of rack-aware replica placement with the NodeGroup implementation and evaluates the performance of the DFSIO benchmark when NodeGroup is enabled.

January 26th, 2016 | Analytics, Big Data, Hadoop

Best Practices for Dynamic Partitioning in Hive

With its proven ability to improve performance, partitioning is a must-have feature in the tool set of any Big Data query engine. Hive is no exception; it has supported partitions since its early versions. Although this post will touch on static partitioning, it will focus primarily on when and how best to employ dynamic partitioning, a method we believe is often underutilized as an effective means of partitioning data and improving performance.

December 17th, 2015 | Analytics, Big Data, Hadoop

Apache Yetus: Faster, More Reliable Software Development

What makes good software? There is no shortage of books, papers, and opinions on the topic, and they often agree that two key characteristics are correctness and maintainability. To help teams develop correct and maintainable software, the Apache Software Foundation (ASF) has released the first version of a new top-level project, Apache Yetus, that’s generating quite a bit of excitement within various Apache communities.

December 16th, 2015 | Analytics, Big Data, Hadoop

Altiscale’s HdfsUtils Equals Easier HDFS Management

At Altiscale, we’re constantly working to enhance the customer experience on our data cloud while staying committed to Hadoop’s open-source DNA. We smooth out the rough edges of Hadoop and Spark whenever we can to reduce friction and improve time to value. Recently, we’ve done some “smoothing” in the area of HDFS usage management: we’ve developed, and now provide, HdfsUtils, a package of tools that helps customers manage their HDFS usage more quickly and easily.