Big Data works best in the cloud. That was my belief when I joined Altiscale, and my experience over the past year—watching the Big Data market and holding conversations with users and influencers—confirms that this is indeed the case.
You could sense cloud getting more attention at the major Big Data conferences this year. At both Hadoop Summit and Strata+Hadoop World, cloud featured prominently in several keynotes and demos on the main stage. Analysts, who have a pulse on the market from their interactions with clients, corroborate the view that Big Data attention is shifting to the cloud. Gartner reports that inquiries about Hadoop in the cloud increased 37.5% year over year in Q1 of this year, and estimates that cloud deployments already account for 47% of Hadoop revenue, a share that continues to grow. So why this burgeoning interest in cloud-based Big Data?
Big Data Forecast: Mostly Cloudy
Cloud has many natural advantages over traditional, on-premises deployments for Big Data workloads. For a start, cloud-based Hadoop is much easier and faster to set up. Hardware requirements can be onerous at Big Data scale, and cloud deployment immediately removes procurement, installation, and maintenance of infrastructure from the equation. Current on-premises Hadoop users often turn to cloud when faced with the cost and effort involved in upgrading existing clusters, while those new to Hadoop have shown a willingness to go directly to the cloud to accelerate their Big Data programs.
Cloud deployment also offers far greater resource efficiency, avoiding the low compute utilization that plagues on-premises clusters. Organizations commonly grow on-premises clusters to accommodate data growth, but each node they add scales storage and compute in lockstep, so the compute side of the hardware tends to sit heavily underutilized. Cloud-based Hadoop solves this problem by separating compute from storage, so that each can be scaled independently of the other.
Another limitation of the on-premises model is the requirement that clusters be sized for peak workload. This is particularly wasteful for workloads that use compute capacity in bursts. In contrast, cloud Hadoop deployments handle uneven workloads elegantly by quickly expanding and contracting to match changing compute requirements.
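To make the peak-sizing waste concrete, here is a back-of-the-envelope comparison. The node counts, demand curve, and hourly rate are illustrative assumptions, not figures from any vendor:

```python
# Illustrative comparison of a cluster sized for peak load vs. one that
# scales elastically with demand. All numbers are hypothetical.

HOURLY_NODE_COST = 1.00          # assumed cost per node-hour
PEAK_NODES = 100                 # nodes needed at the busiest hour

# Hypothetical daily demand: 8 busy hours at peak, 16 quiet hours needing 20 nodes.
hourly_demand = [PEAK_NODES] * 8 + [20] * 16

# On-premises: the cluster is provisioned for peak around the clock.
static_cost = PEAK_NODES * len(hourly_demand) * HOURLY_NODE_COST

# Elastic cloud: pay only for the nodes each hour actually needs.
elastic_cost = sum(hourly_demand) * HOURLY_NODE_COST

print(static_cost)    # 2400.0
print(elastic_cost)   # 1120.0
print(f"idle waste: {1 - elastic_cost / static_cost:.0%}")  # idle waste: 53%
```

Under these assumed numbers, more than half of the statically provisioned capacity is paid for but never used.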
Hadoop in the Cloud ≠ Cloud-Optimized Hadoop
Supposing the benefits of cloud for Big Data have won you over, why shouldn't you just move your existing on-premises setup onto your choice of Infrastructure-as-a-Service providers, such as Amazon? Quite simply, traditional Hadoop distributions are not designed to run well in the cloud. While cloud infrastructure offers the potential to adapt to different workloads, only cloud-optimized Hadoop provides the mechanisms to fully exploit this elasticity. Traditional Hadoop distributions grow and shrink based on manual or scheduled actions, whereas cloud-optimized Hadoop can scale automatically and, ideally, transparently to users. And when deciding whether to add or remove compute capacity, cloud-optimized Hadoop can look beyond infrastructure metrics and take Hadoop-level metrics as inputs as well.
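As a sketch of what "Hadoop metrics as inputs" might look like, the toy policy below scales on memory backlog rather than raw CPU utilization. In a real deployment these signals would come from something like the YARN ResourceManager; the function, thresholds, and node size here are illustrative assumptions, not any particular product's API:

```python
# Toy scaling policy driven by Hadoop-level signals rather than
# infrastructure metrics alone. All inputs and constants are hypothetical.

def scaling_decision(pending_mb: int, available_mb: int,
                     idle_nodes: int, min_nodes: int, current_nodes: int) -> int:
    """Return the number of nodes to add (positive) or remove (negative)."""
    if pending_mb > available_mb:
        # Jobs are queued waiting for containers: grow the cluster.
        deficit_mb = pending_mb - available_mb
        node_mb = 16 * 1024                  # assumed memory per node
        return -(-deficit_mb // node_mb)     # ceiling division
    if idle_nodes > 0 and current_nodes - idle_nodes >= min_nodes:
        # Spare capacity and no backlog: shrink, respecting a floor.
        return -min(idle_nodes, current_nodes - min_nodes)
    return 0

# A 40 GB backlog against 8 GB free suggests adding 2 nodes.
print(scaling_decision(40 * 1024, 8 * 1024, 0, 3, 10))   # 2
# Four idle nodes with no backlog suggests removing all four.
print(scaling_decision(0, 8 * 1024, 4, 3, 10))           # -4
```

The point of the sketch is the input, not the arithmetic: a policy watching queued Hadoop work can grow the cluster before CPU graphs ever show pressure, which is the kind of decision a purely infrastructure-level autoscaler cannot make.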
Cloud-optimized Hadoop also offers considerable cost advantages. Traditional Hadoop distributions have per-node pricing models that are incongruent with elastic cloud usage. Users may not have to provision cloud infrastructure for peak workload, but they will still have to license software for peak node usage in the traditional Hadoop model. Cloud-optimized Hadoop offers usage-based pricing that truly allows organizations to be cost-efficient in the cloud.
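A quick sketch of the licensing gap described above, using invented numbers rather than any vendor's actual pricing:

```python
# Hypothetical comparison of per-node licensing (billed at peak node count)
# vs. usage-based pricing (billed on node-hours consumed). Numbers invented.

PEAK_NODES = 50
AVG_NODES = 15                   # average nodes actually in use
HOURS_PER_MONTH = 730

PER_NODE_LICENSE = 100.0         # assumed monthly license fee per node
USAGE_RATE = 0.20                # assumed price per node-hour

per_node_bill = PEAK_NODES * PER_NODE_LICENSE            # licensed for peak
usage_bill = AVG_NODES * HOURS_PER_MONTH * USAGE_RATE    # pay for what runs

print(per_node_bill)  # 5000.0
print(usage_bill)     # 2190.0
```

Even with elastic infrastructure underneath, per-node licensing pins the software bill to the peak, while usage-based pricing tracks the average.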
With the secular trend towards cloud-based Big Data, organizations contemplating Big Data deployments would be well served to consider options that are native to cloud from a technical and cost perspective. As a leading Big Data cloud provider, Altiscale has been committed to making Big Data run as efficiently as possible in the cloud since day one.