HPE Ezmeral: Uncut

Modernizing your Big Data estate – the time has come

In the technology and data analytics space, I’m continually reminded that the only constant is change.


Our industry loves to innovate. Time and again we innovate to overcome immediate and future challenges, with solutions that address the need for more data, faster analytics, and better architecture. Innovation typically follows a trajectory: a ground-breaking advance, followed by years of incremental improvements that mature the offering and make it applicable to the masses. While these stepwise changes are usually easy to incorporate, the problem is that you have to implement the ground-breaking advance first. This transition usually requires process changes, training, re-architecture, and often a long, painful migration. Ultimately, this leads to technology hype cycles in which businesses individually assess whether and when the risk and struggle of making a change are worth the rewards of newer technology.

Your Big Data estate needs modernization

Hadoop is a great example of both sides of this phenomenon. Several years ago, when I was working at Teradata, Hadoop was the new innovation on the block. It came in fast and furious as the Enterprise Data Warehouse (EDW) killer in the early 2010s. The messaging and immature technology created confusion for many enterprises, but some early adopters cut their teeth on Hadoop and made it work. Over the years, the technology matured to the point that (nearly) everyone had a Hadoop-based data lake running in their data centers. Fast forward to 2020, and now Hadoop is on the other end of the technology cycle. The Hadoop ecosystem chugged along and evolved over the past decade, but several technology innovations have arrived in the meantime. The time has come to embrace these innovations and modernize your Big Data estate.

From my perspective there are four major ‘must have’ technology developments that impact the Big Data estate for enterprises today: 

  1. Containerization and Kubernetes are game changers: Containers (and Kubernetes orchestration) can deliver a lot of benefits for Big Data environments. With containers, you can separate compute and storage to right-size your solution, drive greater efficiency, and optimize the utilization of your compute. Containers also allow you to embrace the constantly evolving ecosystem of open-source tools, enabling your data analysts and data scientists to spin up their tools of choice in minutes while getting access to the data they need. Plus, you get application portability, flexibility, and agility, allowing your data-intensive apps to be quickly and easily deployed on-premises or in any cloud.
  2. Data is everywhere – on-prem, hybrid cloud, multi-cloud, and at the edge: Originally the Big Data estate for most enterprises was planted firmly on-premises. But more apps are being deployed in the public cloud, and often on multiple public clouds. And with improvements in compute power at the edge, the ever-increasing volume of data generated there, and network improvements, you need to think about your data globally: from edge to cloud. Your next Big Data platform needs to be able to adapt to the needs of your business and your data everywhere, with the flexibility for on-premises, hybrid cloud, multi-cloud, and edge computing deployments.
  3. The open source ecosystem continues to evolve: Enterprises need to future-proof their Big Data investments; as I mentioned above, the only constant is change. Over time, some vendors have focused on the pure open-source model – whereas others have provided value-add commercial software built on open-source technology. Turns out both approaches are right: you’re going to want optimized tools from your solution provider when it makes sense, but your future Big Data estate also needs to evolve with the speed of open-source innovation. By implementing a solution with the ability to deploy any open-source framework, you can be prepared for this constant evolution while giving your data scientists access to the latest open-source toolkits (including the latest versions of Apache Spark, Apache Kafka, TensorFlow, Kubeflow, Jupyter notebooks, and more).
  4. Make the infrastructure invisible – while ensuring performance, resiliency, security, and high availability: I remember a conversation about Hadoop with a CTO circa 2017 … I was working at EMC at the time, and we were discussing the benefits of NVMe and shared enterprise storage to improve the performance, efficiency, and resiliency for analytics-focused data lakes. His response was: “You’re all about infrastructure, we don’t care about the infrastructure.” I’ve since embraced this mantra (after all, data science teams don’t want to have to worry about the underlying storage, compute, and networking), but infrastructure is still very important. We can hide the complexity of the infrastructure, making app deployment as easy and as seamless as possible. But how many times have you seen a pilot project fail in production? If you don’t architect your solution to ensure security, performance, and other enterprise-grade requirements, it won’t make it in production – and ultimately, it won’t deliver business value.

Reflecting on it now, I realize that what he meant was, ‘That’s going to be hard to take advantage of unless we re-architect a whole bunch of stuff, and right now the risk isn’t worth the reward’. Luckily for your next Big Data solution, there are many options to consider beyond the stone age days of triple+ replicated DAS running on storage-bound pizza boxes. First off, decoupling compute and storage is table stakes for your next solution: it will make large-scale enterprise deployments easier, faster, and more cost-effective. Next, there is a plethora of prime-time ready infrastructure advancements that your solution architecture should be capable of taking advantage of: NVMe for performance, accelerated compute for analytics and AI/ML, and shared enterprise storage like a NAS or an object store for exabyte-scale price/performance.
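
To make the right-sizing argument a little more concrete, here is a minimal back-of-the-envelope sketch in Python. It compares how many nodes a coupled, triple-replicated DAS cluster needs against the compute nodes needed once storage is sized separately. Every number and name in it (the workload size, the per-node capacity, the function names) is a hypothetical assumption chosen for illustration, not a measurement or a sizing recommendation.

```python
import math


def coupled_nodes(data_tb, cores_needed, tb_per_node, cores_per_node, replication=3):
    """Nodes needed when each node supplies both disk and CPU (DAS-style).

    The cluster must be big enough for whichever resource runs out first,
    so the node count is the larger of the two requirements.
    """
    raw_tb = data_tb * replication  # replicated copies consume raw capacity
    nodes_for_storage = math.ceil(raw_tb / tb_per_node)
    nodes_for_compute = math.ceil(cores_needed / cores_per_node)
    return max(nodes_for_storage, nodes_for_compute)


def decoupled_compute_nodes(cores_needed, cores_per_node):
    """Compute nodes needed when storage lives on a separate shared tier."""
    return math.ceil(cores_needed / cores_per_node)


# Hypothetical workload and hardware figures (assumptions, not benchmarks).
data_tb = 1000        # usable data to store
cores_needed = 400    # cores the analytics jobs actually need
tb_per_node = 48      # raw disk capacity per node
cores_per_node = 32   # CPU cores per node

coupled = coupled_nodes(data_tb, cores_needed, tb_per_node, cores_per_node)
decoupled = decoupled_compute_nodes(cores_needed, cores_per_node)
idle_cores = coupled * cores_per_node - cores_needed

print(f"Coupled cluster nodes:        {coupled}")
print(f"Decoupled compute nodes:      {decoupled}")
print(f"Cores bought only to add disk: {idle_cores}")
```

In this made-up scenario the coupled cluster is sized entirely by storage, so most of its cores exist only because disks came attached to them. The sketch deliberately leaves out the cost of the shared storage tier in the decoupled case; a real comparison would need to price that in.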

Is the risk worth the reward?

Cloudera, the most popular Hadoop distribution, is trying to keep up with these changes and they’ve been working on modernizing their solution for quite some time now. The new Cloudera has been busy integrating the Hortonworks (HDP) and Cloudera (CDH) code bases into the Cloudera Data Platform (CDP). After years of fighting containerization and compute/storage separation, Cloudera is now embracing these innovations to enable multi-cloud app deployments. And they’ve realized that many enterprises are moving away from HDFS; so now they’re moving to the upstart of distributed object storage, Apache Ozone.  

With HDP and CDH nearing end of life, enterprises that have built their Big Data estates on these technologies are at the crossroads of an imminent major software upgrade and full data migration.  So now is the time to ask the difficult questions: 

  • Am I currently getting the value I was expecting from my data lake? What extra value do I get when I upgrade?
  • What will the integrated solution look like? What features / apps will still be there?
  • What is the roadmap? Will it change if Cloudera is acquired?
  • Do I have to upgrade? How do I do it? How long will it take? How much will it cost? When do I lose support on my current version?
  • Will I be locked into Cloudera’s proprietary apps? How easy is it to bring in the latest open-source tools that my data science teams want?
  • Is Apache Ozone ready for primetime? Should I trust it with my data?
  • Is the risk worth the reward or should I consider another strategic solution (and another strategy partner) to modernize my Big Data estate?

Hewlett Packard Enterprise can help

We realize that enterprise organizations – and their business-critical, data-intensive applications – are caught in this storm of uncertainty and change. Unfortunately, there’s no easy button since each organization has its own requirements. But HPE can help customers navigate this process and we have the complete portfolio of solutions, expertise, and support to help you modernize your Big Data estate. 

To start with, we created our HPE AMP Assessment Program. The goal is to help clients answer these difficult questions and de-risk the modernization of their Big Data estate. With this offering, HPE will deeply Analyze your current-state platform, provide a detailed Map to modernize it in a way that meets the business needs of your organization, and finally Prescribe a systematic plan to get you there. And based on the output of the AMP Assessment, HPE can draw on its entire software, hardware, and services arsenal to deliver the right solution for your specific needs.

The HPE Ezmeral Container Platform (formerly BlueData) has been providing the ability to containerize Big Data environments – and deliver on the benefits of compute/storage separation – since 2014. The platform is QATS certified by Cloudera for both HDP and CDH, providing a secure and powerful containerization solution that enables agile application development for data-intensive workloads. The result is dramatically faster deployments – from months to minutes – giving data science teams the agility they need to innovate faster and leverage the open-source ecosystem to get the most value from their data.  

Our container platform also allows organizations the flexibility to deploy Cloudera and/or open source apps on-premises, in the public cloud, or in a multi-cloud environment – including machine learning workloads with our HPE Ezmeral ML Ops solution. There’s no application lock-in: customers can add their preferred stack to the platform and innovate at the speed of open source by adding their own container images to the built-in “app store” in a matter of minutes.

Moreover, our HPE Ezmeral Data Fabric (formerly MapR) offers a feature-rich alternative to upgrade and modernize any Big Data environment – with high availability, complete data protection, and improved performance / scalability. You can learn more about what’s new with HPE Ezmeral Data Fabric here. It’s built from the ground up for edge to cloud deployments and is available as a stand-alone solution or packaged with HPE Ezmeral Container Platform to deliver on the benefits of Kubernetes and containerization.

We also have our HPE Pointnext consulting, advisory, implementation, and support services, with the expertise of 10,000+ installs to deliver on your vision. Our HPE GreenLake as-a-service offerings provide choice in how you consume our solutions, including new pay-per-use cloud services for Containers and Machine Learning Operations. Best of all, these solutions can be fully managed for you by HPE. There’s no patching, performance tuning, or maintenance: you just get to focus on what you do best, your business.

To learn more

If you’re looking for straight talk on this topic, I’m hosting a Modern Big Data Solutions Roundtable webinar next week to get the industry perspective from three of our Big Data CTOs.  I’ll be asking questions about Cloudera upgrade/migration paths, Kubernetes and containerization, open-source options and tools, and hybrid cloud / multi-cloud deployments.  Please join us live on September 1st at 8 AM Pacific / 11 AM Eastern.



Matt Hausmann


About the Author


Over the past decades, Matt has had the privilege to collaborate with hundreds of companies and experts on ways to constantly improve how to turn data into insights. This continues to drive him as the ever-evolving analytics landscape enables organizations to continually make smarter, faster decisions.