HPE Ezmeral: Uncut

Optimize your big data environment with the RET principle

When big data environments are large, complex, and siloed, IT teams spend more time managing, patching, and upgrading systems instead of helping the business solve problems with advanced analytics. But if we can think about the infrastructure in terms of workload using the RET principle, we can gain greater utilization of the infrastructure and keep up with the demands of the business. Watch this lightboard video to learn about the RET principle.

Is your existing big data environment getting out of hand? Is it difficult to manage? Do you struggle to keep up with the demands from the business? If you answered yes to any of these questions, you’re not alone.

Big data infrastructure challenges

A common challenge I encounter when I speak to customers about their existing big data environments is “Hadoop Sprawl.” What does that mean? Much like the database sprawl issues of the 2000s, it means that there are far too many Hadoop clusters throughout the enterprise, which leads to the proliferation of silos. With disparate environments (such as lab, dev, UAT/integration, and production), IT teams spend a majority of their time managing, patching, and upgrading these systems instead of working with the business to solve difficult problems through advanced analytics.

The bigger problem with having multiple unique environments, however, is data duplication and the inefficient use of the infrastructure budget that supports them. Data duplication across the environments creates a problem of data drift, meaning the copies don’t stay up to date with the system of reference data. As a consequence, analytics and data science teams get inconsistent results from the different systems, and inefficiencies in the infrastructure hinder productivity and erode confidence in the results of their work.

If this problem is so common, cutting across organizations of all sizes and all industries, how is it that we got here? The answer lies in the rapid change of technology in this space outpacing IT’s ability to adapt. The traditional Hadoop stack used to be pretty simple—HDFS for storage, MapReduce for processing, and a few databases for surfacing the data to a limited set of applications. Over the past decade, the number of additional components in the Hadoop stack as well as databases and processing components has exploded. This means that the traditional monolithic tightly coupled architecture of these applications (i.e. deploying them all together) is no longer the most efficient way to deploy these systems. The problem with deploying these systems as monoliths is that every time one of the components needs to be patched or upgraded, the whole system must also be redeployed. As the system becomes bigger and more complex, this process becomes time-consuming and error-prone, resulting in updates happening less and less frequently (or alternatively, resulting in more downtime for the end-users).

A workloads perspective

What’s the alternative? Looking at the world of cloud-native application development, we can borrow the pattern of decoupling the components of the monolith, similar to how modern applications are composed of microservices. We don’t need to break apart the system in quite the same way as stateless microservices-based applications, but we can apply the same concept and decouple the monolith into its major components. Applications like Spark, Kafka, and Hive can all be independently deployed and scaled based on the needs of the organization. Additionally, by separating the application/compute components from the data/storage components, we can scale each independently.

When deciding which components to decouple, we need to use a set of criteria to determine if they are candidates for this process. I like to use the RET principle for determining if a big data job or application is a good fit: 

  • Restartable: If the job fails, can we restart it without affecting the other users or system?
  • Ephemeral: Is the application created and destroyed on demand, and is it short-lived?
  • Temporal: Does the job have a well-defined run time?  
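
As a rough illustration, the checklist above could be captured in a few lines of Python. This is only a sketch under stated assumptions: the Workload class, the ret_score and is_decoupling_candidate helpers, and the two example workloads are hypothetical names invented here, and the rule that all three criteria must hold is an assumption rather than something the RET principle prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """A big data job or application, described by the three RET questions."""
    name: str
    restartable: bool  # can it be restarted without affecting other users or systems?
    ephemeral: bool    # is it created and destroyed on demand, and short-lived?
    temporal: bool     # does it have a well-defined run time?


def ret_score(workload: Workload) -> int:
    """Count how many of the three RET criteria the workload satisfies (0 to 3)."""
    return sum((workload.restartable, workload.ephemeral, workload.temporal))


def is_decoupling_candidate(workload: Workload) -> bool:
    """Assumption for this sketch: a candidate must satisfy all three criteria."""
    return ret_score(workload) == 3


# Hypothetical workloads, used only to exercise the checklist.
nightly_etl = Workload("nightly_spark_etl", restartable=True, ephemeral=True, temporal=True)
shared_metastore = Workload("shared_hive_metastore", restartable=False, ephemeral=False, temporal=False)

for w in (nightly_etl, shared_metastore):
    print(f"{w.name}: RET score {ret_score(w)}/3, candidate: {is_decoupling_candidate(w)}")
```

In practice, a team might treat a partial score as a prompt for discussion rather than a hard rejection; the strict all-three rule here simply keeps the example small.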

This is where thinking about the infrastructure in terms of workloads, instead of monolithic silos, can allow us to get greater utilization of the infrastructure and to keep up with the demands of the business. Check out my video to learn more about applying the RET principle to optimize your big data environment and boost productivity.

Learn more about HPE BlueData Software.

Matt Maccaux
Hewlett Packard Enterprise


About the Author


As Global Field CTO for HPE enterprise software solutions, Matt brings deep subject-matter expertise in big data analytics and data science, machine learning, blockchain, and IoT as well as cloud, virtualization, and containerization technologies.