HPE Ezmeral: Uncut

If HPE Ezmeral Data Fabric is the answer, what is the question?

People don’t ask for things they think cannot be done, but HPE Ezmeral Data Fabric solves problems that many have written off as impossible. In this blog, I discuss key questions for which this unique data fabric provides the answer.


There’s an overarching question about how data infrastructure can make your business work well, a question that sets the context for other specific issues.

The question:

How can I have data at scale--when I need it, anywhere I need it--so I can run the right applications against the right data?

The answer:

HPE Ezmeral Data Fabric, because the right data fabric can change the way you work.

Let me explain. If you design and build infrastructure in a streamlined way at a foundational level, the impact of that foundation ripples out across your entire enterprise landscape, simplifying the challenges you face. HPE Ezmeral Data Fabric is that streamlined foundation. Here’s why:

  • HPE Ezmeral Data Fabric is unique

It’s not just another name for a data lake. HPE Ezmeral Data Fabric does what a data lake should do and more, but without many of the difficulties.

  • HPE Ezmeral Data Fabric is a single system

It’s not a set of separate components stitched together with connectors and called a data platform. Instead, it is a unified technology, engineered from the ground up as a single system: a highly scalable distributed file system with built-in event streaming and a document-style database.

  • HPE Ezmeral Data Fabric moves data simply

It transfers data or applications easily between edge, core, and cloud through bi-directional replication of tables and streams or via incremental mirroring.

  • HPE Ezmeral Data Fabric works with legacy applications

It is not HDFS, even though anything written for Hadoop can access data directly from the HPE Ezmeral Data Fabric. Surprisingly, applications using conventional file access methods from languages such as Python, C++, Go, or anything else that runs on Linux can also access data directly from the data fabric. Even data created using other access methods, such as the S3 API, can be accessed directly from the data fabric.
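To make the point about conventional file access concrete, here is a minimal sketch in Python. It assumes data-fabric files appear under an ordinary POSIX mount point (commonly a path such as /mapr/&lt;cluster-name&gt;); since that mount is not available everywhere, the sketch substitutes a temporary directory so it runs anywhere, and the mount path is an assumption for illustration.

```python
import os
import tempfile

# Stand-in for a data-fabric mount point such as /mapr/<cluster-name>;
# a temporary directory is used here so the sketch is runnable anywhere.
mount_point = tempfile.mkdtemp()

# Plain open()/read()/write() -- no special client library required,
# which is the whole point of conventional (POSIX) file access.
path = os.path.join(mount_point, "sensor_readings.csv")
with open(path, "w") as f:
    f.write("sensor_id,reading\ns1,42.0\n")

with open(path) as f:
    header = f.readline().strip()

print(header)  # prints "sensor_id,reading"
```

The same pattern applies from C++, Go, or any other language with ordinary file I/O, because the data fabric exposes its data through the standard file system interface.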

Now that we’ve laid the foundation of what HPE Ezmeral Data Fabric is, let’s dig deeper into the more specific questions you must answer in order to meet essential business challenges.

1) How can I meet my data service level agreements (SLAs)?

When your applications and business processes matter, you have data SLAs that must be met. It’s critical for data availability and speed of access to NOT be a bottleneck to success. To meet low latency requirements, you need efficient data collection along with speed and ease of data access. HPE Ezmeral Data Fabric is fast because it stretches from data sources across your enterprise landscape, providing efficient parallel access to data wherever you need it. 

The self-healing capability of the data fabric gives you unsurpassed availability and reliability as well. Take the example of the Aadhaar card authentication project for the Unique Identification Authority of India (UIDAI). This government-supported service requires availability anytime, anywhere in India. It has relied on HPE Ezmeral Data Fabric (previously known as MapR) for years without downtime.

Meeting SLAs also requires that your application does not collide with other workloads. You don’t want unpredictability due to noisy neighbors. HPE Ezmeral Data Fabric enables efficient multi-tenancy to make that practical.

The key lesson here is that you are more likely to meet SLAs when many aspects of data logistics are handled at the platform level rather than the application level. Keep in mind that the data fabric does this for mainstream business systems -- this is not just a specialized system for experimental work with large-scale data. 

Of course, without these advantages, people build workarounds, such as adding more servers and more clusters to try to meet SLAs. The result is a cumbersome, sprawling, expensive system you may want to avoid. That brings us to the next question.

2) How can I scale performance and volume without explosive costs?

Data at large scale presents challenges, but it is possible to work with a system that meets your needs in a practical, cost-effective way today and scales for performance and data volume in the future--without imposing explosive costs. Remember that scale is not just about total volume of data; there’s also the challenge of dealing with huge numbers of files across many locations or very high data rates. 

HPE Ezmeral Data Fabric is engineered to handle trillions of files, thousands of nodes, and hundreds of petabytes in multiple clusters at different geographical locations, whether on premises, in the cloud, or in a hybrid architecture. It was built to scale without explosive costs.

An excellent example of extreme scale done well is how manufacturers of autonomous cars use HPE Ezmeral Data Fabric to deal with a deluge of data from edge sources and move it as needed to a core data center for additional analysis and long-term storage.


In the image above, a fleet of test cars, each equipped with many sensors and cameras, uploads up to 5 PB per day to field station clusters running HPE Ezmeral Data Fabric. Data is pre-processed and quality checked, and various analytics are performed, before selected data sets are transferred to a core cluster at a data center. This data motion is handled by the HPE Ezmeral Data Fabric, which also runs on the core cluster. Hundreds of PB are retained over the long term and must be rescanned frequently.

Your data scale may not be this extreme (yet), but you should be able to deal with current and future data quantities without extreme costs or having to rebuild and repurpose your infrastructure. One way HPE Ezmeral Data Fabric helps is by automatically managing data placement according to data temperature (hot, warm, or cold), based on frequency of access. This automated data tiering optimizes resource use by managing capacity flexibly and efficiently.
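The data fabric automates tiering by policy, but the underlying idea (classify data by access recency, then place it accordingly) can be sketched in a few lines of Python. This is a conceptual illustration only, not the product’s mechanism, and the 7-day and 90-day thresholds are made-up example values.

```python
import time

DAY = 86400  # seconds per day

def temperature(last_access_ts, now=None):
    """Classify data as hot/warm/cold by how recently it was accessed.
    Thresholds (7 and 90 days) are illustrative, not product defaults."""
    now = now if now is not None else time.time()
    age_days = (now - last_access_ts) / DAY
    if age_days <= 7:
        return "hot"    # keep on fast storage
    if age_days <= 90:
        return "warm"   # standard capacity tier
    return "cold"       # candidate for cold/object storage

now = time.time()
print(temperature(now - 2 * DAY, now))    # prints "hot"
print(temperature(now - 30 * DAY, now))   # prints "warm"
print(temperature(now - 200 * DAY, now))  # prints "cold"
```

In the data fabric itself, this kind of classification and the resulting data movement happen automatically at the platform level, so applications never need logic like this.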

3) What about legacy applications and tools -- how can I avoid having to recode everything?

Making use of innovative new approaches for large-scale data should NOT require that you throw away or rewrite everything you’ve already got. With HPE Ezmeral Data Fabric, you don’t have to. It provides a fully read/write highly scalable distributed file system with built-in database and event streaming capabilities. Furthermore, the data fabric allows Linux applications to access data via conventional file access methods (POSIX or NFS). 

An example of the familiarity and simplicity of working with the HPE Ezmeral Data Fabric is shown in the following figure:


The diagram above shows the result of listing directory contents for data stored in the HPE Ezmeral Data Fabric using the standard Linux tool "ls". The contents include files and sub-directories as you might expect, but they also include message streams (accessible using the Kafka API) and a database table.
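A stand-in sketch of that listing in Python: on a real data-fabric mount, a directory inspected with "ls" (or os.scandir) can contain tables and message streams alongside ordinary files and sub-directories. Since no such mount is assumed here, the sketch creates only a plain file and a sub-directory in a temporary location so it runs anywhere.

```python
import os
import tempfile

# Create a directory with an ordinary file and a sub-directory; on a
# data-fabric mount, database tables and message streams would show up
# in the same listing as first-class directory entries.
d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, "archive"))               # a sub-directory
open(os.path.join(d, "events.log"), "w").close()   # an ordinary file

entries = sorted(entry.name for entry in os.scandir(d))
print(entries)  # prints "['archive', 'events.log']"
```

The key point is that standard directory-listing tools and APIs work unchanged; no special client is needed to see what the data fabric stores.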

What does this mean for your organization? You can use existing applications along with new approaches and new data sources, all in one uniform system. Again, remember that the data fabric is the foundation of mainstream business processes. It also means that analysts and data scientists do not have to copy data out of the storage layer over to an analytics environment and then copy it back, as you would with other distributed file systems.

In short, this capability provides superb interoperability for a wide range of applications, languages, and tools. That’s a huge advantage.

4) How can I coordinate data at the edge, in the cloud, and in my on-premises data centers?

Having a data fabric that spans your enterprise landscape from edge to core to cloud or multi-cloud lets you easily put data where you want it or access it from a remote location. HPE Ezmeral Data Fabric does this for you, as one unified system with the same security across these environments. 

With a uniform system across locations, a global namespace, and conventional data access, the HPE Ezmeral Data Fabric makes it easy to access data from a remote location. Application developers do not have to specify location at the application level; instead, they leverage the benefits of handling data logistics at the platform level. 

Data movement is handled by the data fabric in several ways. One way is via bi-directional replication of tables or event streams to different topologies within a cluster or across locations from edge to data center (on-premises or cloud). This data movement is very fast and efficient. 

Another way it handles data mobility is through efficient data mirroring. The data fabric uses a data management unit known as a volume. Volumes look and act like directories. They store files, tables, and event streams all together, as shown in the “ls” figure above. Volumes also are the basis for incremental mirroring between data centers or from edge to core. Once a mirrored volume is established, updating the mirror only requires that the incremental data that has changed be moved, making it fast and affordable. We saw an example of mirroring using the data fabric in the first figure that outlined data movement from field station to data center for a fleet of autonomous cars. 
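The incremental part of mirroring is what makes it affordable: after the initial full copy, only data that has changed since the last sync needs to move. Here is a conceptual sketch of that idea in Python, comparing modification times and copying only new or newer files. This illustrates the principle, not the data fabric’s internal mechanism (which operates at the volume and block level, not file by file).

```python
import os
import shutil
import tempfile

def incremental_mirror(src, dst):
    """Copy only files that are new, or newer than the mirror's copy."""
    copied = []
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
            shutil.copy2(s, d)  # copy2 preserves the modification time
            copied.append(name)
    return copied

src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("v1")

first = incremental_mirror(src, dst)   # initial sync copies everything
second = incremental_mirror(src, dst)  # nothing changed: nothing moves
print(first, second)  # prints "['a.txt'] []"
```

Because an unchanged mirror transfers nothing, frequent mirror updates stay fast and cheap even when the mirrored data set is very large.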

Keep in mind that IoT sensor data used by big industries, such as automotive, oil and gas, transportation, or telecommunications, is not the only reason people need edge capabilities. Your business happens at the edge -- where your customer is -- whether that is the point of transaction at a physical retail or banking site or on a website.


As demonstrated in the diagram above, data from credit card transactions is collected on a regional cluster running the HPE Ezmeral Data Fabric and moved (usually by mirroring) to a core data center. This may be on premises, in the cloud, or both. 

Increasingly, businesses are finding it useful to leverage a hybrid cloud plus on-premises architecture as the best way to optimize resources while dealing with critical and spiking workloads. HPE Ezmeral Data Fabric gives you a uniform system across these environments, which is not only more convenient and less prone to errors but also protects you from vendor lock-in. You choose where your applications run and make sure they have access to the data they need. Mirroring, table and stream replication, a global namespace, and automated data tiering all make the data fabric beautifully suited to span a hybrid cloud design.

Mirroring is also used to maintain a disaster recovery twin. The twin is a fully functional cluster that also can be used as a sandbox for model development or experimental projects since it will have a complete copy of data. 

5) How can I solve the biggest problem for AI / ML systems?

Keeping up with the latest algorithm or state-of-the-art tools for AI and machine learning is exciting, but it doesn’t guarantee success in a learning system when you need to meet practical business goals. The biggest problem with AI -- hence the biggest advantage when you address it -- is handling data logistics in an efficient and reproducible manner. A recent IDC study reported 33% to 38% improvement in productivity for data scientists who used the HPE Ezmeral Data Fabric (then called MapR Data Platform). 

The interoperability of the HPE Ezmeral Data Fabric is one key to making AI/ML development more efficient: data scientists don’t have to copy data out of the large-scale storage environment to a separate analytics space to do feature extraction, model training, and tuning. The data fabric provides direct data access for leading AI/ML tools and data preparation tools, from Apache Spark to Python, R, Java, TensorFlow, H2O, PyTorch, StanfordNLP -- whatever your data scientists need to use. You don’t have to build a separate system for them.

Another way the data fabric helps with data logistics is data versioning. Data scientists must preserve exactly the data used for training and be able to access it in the future. HPE Ezmeral Data Fabric provides true point-in-time (non-leaky) snapshots based on data volumes as an excellent way to handle data versioning. 
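To make the versioning idea concrete, here is an illustrative sketch of point-in-time data versioning. The data fabric does this with volume-level snapshots, without copying data; this sketch simply copies a directory so the concept is runnable anywhere, and the file names are invented for the example.

```python
import os
import shutil
import tempfile

# A "live" training data set that will keep changing over time.
data = tempfile.mkdtemp()
with open(os.path.join(data, "train.csv"), "w") as f:
    f.write("x,y\n1,2\n")

# Freeze a point-in-time version before training (a real data-fabric
# snapshot does this instantly, with no physical copy).
snap = data + "-snapshot-v1"
shutil.copytree(data, snap)

# Live data keeps changing after the model is trained...
with open(os.path.join(data, "train.csv"), "a") as f:
    f.write("3,4\n")

# ...but the snapshot still reflects exactly what the model saw.
with open(os.path.join(snap, "train.csv")) as f:
    frozen = f.read()
print(frozen)  # prints the original two lines: "x,y" and "1,2"
```

With true point-in-time snapshots, a model’s training data can be reproduced exactly months later, which is essential for auditing and for debugging model behavior.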

6) How do I handle the data needs of containerized applications?

Not surprisingly, containerization of applications is a growing trend given the convenience, flexibility, and predictability it provides for running applications in customized environments, where you need them. Dealing with containers requires an orchestration framework, such as the incredibly popular Kubernetes. But containerized applications also need access to a data layer, as input or as a way to persist state, without losing container flexibility. That’s where the right data fabric supplies the answer. HPE Ezmeral Data Fabric orchestrates data in a way that complements the orchestration of containerized computation. It’s like Kubernetes for data. 

HPE recently announced general availability of the new HPE Ezmeral Container Platform, a software platform that makes it easier to work with containerized applications. The HPE Ezmeral Data Fabric provides the pre-integrated persistent storage layer of the container platform. That combination, together with HPE Ezmeral MLOps, gives developers and data scientists a big advantage as they leverage containerization on-premises or in the cloud. 

HPE Discover 2020 on-demand sessions available now!

HPE Discover 2020 is a free, virtual event that started on June 23rd and offered a rich collection of live and on-demand sessions and demos. The sessions are now available on demand: visit the HPE Discover HPE Ezmeral link to view the entire lineup of HPE Ezmeral sessions and demos.

Ellen Friedman

Hewlett Packard Enterprise








About the Author


Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Prior to her current role at HPE, Ellen worked at MapR Technologies for seven years. She is a committer for the Apache Drill and Apache Mahout open source projects and a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.