HPE Ezmeral: Uncut
Ellen_Friedman

HPE Ezmeral Data Fabric: A sneak peek at what’s coming in 6.2

Something very good just got even better.

A new version of the HPE Ezmeral Data Fabric is slated for release soon, and I'm excited to reveal some features and capabilities users will find particularly valuable. This blog post highlights a sampling of what’s coming in version 6.2. To appreciate what’s new, let’s first take a quick look at what the data fabric already is, for those not yet familiar with it.

What is HPE Ezmeral Data Fabric?

HPE Ezmeral Data Fabric, part of the HPE Ezmeral software portfolio, makes use of the innovative technology originally developed as the MapR Data Platform (MapR Technologies was acquired by HPE in 2019). As a unifying software technology, the HPE Ezmeral Data Fabric provides highly scalable data storage, access, management, and movement across your enterprise from edge to cloud, all within the same security system and with superb reliability. In addition to running on bare metal, the data fabric also provides the data layer of the new HPE Ezmeral Container Platform, where it serves as pre-integrated persistent container storage with Kubernetes.

HPE Ezmeral Data Fabric (formerly MapR Data Platform) is the best-in-class data fabric for AI / ML and data analytics from edge to cloud, and it provides the integrated data layer for the HPE Ezmeral Container Platform and HPE Ezmeral ML Ops.

In short, the HPE Ezmeral Data Fabric lets you run the right application at the right time on the right data. To do this, the data fabric uses a global namespace and provides direct access to data via standard APIs. A wide variety of tools can access the data fabric for processing and analyzing large-scale data or for developing and deploying AI and machine learning models. Containerized and non-containerized applications can access and store data with high performance in the data fabric as files, tables, or event streams. The efficient, incremental mirroring capability of the HPE Ezmeral Data Fabric makes it easy to move data within or between clusters, including geo-distributed locations. Data can also be moved quickly and efficiently via bi-directional, multi-master replication of tables or event streams.

And with the upcoming release, the HPE engineering team has extended this excellent foundation to provide new capabilities described below.

New Features in 6.2

HPE Ezmeral Data Fabric is excellent, but it is HPE’s goal to continue to improve this technology. The capabilities in the 6.2 release are ones that customers have been asking for, including the four valuable new features highlighted in this blog post:

  • Snapshot restore
  • Fine-grained data placement for heterogeneous nodes
  • Policy-based security for better data governance
  • Last access tracking

In addition, the 6.2 release has ongoing improvements for data access by containerized applications, even better resilience at scale through FS supportability and GFSCK enhancements, erasure coding performance and fast rebuild, better management via FS metrics, and security advancements through external key management.

Let’s look at some of the key new capabilities in more detail.

Snapshot Restore

Data management in HPE Ezmeral Data Fabric is organized via data volumes, which are essentially management units that act like directories with superpowers. You can easily set control policy and initiate data mirroring at the volume level. The data fabric also gives you the ability to make true point-in-time snapshots of volumes, whether scripted, manually, or on a schedule. These data snapshots are particularly useful not only as a safeguard against human errors but also as a way to handle data versioning. The latter addresses a particular need in machine learning and AI.

With the upcoming 6.2 release, HPE Ezmeral Data Fabric also will provide full Snapshot Restore as an option. This very convenient capability is somewhat surprising for large-scale distributed systems where selective restoration is more the rule. Snapshot restore is good news for the system administrator and DevOps administrator because it means that you can restore the entire state of a volume at once. This can be important in case of catastrophic errors, but there are many uses in development and QA where you may need to reset to a precise known state before starting a test. With Snapshot Restore, doing that is about as easy as pushing a button. 
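To make the difference between whole-volume restore and selective restoration concrete, here is a minimal toy model in Python. It is only a sketch of the idea: the `Volume` class and its methods are hypothetical names invented for illustration, not the data fabric's API, and a real volume is distributed across a cluster rather than held in a dictionary.

```python
import copy


class Volume:
    """Toy stand-in for a data volume (hypothetical, for illustration only)."""

    def __init__(self):
        self.files = {}      # path -> contents
        self.snapshots = {}  # snapshot name -> frozen copy of all files

    def snapshot(self, name):
        # Capture the complete state of the volume at this moment.
        self.snapshots[name] = copy.deepcopy(self.files)

    def restore(self, name):
        # Replace the entire current state with the snapshot. Files changed
        # after the snapshot are reverted, and files created after it vanish.
        self.files = copy.deepcopy(self.snapshots[name])


vol = Volume()
vol.files["/data/train.csv"] = "v1"
vol.snapshot("before-test")

# A test run modifies one file and leaves a scratch file behind.
vol.files["/data/train.csv"] = "corrupted"
vol.files["/data/tmp.out"] = "scratch"

# One operation resets the volume to the precise known state.
vol.restore("before-test")
```

With selective restoration you would have to find and copy back each affected file yourself, and remember to delete the scratch file. Restoring the whole volume handles both in one step.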

Fine-grained data placement for heterogeneous nodes

A short sub-header like this one does not do full justice to the potential impact of this new feature, which makes data placement even more fine-grained so that you can optimize resource use and improve performance. Here’s how.

If you have nodes with a mix of different kinds of storage, such as both fast, solid-state storage (SSDs) and slower spinning media (HDDs) in the same machine, you may want to control which data is located on which type of device. Otherwise, you’re not getting the full advantage of the heterogeneous nodes.

HPE Ezmeral Data Fabric has always enabled you to conveniently place data down to the level of individual machines. The changes in 6.2 extend your choices to the level of different storage devices within machines. The following diagram shows how you would apply labels for fine-grained data placement and why this is an advantage for optimizing both usage efficiency for storage resources and desired performance.

[Figure: applying labels for fine-grained data placement in HPE Ezmeral Data Fabric]

The new fine-grained data placement in HPE Ezmeral Data Fabric makes it possible to match application data latency requirements with media type, simply by applying the appropriate label. This means latency-sensitive data can be persisted in faster media, while bulk analytical data can be placed on slower media, making the most of your resource usage and delivering fast performance as needed.

Another advantage of the new granularity of data placement down to the level of storage pools is even better performance. To understand why, you’ll need a little background information. HPE Ezmeral Data Fabric uses a large unit of data storage, known as a data fabric container (not to be confused with a Kubernetes container, despite the similarity in the name) as the unit of replication. Data fabric containers and their replicas will automatically be placed according to the data placement policies you apply. 

The data fabric also has a special container, known as a name container, which holds metadata for the files, directories, tables, and event streams for the volume. The name container is an inherent strength of the HPE Ezmeral Data Fabric design that is well known to those familiar with the MapR Data Platform because it provides a way for metadata to be distributed across a cluster, resulting in extreme reliability and high performance.

With the finer granularity for data placement that is new in 6.2, data fabric containers and their replicas can have one placement policy while the name container can have a different policy. As the previous figure shows, you can apply a label “Warm” to position data for bulk workloads on storage pools with slower devices while maintaining the metadata for that volume on fast solid-state devices by applying the label “Hot” to the name container. 

Keep in mind that data may be assigned labels in order to address multiple goals. For example, you could assign a data volume a topology label to control its placement relative to the failure domains in your cluster and also assign the same data volume a label that causes it to be placed on a particular type of storage media (SSD or HDD). Data placement would then have to satisfy both requirements. This lets you place data according to the desired “temperature” of storage device type in a manner that is complementary to the function of existing node topology. In other words, it sets you up to easily express cross-cutting data placement policies.
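One way to picture how the two kinds of constraint combine is as a filter over storage pools. The Python below is a rough sketch under stated assumptions: `StoragePool`, `eligible_pools`, the pool names, and the rack paths are all made up for illustration and do not reflect the product's actual interfaces. Only the “Hot” and “Warm” labels come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePool:
    name: str
    topology: str  # hypothetical failure-domain path, e.g. "/rack1/node1"
    label: str     # "Hot" for fast SSD pools, "Warm" for slower HDD pools


def eligible_pools(pools, topology_prefix, label):
    """Return the pools that satisfy BOTH the topology and the label constraint."""
    return [
        p for p in pools
        if p.topology.startswith(topology_prefix) and p.label == label
    ]


# Two heterogeneous nodes, each with one SSD pool and one HDD pool.
pools = [
    StoragePool("sp1", "/rack1/node1", "Hot"),
    StoragePool("sp2", "/rack1/node1", "Warm"),
    StoragePool("sp3", "/rack2/node2", "Hot"),
    StoragePool("sp4", "/rack2/node2", "Warm"),
]

# Bulk data for the volume: confined to rack1 and to slower media.
data_pools = eligible_pools(pools, "/rack1/", "Warm")

# The volume's name container: same topology, but on fast media.
name_pools = eligible_pools(pools, "/rack1/", "Hot")
```

Because the two conditions are simply combined, the topology constraint and the media label stay independent of each other. That independence is what makes the policies cross-cutting.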

Policy-based security for better data governance

Another big change for 6.2 is the policy-based security capability that makes data governance better and extends the already excellent out-of-the-box platform-level security of the data fabric. Policy-based security is an entirely new capability in 6.2 that scales your control over data access. The basic idea is that you can define a policy as an access control expression and then apply that policy to data that you keep in the data fabric. Later, if that policy needs to change, you can change it centrally, and the system will enforce access control to all data accordingly without the necessity of updating the permissions on millions, billions, or even trillions of files. This is a huge benefit for large-scale data systems.

New policy-based security in HPE Ezmeral Data Fabric 6.2 makes the uniform, platform-level security even easier to apply.

The result is that you can do a much better and more efficient job of maintaining consistency of data governance even in massive systems. Policy-based security also helps separate the concerns of what a policy should be from the concerns of which data should be subject to a policy. In addition to the benefits of better uniformity and easier update, this separation can actually foster better policies as well as better compliance.
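The separation between what a policy says and which data it covers can be sketched in a few lines of Python. This is an illustration only: the policy name `pii`, the paths, the group names, and the use of Python functions as access control expressions are all assumptions made for the example, not the data fabric's syntax or API.

```python
# Central policy store: policy name -> access control expression.
# Here an expression is modeled as a function from a user to True/False.
policies = {
    "pii": lambda user: "compliance" in user["groups"],
}

# Each piece of data carries only the NAME of its policy, never the rule itself.
file_policy = {f"/customers/part-{i}": "pii" for i in range(1000)}


def can_read(user, path):
    """Look up the policy attached to the path and evaluate it for this user."""
    return policies[file_policy[path]](user)


alice = {"name": "alice", "groups": {"analysts"}}
before = can_read(alice, "/customers/part-7")

# One central change to the policy definition; none of the 1000 files is touched.
policies["pii"] = lambda user: bool(user["groups"] & {"compliance", "analysts"})
after = can_read(alice, "/customers/part-7")
```

In this sketch `before` is `False` and `after` is `True`, even though nothing in `file_policy` was modified. The same indirection is what lets a policy change take effect without rewriting permissions file by file.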

Last access tracking

The new capability for last access tracking is a boon for data governance. Traditionally, last access tracking (or “atime,” as system administrators tend to say) is disabled, especially in large systems, because it can cause metadata update storms. In the 6.2 release of the HPE Ezmeral Data Fabric, last access tracking can be enabled in a way that does not impair performance. You define a granularity for updating atime that limits the number of updates to one per time window, which avoids the metadata storms while preserving the knowledge needed for governance.
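A small simulation shows why a granularity setting tames the update storm. The Python below is a sketch under assumptions: `AtimeTracker` and its fields are hypothetical, and it uses one plausible interpretation of "one update per time window" (skip the write unless at least one full window has passed since the last recorded access). The product's exact rule may differ.

```python
class AtimeTracker:
    """Hypothetical model of coarse-grained last access tracking."""

    def __init__(self, granularity_seconds):
        self.granularity = granularity_seconds
        self.atime = {}           # path -> last RECORDED access time
        self.metadata_writes = 0  # how many metadata updates were issued

    def record_access(self, path, now):
        last = self.atime.get(path)
        # Write only on first access or once a full window has elapsed.
        if last is None or now - last >= self.granularity:
            self.atime[path] = now
            self.metadata_writes += 1
            return True
        return False  # access happened, but no metadata update was needed


tracker = AtimeTracker(granularity_seconds=3600)

# A hot file is read every 10 seconds for two hours: 720 reads in total.
for second in range(0, 7200, 10):
    tracker.record_access("/logs/app.log", now=second)
```

In this run the 720 reads cause only 2 metadata writes (`tracker.metadata_writes`), and the recorded atime is never more than one window out of date. That precision is usually enough for governance questions such as whether data has been touched in the last 90 days.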

Next Steps

For more information, visit this link for a full lineup of on-demand HPE Ezmeral sessions and demos from the HPE Discover Virtual Experience.

You can also watch this on-demand recorded interview with Ted Dunning (CTO for HPE Ezmeral Data Fabric) from the HPE Discover Virtual Experience last month: Why a Data Fabric is Critical.

And for more information about HPE Ezmeral Data Fabric: 

 

Ellen Friedman

Hewlett Packard Enterprise

www.hpe.com/containerplatform

www.hpe.com/mlops

www.hpe.com/datafabric

 

 

About the Author

Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Before her current role at HPE, Ellen worked for seven years at MapR Technologies, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.