AI Unlocked
Ellen Friedman

Data locality at different scales: The value of fine-grained data placement

HPE-Ezmeral-data-locality-at-scale.png

Location, location, location! That’s not just a key to value in real estate. Locality also makes a big difference to the value of data in distributed systems. Here’s why.

Impact of data locality in large-scale systems

An obvious reason to control data placement in large-scale systems is for safety. Spreading data replicas across different racks in a cluster, for instance, puts redundant copies into different failure domains. If one rack is damaged or fails, data replicas in other domains keep the data safe and the system operational.

But data locality can also have an impact on both performance and efficiency of resource usage. Workloads differ in their requirements for computational power and latency. For example, in AI and machine learning projects, data latency and computational requirements are usually quite different at different stages in the lifetime of a model. The actual learning process, when models are trained, tends to be compute-intensive compared to running models in production. Training also requires high-throughput, low-latency data access: it won’t do to make a fancy GPU wait for data.

To achieve high performance for training AI models or other compute-intensive applications, many people make use of high performance computing (HPC) machines with specialized numerical accelerators such as graphics processing units (GPUs). How, then, do you support GPUs and other accelerators from a data point of view?

Often, the solution in large systems is to give dedicated HPC machines high-performance storage, such as solid-state drives (SSDs) or NVMe devices. You can then provision regular machines with slower, spinning media (HDDs), capable of handling large amounts of data storage at low cost. This type of large-scale cluster is depicted in Figure 1.

figure-1.png

Figure 1. Large cluster containing a combination of dedicated, fast-compute/fast storage nodes (orange) and regular nodes/slower storage devices (green) 

In the figure, orange squares represent SSDs and orange lines represent machines with computational accelerators (such as GPUs). Green cylinders stand for slower spinning storage media (HDDs) and servers with green lines indicate traditional CPUs. In a typical machine learning/AI scenario, raw data is ingested on the non-HPC machines, where data exploration and feature extraction would take place on very large amounts of raw data. In a scale-efficient system, bulk analytic workloads, such as monthly billing, would also take place on the non-HPC (green) machines. 

Once feature extraction is complete, training data is written to fast storage machines (orange) with SSDs and GPUs, ready to support the model training process. Other compute-intensive applications, such as simulations, can also run on the fast machines. 

Controlling where different data resides improves performance and efficiency of resource usage. But to make this feasible in a large-scale, multi-application system, you need data infrastructure designed to let you specify which data is placed on which machines. One way to do this is with HPE Ezmeral Data Fabric, a highly scalable, unifying data infrastructure engineered for data storage, management, and motion. Data fabric is a software-defined, hardware-agnostic solution that lets you conveniently position data at the level of different racks, machines, or even different storage types within a machine.
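As a concrete sketch, rack-level placement is typically expressed when a volume is created. The volume name, mount path, and topology path below are hypothetical examples, and the exact flags should be verified against the maprcli documentation:

```shell
# Hypothetical example: create a data fabric volume whose replicas are
# restricted to the machines under the /data/rack1 topology.
maprcli volume create -name raw.ingest -path /raw/ingest -topology /data/rack1
```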

 

HPE Ezmeral Data Fabric lets you easily match application data throughput and latency requirements with the appropriate type of storage media.

 

The advantages of convenient data placement are not limited to AI/machine learning applications. It is a key capability needed for high performance at reasonable cost in a scale-efficient system. With a large enough system and a data infrastructure to help you implement data locality, you have the option of specializing different nodes for various types of tasks (nodes with fast storage devices and specialized accelerators for high performance computing; nodes with slower spinning storage and traditional CPUs for bulk workloads). But what about medium-to-small-scale systems that often do not have the luxury of dedicating entire nodes to special storage device types?

Medium-to-small-scale systems need finer-grained data placement

In smaller-scale systems, you could meet your need for high performance computing by employing some heterogeneous machines: nodes with fast-compute capabilities but a mix of different kinds of data storage devices rather than just SSDs. This arrangement is shown in Figure 2.

figure-2.png

Figure 2. Small cluster containing fast-compute nodes (orange) with a mixture of SSDs (orange squares) plus slower HDDs (green cylinders), and regular nodes with HDDs only.

In this case, you would still want to control data locality. This way you can match application requirements for data throughput and latency with the appropriate type and amount of storage media, just as you would for large systems with dedicated special-storage nodes. Otherwise, you would not be getting the full advantage of the heterogeneous machines. Fortunately, HPE Ezmeral Data Fabric version 6.2 lets you use storage labels to do just that.

Here’s how it works. 

Fine-grained data locality with HPE Ezmeral Data Fabric

HPE Ezmeral Data Fabric has always enabled you to configure data placement down to the level of individual machines; with the addition of the storage labels feature, you can extend that control to different types of storage devices within a machine. Figure 3 shows how you would apply storage labels for fine-grained data placement with the data fabric. It also shows how this optimizes both the usage efficiency of storage resources and performance.

figure-3.png

Figure 3. Using the storage labels feature of HPE Ezmeral Data Fabric for differential data placement on particular types of storage devices at the sub-machine level

In the above figure, the term volume refers to data fabric volumes (data management units holding files, directories, NoSQL tables, and event streams all together) that act like directories with superpowers for data management. Many policies, including data placement, are assigned at the volume level.
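For illustration, a volume can carry a storage label so that its data lands only on matching devices. The names below are hypothetical, and the exact syntax is an assumption patterned on the maprcli interface; consult the HPE Ezmeral Data Fabric 6.2 documentation for the precise commands:

```shell
# Hypothetical example: place a training-feature volume on storage
# pools labeled "ssd", wherever in the cluster those devices live.
maprcli volume create -name train.features -path /train/features -label ssd
```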

Cross-cutting requirements for data placement

What happens when you have more than one goal for the placement of particular data? HPE Ezmeral Data Fabric takes care of that automatically: a single data volume can meet multiple requirements at once. In the example shown in Figure 1, a data fabric volume would be distributed across multiple machines on multiple racks, within topologies designated as different failure domains. But that same volume must also meet any additional requirements imposed by storage labels, such as a label that places it on a particular type of storage media (SSD or HDD). In other words, data fabric lets you easily express cross-cutting data placement policies.
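To make the composition of policies concrete, here is a small conceptual sketch (not HPE code; all names are illustrative): a candidate location is valid only if it satisfies both the volume’s topology constraint and its storage label.

```python
# Conceptual sketch of cross-cutting placement policies: a disk qualifies
# for a volume's data only if it satisfies EVERY constraint attached to
# the volume -- topology (failure domain) AND storage label (media type).
from dataclasses import dataclass

@dataclass(frozen=True)
class Disk:
    node: str
    topology: str   # e.g. "/data/rack1"
    label: str      # e.g. "ssd" or "hdd"

def candidates(disks, topology_prefix, label):
    """Return the disks satisfying both the topology and the storage label."""
    return [d for d in disks
            if d.topology.startswith(topology_prefix) and d.label == label]

disks = [
    Disk("node1", "/data/rack1", "ssd"),
    Disk("node1", "/data/rack1", "hdd"),
    Disk("node2", "/data/rack2", "ssd"),
    Disk("node3", "/data/rack2", "hdd"),
]

# A training-data volume: keep replicas in rack1's failure domain, on SSD only.
print(candidates(disks, "/data/rack1", "ssd"))
```

Widening the topology to all of `/data` while keeping the `ssd` label would admit node2’s SSD as well, which is exactly the cross-cutting behavior described above.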

Taking advantage of the best location

Large or small, your system provides the best value when you take advantage of the best location for data. The convenient-but-fine-grained data placement provided by HPE Ezmeral Data Fabric, from assigning a topology to a data fabric volume to applying storage labels that place data on the type of storage media where it will have the most value, makes it easy for you to make the most of location, location, location!

To find out more about how HPE Ezmeral Data Fabric serves as a unifying data infrastructure supporting multiple large-scale analytics and AI applications from edge to data center, on-premises and in the cloud, read the solution brief “HPE Ezmeral Data Fabric: Modern management for your data-driven enterprise” or watch this short animated video about data fabric.

To explore additional “superpowers” of data fabric volumes, read the blog post “Business continuity at large scale: data fabric snapshots are surprisingly efficient”.

HPE Ezmeral Data Fabric also serves as data persistence for containerized applications. Find out more in this description of the HPE Ezmeral Container Platform.

Ellen Friedman

Hewlett Packard Enterprise

www.hpe.com/containerplatform

www.hpe.com/mlops

www.hpe.com/datafabric

 

 

 

About the Author


Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Prior to her current role at HPE, Ellen worked at MapR Technologies for seven years and was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.