The easy button for data management: Data fabric volumes
One size DOESN’T fit all when it comes to making large-scale data systems easy to manage. Learn how to get the perfect fit!
With the increasing complexity and challenges of large-scale data systems, it’s no wonder many people are looking for the easy button. But does easy really get you what you need?
The answer can be yes. But only if you have an effective overall data strategy along with data infrastructure that delivers the right capabilities. Otherwise, the danger is that an easy approach ends up as a poor fit. Despite the claims made on labels of some clothing or tools, one size rarely fits all.
The same one-size-DOESN’T-fit-all issue also applies to running a large-scale system. Easy is good, but a convenient approach must also meet the challenges of modern data systems.
Dealing with data at scale: some challenges are unexpected
With a large-scale system, people naturally look for ways to optimize resource usage to keep costs down. They also want data to be protected and a system that is reliable and fast enough to meet performance requirements.
Less expected challenges also show up at large scale. How will you deal with data storage and management in complex systems that serve multiple users and different types of applications? How will you make data available on premises, in the cloud, or across many locations (in the case of edge deployments)? And how will you adapt to change? It’s one thing to deal with the current challenges of scale, but you also need the capability to accommodate growing amounts of data without explosive costs. Scalability also means handling new types of data, new applications, and new locations without having to re-architect your system.
Effective management at large scale requires simplification. Otherwise, the system becomes cumbersome, and the burden on IT resources becomes overwhelming.
Read the blog post “The case for radical simplification in data infrastructure”
That’s where the easy button comes in. But be wary of making things easy by giving up capabilities you need for the right fit to your situation. You shouldn’t have to pay for convenience by sacrificing flexibility to work with your tools of choice, by limiting performance, or by giving up fine-tuned data management and security capabilities.
How do you meet expected and unexpected challenges of large scale without having to compromise convenience or precision?
Configured, not coded: a key to scale-efficiency
A key aspect of fine-tuned but convenient data management at scale is data infrastructure that lets fundamental actions be configured instead of having to be coded into each application. Simplification such as this is the ultimate sophistication.
Consider, for example, the need for efficient data motion at scale in IoT systems with many edge data sources. Edge issues are not limited to large industrial IoT sectors such as telecommunications, manufacturing, oil and gas exploration, or autonomous car development. Edge is where your business happens – where your customer is. Simon Wilson, CTO at Aruba, says, “What edge means to organizations is their ability to extend their business, both for employees and also for their customers, to wherever they need to be.” Many organizations need their edge to be at the point of transactions, whether in physical retail or banking sites or on their websites, as illustrated in Figure 1.
Figure 1: Data from credit card transactions collected on a regional cluster and mirrored to data centers
Data motion via built-in, incremental mirroring rather than mirroring coded individually into each application is not only fast, efficient, and affordable; it’s also less prone to errors. Handling key actions at the data platform level also frees developers and data scientists to focus on the goals of their applications rather than spending their time budget on data logistics.
Read the free ebook AI and Analytics at Scale: Lessons from Real-World Production Systems
Control needs to sit at just the right level: practical in a large-scale system, yet still able to differentially manage many aspects of the data lifecycle, including who has (and does not have) access. In other words, what you need is a superpower for data management.
Superpower for data management: Data fabric volumes
HPE Ezmeral Data Fabric File and Object Store is software-defined, highly scalable data infrastructure for storing and managing data with convenience plus precision. It can do this in part because of its unique superpower for data management: a key construct known as the data fabric volume. A data fabric volume is like a directory but with surprising capabilities. For example, built-in mirroring from edge to data center or between data centers is carried out using data fabric volumes. This data motion is illustrated in Figure 2.
Figure 2: Data fabric volumes (triangles) are the basis for major data motion via mirroring
In data fabric mirroring, data from a source volume (green triangle in Figure 2) is copied by moving incremental updates via a point-in-time snapshot to a destination volume (orange triangle) on a cluster at another location, either on premises or in the cloud.
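To give a feel for “configured, not coded,” here is a minimal sketch of how a mirror volume might be set up and synchronized using the data fabric’s maprcli command-line tool. The cluster, volume, and path names are assumptions made for illustration; check the HPE Ezmeral Data Fabric documentation for the exact options in your release.

```bash
# Illustrative sketch only -- cluster, volume, and path names are made up.
# On the destination cluster, create a mirror volume whose source is a
# volume on the edge cluster.
maprcli volume create \
  -name transactions-mirror \
  -path /mirrors/transactions \
  -type mirror \
  -source transactions@edge-cluster

# Start a mirror operation; only data changed since the last
# snapshot-based sync is transferred.
maprcli volume mirror start -name transactions-mirror
```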
But data motion is just one of many data challenges handled conveniently at the platform level in data fabric. I’ve written about many things you need to do to manage a large-scale data system in a highly performant and cost-efficient way. What these solutions share is that they are all mediated by data fabric volumes. (Please refer to the end of this blog post for a list of articles on how data infrastructure can address key challenges to support a scale-efficient system.)
How do data fabric volumes give you the perfect fit?
Data fabric volumes are radically different from volumes associated with a block storage device. A data fabric volume is essentially like an expandable bag or a file system directory: it doesn’t take up space until filled with data. In HPE Ezmeral Data Fabric, data is replicated for reliability, and data placement can be controlled with data fabric volumes: replicas (shown as colored hexagons in Figure 3) are automatically distributed across multiple machines in a cluster, putting them into different failure domains for safety.
Figure 3: Data fabric volumes (represented by transparent triangles) span multiple servers
Volumes are also the basis for configuring finely tuned data locality, such as placing specific data on special hardware like SSDs for optimal performance. Notice that in Figure 3 a third data fabric volume is shown on differently shaded machines. Storage labels make it possible to tune data placement even at the sub-machine level, and data fabric lets you handle cross-cutting requirements for data placement as well.
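As a hedged sketch of volume-level placement, the example below creates a volume whose replicas are restricted to a particular node topology (for instance, nodes with SSDs). The topology path, replication factor, and names are illustrative assumptions; storage labels are configured along similar lines but are omitted here.

```bash
# Illustrative sketch -- topology path and names are assumptions.
# Keep three replicas of this volume's data, and place them only on
# nodes assigned to the SSD topology.
maprcli volume create \
  -name hot-features \
  -path /data/hot-features \
  -replication 3 \
  -topology /data/ssd-nodes
```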
Optimizing resource usage is another important aspect of effective data management at large scale. Data fabric’s built-in data tiering, which uses a new approach to erasure coding, provides an easy way to do this at the platform level but with precision. In Figure 4, the dark triangle is a data volume, while the grey triangle is a shadow volume associated with it.
Figure 4: Erasure coding in the HPE Ezmeral Data Fabric
One advantage of this new approach to erasure coding is that applications read from files via the same pathname before and after data is erasure coded, regardless of the internal data format in the file. A second advantage of data fabric erasure coding is that policy is set at the high level of a data fabric volume for convenience and efficiency – think easy button – but is executed at the fine-grained level of individual files – think easy button with precision. Furthermore, actual erasure coding is delayed until triggered automatically by data usage patterns. As a result, the encoding process is not in the critical write path, so it doesn't affect workloads.
Resource optimization is also helped by data fabric volumes’ role in solving the often overlooked challenge of large-scale data deletion.
A basic but important aspect of managing large data systems is tracking accountability. Once again, data fabric volumes come to the rescue. Read the blog post “Who pays? Easy accountability in large scale systems” to find out how.
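For accountability, usage can be tracked and capped at the volume level. The hedged sketch below assigns a volume to an accountable entity and sets quotas; the entity name, volume name, and sizes are assumptions for illustration only.

```bash
# Illustrative sketch -- entity name, volume name, and sizes are assumptions.
# Charge this volume's usage to the "analytics" accountable entity and
# cap it at 10 TB, with an advisory warning at 8 TB.
maprcli volume create \
  -name fraud-model-workspace \
  -path /projects/fraud-model \
  -ae analytics \
  -quota 10T \
  -advisoryquota 8T

# Report space usage rolled up by accountable entity -- the basis for chargeback.
maprcli entity list
```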
Another area for which an effective easy button is needed is to manage fine-grained control over who has data access in complex, multi-tenant systems. Data fabric volumes play a role here as well. Differential data management via access control expressions (ACEs) and policy-based security can be applied via data fabric volumes. These approaches make it convenient even in very complex systems at large scale to adapt to changes when users are added or removed or when security strategies change.
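As a hedged sketch of how access control expressions can be applied at the volume level, the example below grants read access to one group or one user and restricts writes to another group; the user and group names are assumptions.

```bash
# Illustrative sketch -- user and group names are assumptions.
# Boolean access control expressions (ACEs) applied to the whole volume:
# members of "analysts" OR the user "auditor" may read; only "etl" may write.
maprcli volume modify \
  -name fraud-model-workspace \
  -readAce 'g:analysts | u:auditor' \
  -writeAce 'g:etl'
```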
Watch the video “HPE Ezmeral Data Fabric 6.2 Policy-Based Security”.
Data must be protected against everyday disruptions (the “fat thumb” effect). Data fabric snapshots based on volumes are remarkably effective as a data time machine to undo such errors. But data must also be protected against catastrophic events, whether natural disasters or human-caused threats such as ransomware attacks. A disaster recovery plan involves setting up a secondary data center at a distant location, and data fabric mirroring makes this fast and affordable. But the bigger question is, how long will recovery take? Downtime is expensive! With mirrored data fabric volumes, recovery is super fast.
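Here is a minimal sketch of the “data time machine” in practice, assuming illustrative volume, snapshot, and file names and the conventional /mapr mount point: a point-in-time snapshot is taken before a risky change, and a damaged file is later copied back from the volume’s read-only .snapshot directory.

```bash
# Illustrative sketch -- volume, snapshot, cluster, and file names are assumptions.
# Take a point-in-time snapshot before a risky change.
maprcli volume snapshot create \
  -volume fraud-model-workspace \
  -snapshotname before-schema-change

# Snapshots appear under the volume's read-only .snapshot directory, so a
# damaged file can be restored with ordinary file tools.
cp /mapr/my.cluster.com/projects/fraud-model/.snapshot/before-schema-change/model.cfg \
   /mapr/my.cluster.com/projects/fraud-model/model.cfg
```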
Read the blog post “The cost of data recovery: does your disaster recovery plan really work?”
Data fabric snapshots are also useful beyond protection from careless typing: they provide excellent data version control, which is particularly important for data scientists. And keep in mind that data fabric volumes hold files, tables, event streams, and directories all together, another boon for convenient data management.
Figure 5: Using the standard Linux tool “ls” to list the contents of an HPE Ezmeral Data Fabric volume
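The kind of listing Figure 5 illustrates might look something like the sketch below, assuming the data fabric is mounted (via NFS or the POSIX client) at the conventional /mapr mount point; the cluster name and paths are assumptions.

```bash
# Illustrative sketch -- cluster name and paths are assumptions.
# With the fabric mounted at /mapr, a volume's files, directories, tables,
# and event streams are all reachable with standard Linux tools.
ls /mapr/my.cluster.com/projects/fraud-model
cp /mapr/my.cluster.com/projects/fraud-model/features.csv /tmp/
```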
Easy button to give a tailor-made fit
Organizations are struggling to find an easy way to manage large-scale data systems. Unfortunately, a one-size-fits-all solution is NOT the answer. Yet a simple solution is available. Data fabric volumes turn out to be the easy button that delivers a custom-made, perfect fit for organizations with large-scale data needs.
To find out more, please visit the HPE Ezmeral Data Fabric File and Object Store webpage or watch the replay of the webinar “AI and Analytics at Scale: Lessons from real-world production systems” with Ted Dunning and Ellen Friedman.
To see what key capabilities for a scale-efficient system are mediated through data fabric volumes, read these blog posts:
- Data Locality at Different Scales: the value of fine-grained data placement
- Better Data Tiering: a new approach to erasure coding
- How to Discard Data: Solving the hidden challenge of large-scale data deletion
- Who Pays? Easy accountability in large-scale systems
- Better Approach to Major Data Motion: Efficient built-in mirroring with data fabric
- The Cost of Data Recovery: does your disaster recovery plan really work?
- Data Access Control via ACEs vs ACLs: The power of “AND” and “NOT”
- Security that scales: 3 ways policy-based security improves data governance
- Business Continuity at Large Scale: Data fabric snapshots are surprisingly effective
- Data versioning for data science: data fabric snapshots improve data and model management
Ellen Friedman
Hewlett Packard Enterprise
Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.