HPE Ezmeral: Uncut

Data versioning for data science: data fabric snapshots improve data and model management

Dat-Version-HPEEzmeral-Data-Fabric2.pngIn machine learning and AI, data makes the model. And it’s a great thing when you’ve selected the right training data and tuned your algorithms well to get the model performance you need for the target situation. Success!

But can you do it again? What happens when you need the exact same training data used to train a particular model at a particular point in time – and you need to identify it among hundreds of models you have stored? Because at some point, you will need to be able to find and reuse a specific set of training data if you are to continue to succeed.

The ability to keep accurate point-in-time versions of data is particularly important in AI and machine learning systems, but it’s not limited to those projects. This key capability is also useful for analytics in general.

Is it possible to effectively deal with data versioning in a complex system at large scale built across multiple locations? The answer is yes.

Why data versioning matters

One reason data version versioning matters is because machine learning and AI are iterative processes. Code is written for a new model, and then the model is trained by exposure to carefully selected training data. The nature of the resulting model literally depends on the data used to train it as well as the learning algorithm and parameter settings. Model performance is evaluated and, to get the desired performance, often it’s necessary to tweak the learning algorithm or adjust training data and then carry out the training process again and again.  

Even after a model is deployed in production, the process continues. Ongoing monitoring of model behavior may trigger the need to start over and train new models. The world constantly changes, so the fitness of a trained model may also change, requiring retraining on new data, with new settings or may require an entirely new model. But if everything is changing, how can you tell what is working and what is not? Being able to access exactly the data set used to train a particular model is also important for accurate model-to-model comparisons. This ongoing process of training - evaluation - adjustment - retraining or development of new models is depicted in Figure 1. 


Figure 1. Machine learning is an iterative process for which training data is key

To be successful, data scientists need to easily and accurately find the exact data seen by a particular model or outputs for a specific iteration of the system so they can control what changes in their tests. Think of it this way: Data versioning is the complement to version control for code, a familiar concept. A tool such as Github may be used to keep track of different versions of code and who has forked a new version, as shown in Figure 2. 

Version-control-for-code.HPE-Ezmeral.pngFigure 2. Familiar process of version control for code

Just as version control is essential for code, the same applies to data. Data version control can be challenging even at moderate scale, in part because so many models are involved, but the challenge of data versioning becomes much greater in large-scale, complex systems. 

Data fabric snapshots: built-in data versioning at scale

Data infrastructure designed for unified analytics and data science in large scale systems should provide efficient built-in data versioning. HPE Ezmeral Data Fabric File and Object Store is a software-defined solution engineered to handle data storage, management, and motion in a unified layer from edge to data center, on-premises, or in cloud.

Data version control is efficient and convenient using data fabric snapshots because they. provide an accurate point-in-time version of data, even if that data is enormous. Snapshots in the data fabric are done at the level of the data fabric volume, which is like a directory, but with management handles on it. The data fabric volume is your superpower for data management. Files, directories, tables, objects, and event streams are all engineered as part of the data fabric and are stored and managed using data volumes. 

Data fabric volumes, by default, spread across an entire data fabric cluster for better data safety and performance, as shown in Figure 3. Volumes are represented in the figure by large transparent triangles containing multiple data replicas depicted as colored hexagons. 

Data-Fabric Volumnes-HPE-Ezmeral.pngFigure 3. Data fabric volumes span multiple machines in a cluster

Snapshots of each volume can be created manually or on a schedule, making them ideal for preserving versions of  data. Having this repeatable process is essential to build and train models successfully over time as well as to serve as data protection against human or machine errors that could unintentionally delete or alter data. In a way, the data fabric snapshots serve as a time machine for data, as suggested by Figure 4 (green triangle is the original data fabric volume, and each grey triangle represents a different snapshot made at different points in time.)

Data-Fabric-Snapshot-Preserve-HPE-Ezmeral.pngFigure 4. Data fabric snapshots preserve exact versions of data at different points in time

Making snapshots is easy and takes less time than forking a repository on Github, so they don’t impose a penalty on time budgeted for machine learning/AI projects. 

Efficient data version control in a large-scale system solves a major challenge for data science and unified analytics, but deploying models is also a challenge. Being able to control the deployment of trained models with accuracy and consistency, even across multiple locations including edge, is also key to success. Once again, data fabric volumes provide a built-in solution.

Mirroring: deploy models to multiple locations with confidence

Data fabric mirroring is an efficient way to move any kind of data, including code or executable programs. This capability makes mirroring ideal to deploy the exact same model to multiple locations, a process that often is important for systems with edge computing. 

In addition, mirrors use snapshots internally so when you mirror a data fabric volume, changes to the entire volume are applied atomically. No partial changes are ever visible. As a result, coordinated changes across multiple files will stay coordinated at the mirror destination. This tight coordination is particularly useful in deploying models such that the model code and configuration files can be kept consistent as new models are deployed.

Successful data and model management

Keeping track of exactly what data was used to train each successful model can be challenging when working at large scale. Data fabric snapshots and mirroring make data version control and model deployment much more predictable. And that, in turn, makes success more likely to be repeatable.

For more on working with data at large scale for AI/machine learning and analytics:

Watch the video replays of presentations from HPE Data World December 2021

Read the blog post “Escape the maze: manage large-scale complex systems with flexibility and simplicity” 

Read the blog post “Dataspaces: how an open metadata layer can establish a trustworthy data pipleline

Download free pdf of O’Reilly ebook AI and Analytics at Scale: Lessons from Real-World Production Systems © 2021 by Ted Dunning and Ellen Friedman

Ellen Friedman

Hewlett Packard Enterprise




0 Kudos
About the Author


Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.