HPE Ezmeral: Uncut

Great engineering needs to be boring – at least to outsiders!

One of my first jobs was working on a laser fusion project at Los Alamos. It was exciting, as in giant lasers and high voltage kind of exciting.

In my most recent job at HPE, we build software that runs for years with no user-apparent bobbles or disruption. Even though hardware sometimes fails and bugs are found, users get a stable platform that performs the same way today as it did yesterday. That’s engineering—great engineering, and it also is exciting!

The key to making engineering valuable is to make it rock solid and reliable -- essentially invisible (and boring) to those who use the data. Let’s look at how HPE does this.

Hiding the exciting parts

Some people may think working on data systems all the way down to the bits is boring. Yet, the process of making complex computing technology invisible is actually astonishing. Keep in mind the incredible triumph of modern computing is not in intellectual achievement. Instead, it’s that almost nobody needs to know about most of it.

Think about that for a moment. This technology is amazing because it can be boringly, invisibly, incredibly reliable.

In computer science, the name for what makes this possible is separation of concerns. This separation applies not only between the engineering of data infrastructure and the use of it, but also to technologies that enable separation of concerns between different teams in your business. The efficient separation of concerns between developers and system administrators or between data scientists and IT teams is of huge importance to the practical use of data, especially at scale.

Consider this simple example. A system is built to ingest data from many locations for central analysis. This system naturally has code to deal with data ingestion and code to deal with data analytics. The implementation concerns of ingestion should be isolated to the ingestion code. In other words, the analysis code shouldn’t have to change when new ingestion locations are added. Similarly, the ingestion code shouldn’t need to change when new kinds of analysis are done. To function well, these different concerns should be separated as much as possible.
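
The ingestion example above can be sketched in a few lines of Python. This is only a minimal illustration, not code from any HPE product, and every name in it (`Record`, `factory_source`, `warehouse_source`, `ingest`, `mean_value`, `count_by_location`) is hypothetical. The one thing the two sides share is the shape of a record.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Record:
    """The only contract shared by ingestion and analysis (hypothetical)."""
    location: str
    value: float


# --- Ingestion side: each source only knows how to produce Records. ---
Source = Callable[[], Iterator[Record]]


def factory_source() -> Iterator[Record]:
    yield Record("factory", 3.0)
    yield Record("factory", 5.0)


def warehouse_source() -> Iterator[Record]:
    yield Record("warehouse", 10.0)


def ingest(sources: Iterable[Source]) -> List[Record]:
    """Collect records from every source; knows nothing about analysis."""
    return [record for source in sources for record in source()]


# --- Analysis side: works on Records; knows nothing about sources. ---
def mean_value(records: Iterable[Record]) -> float:
    values = [r.value for r in records]
    return sum(values) / len(values) if values else 0.0


def count_by_location(records: Iterable[Record]) -> Dict[str, int]:
    return dict(Counter(r.location for r in records))


if __name__ == "__main__":
    # Adding warehouse_source required no change to the analysis functions,
    # and adding count_by_location required no change to the sources.
    records = ingest([factory_source, warehouse_source])
    print(mean_value(records))
    print(count_by_location(records))
```

Adding a new location means writing one more source function and passing it to `ingest`. Adding a new kind of analysis means writing one more function over records. Neither change touches the other side.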

Furthermore, if the data infrastructure that supports these functions has been engineered well, it should be nearly invisible to both ingestion and analysis teams. This allows each team, including the infrastructure team, to focus on their own specialized work. The exciting stuff should be in what teams are building from data, not in how difficult it is to use the data technologies that make data and computation available.

Making data motion invisible

A good example of this kind of separation of concerns in more complicated situations is the way that data motion happens in the background with the large-scale file system I work on. Here’s why that matters in practical terms:

Data on one storage device needs to be on another in a variety of situations. This need for data motion might be because a user asked for data to be mirrored to another cluster. Alternatively, data motion could be required to recover from a disk, server, or network failure. New hardware might have been installed, or the read/write operations of one workload may have begun to collide with those of another.

Once you start moving data, especially when you are moving data for reasons most users don’t need to know about, you can’t let that data motion interfere with what users expect to be happening. This isn’t easy, since we’re often talking about a system with tens of thousands of read and write operations in flight at one time aimed at many thousands of disks on hundreds of servers.

To protect against interference, we extend some of the ideas used to avoid network congestion to the way our data infrastructure handles data motion. The data infrastructure monitors how long each message takes to make a round trip. Then the infrastructure software watches for interference and automatically adjusts for it. The really cool thing about this approach is that applying these ideas to globally optimize an ensemble of data transfer processes at scale turns out to be much more effective than current methods are for the more widely studied problem of optimizing a network link. The result is that available network and disk resources can be completely saturated with background transfers, yet operations with higher priority cut in almost instantaneously when needed. This approach works even when transfers involve substantial latency.
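
To make the idea concrete, here is a toy sketch of delay-based throttling in Python. It is not the algorithm used in the data fabric, whose details this post does not describe. It is a simple additive-increase, multiplicative-decrease controller in the spirit of delay-based network congestion control, and the class name, parameters, and numeric values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackgroundThrottle:
    """Toy controller for background transfer rate (all values hypothetical)."""
    rate: float = 10.0        # background operations allowed per interval
    min_rate: float = 1.0
    max_rate: float = 1000.0
    increase: float = 5.0     # additive step when the path looks clear
    backoff: float = 0.5      # multiplicative cut when interference is seen
    tolerance: float = 1.5    # RTT above tolerance * baseline means interference
    baseline_rtt: float = float("inf")

    def observe(self, rtt: float) -> float:
        """Update the allowed rate from one measured round-trip time."""
        # The smallest RTT seen so far approximates an uncontended round trip.
        self.baseline_rtt = min(self.baseline_rtt, rtt)
        if rtt > self.tolerance * self.baseline_rtt:
            # Round trips are slowing down: yield quickly to foreground work.
            self.rate = max(self.min_rate, self.rate * self.backoff)
        else:
            # No sign of contention: gently use more of the idle capacity.
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate


if __name__ == "__main__":
    throttle = BackgroundThrottle()
    for rtt in [1.0, 1.0, 3.0, 1.0]:
        print(rtt, throttle.observe(rtt))
```

The asymmetry is the point of the design: the background rate climbs slowly while round trips stay near the baseline and drops sharply as soon as they lengthen, so background transfers soak up idle capacity but give way quickly when higher-priority operations arrive.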

This is tremendously exciting stuff from the internal engineering viewpoint, but what really shakes the ground is that this can all happen without users having to turn any knobs or make any adjustments. The technology for moving data completely recedes from view. Most users don’t even need to know these transfers are happening except in rare situations where a great mass of hardware has suddenly failed. The application-level concerns of users are completely insulated from the systems-level concerns of system maintenance and repair. As a result, system performance is as predictable as it can be.

Keep in mind that automated control and prioritization of data transit is only one example of how data infrastructure can be designed and engineered to support separation of concerns on many levels. Well-engineered infrastructure automatically does what is needed while receding from view.

A practical solution

In the data infrastructure I work on, called HPE Ezmeral Data Fabric, we expose something called a data fabric volume to allow users to manage many aspects of security, compliance, and large-scale data motion at a platform level rather than at an application level. For most users, almost all of the time, such volumes are indistinguishable from ordinary directories, but they provide the key handle by which data can be managed. Most aspects of management are automated inside the data fabric itself. Embedding these management aspects in the fabric allows them to fade from view, while the functional aspects of working with data that developers and data engineers need to focus on can take on full salience.

To find out more about how data fabric volumes provide separation of concerns, read “What’s Your Superpower for Data Management?” and “How to Discard Data: Solving the Hidden Challenge of Large-Scale Data Deletion.” Data fabric gives you the freedom to make data logistics the kind of boring you want, while we help keep the excitement of building the data fabric to ourselves.

About Ted Dunning


Ted Dunning is chief technology officer for Data Fabric at Hewlett Packard Enterprise. He has a Ph.D. in computer science and is an author of over 10 books focused on data sciences. He has over 25 patents in advanced computing and plays the mandolin and guitar, both poorly.




