Around the Storage Block

AIOps for VMs: HPE InfoSight Performance Recommendations gives the right Rx


According to a recent survey, the average employee loses 22 minutes each day to various IT-related issues. As IT permeates more and more of our working lives, resolving IT problems quickly (or even better, preventing them from happening at all) is becoming increasingly important.

Among the more challenging problems to resolve are those related to application slowness. There are no obvious breakages or error messages to tell us where to begin; we just know we are losing time and something, somewhere, needs to be fixed.

Wouldn’t it be nice if a fast diagnosis and quick fix were available? Now they are.

The Problem: The dreaded slowness of applications

Application performance problems are insidious. They can creep in gradually or appear suddenly. Because there are so many potential sources of performance problems, getting to the root cause of “slowness” can be an enormous undertaking.

Even with monitoring tools in place, it is easy to be overwhelmed by pages of time-series charts and alerts based on naïve thresholds. Determining which issues to tackle, and how to tackle them, requires context that just isn’t available in those charts and alerts themselves. Bridging the gap between raw information and actionable solutions often requires deep subject matter expertise across a variety of technical areas. These skills are hard to find and hard to bring to bear when a problem arises.

The Solution: Actionable intelligence

HPE InfoSight cross-stack VM recommendations for HPE Nimble Storage aim to fill this gap by providing explicit, actionable recommendations in plain language. Furthermore, these recommendations are automatically filtered and prioritized based on the relevance and relative severity of the underlying issues.

I’ll give you a concrete example of how this works: Let’s say I get a call that the application running on a set of virtual machines is “running slowly.” I can go to the HPE InfoSight Recommendations tab in the cloud-based web portal and search for the names of those VMs. Once there, I will see a summary of the problem conditions identified on those VMs, the diagnosis (i.e. root-cause) identified for those conditions, and InfoSight’s recommendations for how to correct the detected problems.

Without HPE InfoSight, this troubleshooting would look very different. Let’s say I am fortunate enough to have access to, for example, Grafana dashboards that chart my VMware metrics. To identify problems on these VMs, I would need to separately examine numerous VM CPU metrics, I/O metrics, and memory metrics, as well as related host and datastore metrics, if not more. Even doing that, I am not guaranteed to know whether the values of these metrics indicate a problem. And if they do seem to indicate a problem, it is unclear which potential problem among several candidates I should address first.

Additional challenges arise if an issue outside VMware is suspected as the root cause. In that case, I also need to hunt down the suspected components (e.g. the specific storage devices that serve the VMs in question) and examine metrics from those components as well. There again, I have the same kinds of questions: How high a latency value is too high? How low a latency exonerates the storage device? These are not questions with one-size-fits-all answers. And the complexity of pointing to a specific issue only increases as more and more components are considered.

How it works

So how does artificial intelligence-driven automation (AIOps) delivered through HPE InfoSight provide these tailored recommendations automatically to tens of thousands of IT environments around the globe?

AIOps takes many forms, but a common approach involves the automated interpretation of operational data. This interpretation is what differentiates the measurement-versus-a-threshold methods of simplistic monitoring solutions from AIOps solutions. AIOps solutions modulate their responses based on a learned understanding of the context of a measurement.

In order to identify and diagnose performance problems, the HPE InfoSight recommendation system learns how to interpret these metrics from telemetry sampled from across the global HPE user base. Machine learning models are trained to summarize the ensemble behavior of various parts of the IT stack from data spanning thousands of user sites. These models are then used to translate raw measurements into interpreted metrics that are many times more actionable than the raw measurements themselves.

Finding Problems: Context matters

Let me give you an example of an interpreted metric. To identify a performance problem, you could naïvely set a threshold on the raw latency value coming from a storage array. Or, using HPE InfoSight, you can take more than 60 distinct measurements of latency per unit time, segmenting latency by block size, read vs. write, and sequential vs. random.

Those measurements are then fed into a machine-learned model of typical behavior, trained against the behavior of tens of thousands of peer systems. The result is an interpreted latency severity metric that HPE InfoSight has shown to be many times more predictive of customer-relevant performance issues than the raw, uninterpreted value.
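The idea can be sketched in a few lines. The following toy example (all segment names, latency values, and peer statistics are invented for illustration; the real InfoSight models are far richer) scores each latency segment by how far it sits above peer-typical behavior and surfaces the worst offender:

```python
# Hypothetical segmented latency sample: latency in ms, keyed by
# (block_size_kb, operation, access pattern) -- a tiny subset of the
# 60+ segments described above.
sample = {
    (4, "read", "random"): 0.9,
    (4, "write", "random"): 1.4,
    (64, "read", "sequential"): 2.1,
    (64, "write", "sequential"): 6.5,
}

# Peer-fleet statistics (mean, standard deviation) per segment, as might
# be learned from telemetry across many systems; values are invented.
peer_stats = {
    (4, "read", "random"): (1.0, 0.3),
    (4, "write", "random"): (1.5, 0.4),
    (64, "read", "sequential"): (2.0, 0.5),
    (64, "write", "sequential"): (3.0, 0.8),
}

def latency_severity(sample, peer_stats):
    """Score each segment by how many standard deviations it sits
    above peer-typical latency, then return the worst offender."""
    scores = {
        seg: max(0.0, (value - peer_stats[seg][0]) / peer_stats[seg][1])
        for seg, value in sample.items()
    }
    worst = max(scores, key=scores.get)
    return worst, scores[worst]

worst_segment, score = latency_severity(sample, peer_stats)
print(worst_segment, round(score, 2))
```

Note that a single raw "average latency" number would blur these segments together; segment-level comparison against peers is what lets one unusually slow workload profile stand out.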

To explain why these interpreted values are so important, I turn to an analogy with heart rate. A heart rate of 150 beats per minute can be concerning if a person has been sitting on the couch all day. On the other hand, a heart rate of 150 beats per minute while working out is much more reasonable. Setting a threshold on a raw parameter like this, without accounting for context, can easily lead to false alarms and/or missed detection of problematic events.
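The heart-rate analogy can be made concrete with a minimal sketch of a context-dependent threshold (the normal ranges below are illustrative only, not medical guidance):

```python
# Context-dependent "normal" ranges: the same raw reading can be
# alarming in one context and perfectly normal in another.
NORMAL_RANGES = {
    "resting": (50, 100),     # beats per minute while at rest
    "exercising": (90, 180),  # beats per minute during exercise
}

def is_anomalous(heart_rate, context):
    """Flag a reading only if it falls outside the normal range
    for the given context."""
    low, high = NORMAL_RANGES[context]
    return not (low <= heart_rate <= high)

print(is_anomalous(150, "resting"))     # the same reading...
print(is_anomalous(150, "exercising"))  # ...two different verdicts
```

A fixed, context-free threshold would have to pick a single cutoff and either alarm on every workout or miss the couch-bound emergency; conditioning on context avoids both failure modes.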

Finding Solutions: Getting to the diagnosis and Rx

Similarly, we also create interpreted metrics to identify the most probable cause(s) of a performance issue. Often, one or more raw metrics measure a known performance bottleneck. Even though the relevant metrics may be well known, a frequent challenge is understanding how those raw metrics (e.g. CPU usage, drive bandwidth usage, cache miss rate, or virtual CPUs allocated per physical CPU) relate to the probability of a performance problem occurring.

In the HPE InfoSight recommendation system, we train models to learn these relationships from telemetry spanning tens of thousands of systems. This allows us to interpret each of these measurements relative to a common benchmark: how predictive each set of signals is of a performance issue. In this way, we can compare probabilities against probabilities and readily rank which performance bottlenecks are most pressing to address.
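One simple way to picture this "common benchmark" is a logistic model that maps each raw bottleneck metric onto the probability of a performance issue, so candidates can be ranked on the same scale. The weights, bias, and metric values below are invented for illustration; in practice such parameters would be learned from fleet-wide telemetry, and the real models are not this simple:

```python
import math

# Hypothetical learned weights mapping each raw bottleneck metric to
# the log-odds of a performance issue (illustrative values only).
WEIGHTS = {
    "cpu_usage_pct": 0.04,
    "drive_bandwidth_pct": 0.03,
    "cache_miss_pct": 0.05,
    "vcpu_per_pcpu": 0.9,
}
BIAS = -6.0

def issue_probability(metric_name, value):
    """Interpret one raw metric as a probability of a performance
    issue via a logistic function."""
    return 1.0 / (1.0 + math.exp(-(WEIGHTS[metric_name] * value + BIAS)))

# Invented raw readings from one environment.
observed = {
    "cpu_usage_pct": 55.0,
    "drive_bandwidth_pct": 95.0,
    "cache_miss_pct": 80.0,
    "vcpu_per_pcpu": 6.0,
}

# Rank candidate bottlenecks on a common scale: probability of trouble.
ranked = sorted(
    ((issue_probability(name, v), name) for name, v in observed.items()),
    reverse=True,
)
for prob, name in ranked:
    print(f"{name}: {prob:.2f}")
```

Notice that the drive bandwidth reading (95%) looks scarier than the vCPU oversubscription ratio (6) in raw terms, but once each is translated into a probability, the ranking can come out the other way; that is the point of comparing probabilities against probabilities rather than raw values against raw values.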

To explain interpreted measurements in this context, I turn to an analogy with bloodwork. A blood workup consists of many individual measurements, including the concentrations of glucose, red blood cells, and various proteins. In each case, certain values of these measurements are indicative of certain problems with a set of known remediations. The challenge is this: how would a doctor, measuring all of these quantities for the first time with no knowledge of how each number correlates with health problems, make that data actionable? Only by knowing these correlations across large numbers of patients do the numbers become useful in decision making.

Summary: Control the overload of performance metrics

In both of the examples above, I describe how models trained against an ensemble of peer data can convert raw measurements into interpreted ones. Non-AIOps solutions leave end users on the hook for interpreting raw measurements: setting thresholds and adjusting them in response to false positives and false negatives. The HPE InfoSight recommendation system requires no such user tuning, since the needed context has already been learned.

By taking guesswork and interpretation out of the performance equation, HPE InfoSight makes even the most sprawling IT environments much more manageable. Leveraging the collective behavior of tens of thousands of IT deployments, HPE InfoSight automation will interpret the performance metrics from your environment so you don’t have to. HPE InfoSight provides a fast diagnosis and the ability to prescribe a quick fix.

Visit the HPE InfoSight webpage to learn how this free AI technology delivers AI-powered autonomous operations that help ensure your environment is always on, always fast, and always agile. Watch the HPE InfoSight VM recommendations video for a guided demo of this new feature, available on HPE Nimble Storage. You can also try it for yourself by visiting the HPE InfoSight Web Portal Demo to experience this and many other self-service, self-driving demos spanning your server and storage environment.

David Adamson

Hewlett Packard Enterprise

About the Author


David Adamson is the Machine Learning Architect for HPE InfoSight. For the past several years, he has applied Data Science and ML towards automating the detection and resolution of problems in enterprise IT environments.