How AIOps can predict and prevent the hairiest infrastructure problems

Ronak_Chokshi · 06-17-2021

Infrastructure today is undoubtedly more complex than it has ever been. If we think about the full IT stack — including storage, networking, virtualized servers, and microservices-based applications — it’s packed with complexities, and many of those arise from interactions and compatibility issues between the various layers.

When infrastructure problems occur, IT admins often point to storage, but that blame is frequently misplaced. In fact, an IDC survey¹ reported that more than 90% of infrastructure issues occur above the storage layer. The upshot is that infrastructure complexity has reduced IT’s ability to resolve problems in the desired timeframe: the mean time to recovery (MTTR) for infrastructure problems is getting longer, not shorter.

The 80-20 rule of infrastructure problems

As you can see in the chart below, most infrastructure issues (80%) fall on the left side of the chart: they are high-frequency and simple to resolve. These problems account for only about 20% of the pain and have little impact on downtime. Examples include preventing an upgrade to an incompatible version of a storage operating system, or detecting a drive failure and confirming the drive has not recovered before ordering its replacement. Such issues can be resolved with basic analytics or automation built on univariate (single-variable) analysis. While many infrastructure vendors claim to have “predictive analytics”, basic diagnostics of this kind are really the extent of their capabilities.

Figure 1. The 80-20 rule of infrastructure problems
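To make “univariate analysis” concrete, here is a minimal Python sketch of the kind of single-variable rule described above. The metric name, thresholds, and readings are hypothetical and chosen only for illustration; this is not how any particular vendor implements its diagnostics.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str      # hypothetical metric name
    warn_at: float   # hypothetical warning threshold
    fail_at: float   # hypothetical failure threshold


def evaluate(rule: Rule, value: float) -> str:
    """Classify one reading of one metric; no other signal is consulted."""
    if value >= rule.fail_at:
        return "fail"
    if value >= rule.warn_at:
        return "warn"
    return "ok"


# Hypothetical rule: reallocated sector count reported by a single drive.
drive_rule = Rule(metric="reallocated_sectors", warn_at=10, fail_at=100)
readings = [0, 3, 12, 250]
statuses = [evaluate(drive_rule, r) for r in readings]
print(statuses)  # ['ok', 'ok', 'warn', 'fail']
```

A rule like this works well for the left side of the chart precisely because one variable is enough to decide the outcome.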

IT teams feel the greatest pain with problems that fall on the right side of this chart (20% of the issues, 80% of the pain). The long tail of low-frequency problems toward the extreme right of the chart tends to inflict the most significant pain and can result in extended downtime. One example in this category is intermittent performance latency for critical applications. Correctly determining the cause requires looking at multiple numeric time series simultaneously, since no single metric can reliably identify the bottleneck. You need a well-trained machine learning (ML) classifier that observes performance telemetry, from both your storage and the layers above it, for likely causes of latency and then determines their relative impacts.

As you can imagine, it’s impossible to resolve such issues instantly by pulling application logs and troubleshooting with traditional trial-and-error methods.
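As a rough illustration of why several metrics have to be read together, the following Python sketch ranks likely latency causes with a nearest-centroid classifier. Everything in it is an assumption made for the example: the three metrics, the cause labels, the training values, and the choice of nearest-centroid itself, which stands in for whatever trained classifier a real platform would use. The normalized weights are only a crude proxy for “relative impact”.

```python
import math
from collections import defaultdict

# Hypothetical labeled telemetry windows:
# (cpu_ready_pct, network_retransmit_pct, storage_queue_depth) -> cause
TRAINING = [
    ((35.0, 0.2, 4.0), "host_cpu_contention"),
    ((40.0, 0.1, 5.0), "host_cpu_contention"),
    ((5.0, 6.0, 3.0), "network_congestion"),
    ((4.0, 8.0, 4.0), "network_congestion"),
    ((6.0, 0.3, 60.0), "storage_saturation"),
    ((5.0, 0.2, 70.0), "storage_saturation"),
]


def fit_centroids(rows):
    """Average the feature vectors of each labeled cause."""
    grouped = defaultdict(list)
    for features, cause in rows:
        grouped[cause].append(features)
    return {
        cause: tuple(sum(col) / len(col) for col in zip(*vectors))
        for cause, vectors in grouped.items()
    }


def feature_scales(rows):
    """Per-feature range, so that no single metric dominates the distance."""
    columns = list(zip(*(features for features, _ in rows)))
    return tuple((max(c) - min(c)) or 1.0 for c in columns)


def rank_causes(sample, centroids, scales):
    """Return (cause, weight) pairs, weights summing to 1, most likely first."""
    closeness = {}
    for cause, centroid in centroids.items():
        dist = math.sqrt(
            sum(((s - c) / k) ** 2 for s, c, k in zip(sample, centroid, scales))
        )
        closeness[cause] = 1.0 / (dist + 1e-9)
    total = sum(closeness.values())
    return sorted(
        ((cause, w / total) for cause, w in closeness.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


centroids = fit_centroids(TRAINING)
scales = feature_scales(TRAINING)
# A new window with moderate CPU pressure, high retransmits, and a short storage queue.
ranking = rank_causes((7.0, 5.5, 6.0), centroids, scales)
for cause, weight in ranking:
    print(f"{cause}: {weight:.2f}")
```

No single column of the sample identifies the bottleneck on its own; the ranking only emerges once all three metrics are compared against the learned profiles together.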

An AIOps platform to the rescue

For such complex infrastructure issues, a new category of AIOps platform is needed: one with the power to solve intricate problems through multivariate analysis. In effect, AIOps has a three-fold goal: reducing MTTR, automating the troubleshooting process, and preventing issues from recurring. With the right AIOps platform, we can predict and resolve some of the hairiest problems that IT admins face.

HPE InfoSight is our flagship AIOps platform. It leverages telemetry data from connected infrastructure, along with the power of cloud and AI, to bring analytics, automation, and predictions to IT operations (ITOps). Check out one of my previous blogs in which I describe how we combine the telemetry data from our connected install base with our cloud and AI capabilities. Using this combination, we’re making the issue diagnosis process shorter — and our predictions better — for that long tail of hairy problems.

“But can’t any AIOps platform do this?” you might ask. The answer is no.

Predicting the long tail of hairy infrastructure problems

Many vendors lack two things: an industrialized process and a large volume of labeled telemetry data. We address this challenge in a couple of ways.

Some of our supervised machine learning models are designed to learn the normal behavior of customer environments. Because that behavior is learned from the telemetry itself, every observation effectively serves as a labeled example, so we have as much labeled data as we have data. In those cases, our models provide a baseline for detecting anomalous behavior or for finding common correlations.
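One way to read “as much labeled data as we have data” is that each observation acts as the label for the history that precedes it. The Python sketch below illustrates that idea with a rolling baseline; the window size, the threshold `k`, and the latency values are hypothetical, and a rolling mean is a deliberately simple stand-in for a learned model of normal behavior.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, k=3.0):
    """Flag points that deviate from a rolling baseline by more than k standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]     # recent "normal" behavior
        baseline = mean(history)           # what we expect to see next
        spread = stdev(history) or 1e-9    # avoid a zero-width band
        if abs(series[i] - baseline) > k * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical latency readings in milliseconds, with one spike.
latency_ms = [2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 2.0, 9.5, 2.1, 2.0]
anomalies = find_anomalies(latency_ms)
print(anomalies)  # [7], the index of the 9.5 ms spike
```

No one had to label the spike by hand: the baseline comes entirely from the data that preceded it.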

In other scenarios, we need to be more targeted and build models that identify specific known issues. In such cases, our technical support team identifies examples of the issue, but those examples alone are too few for us to develop accurate ML models. To correct this, we turn to our install base to conduct some semi-supervised training rounds:

We take the examples provided by the support team and train an ML model to identify them.
We then use that model to scan our install base telemetry for signs of similar scenarios.
We find scenarios that the model judges to “look” similar to those examples.
We bring them back to the Technical Support SMEs to provide the correct labeling.
We then retrain the model and go fishing in the install base again.
This iteration repeats until we are confident the model has captured a sufficiently generalized representation of the issue.
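The loop described in the steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the actual InfoSight pipeline: telemetry is reduced to hypothetical 2-D points, the “model” is just the centre and spread of the confirmed examples, `sme_label` is a hypothetical stand-in for an SME review, and the loop stops when a scan surfaces nothing new rather than on a real confidence measure.

```python
import math


def centroid(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def train(positives):
    """Toy 'model': centre of the confirmed examples plus their maximum spread."""
    centre = centroid(positives)
    radius = max(math.dist(p, centre) for p in positives)
    return centre, radius


def scan(model, install_base, slack=3.0):
    """Return telemetry points that look similar to the confirmed examples."""
    centre, radius = model
    return [p for p in install_base if math.dist(p, centre) <= slack * radius]


def refine(seed_examples, install_base, sme_label, max_rounds=10):
    positives = list(seed_examples)
    labels = {}  # every point an SME has reviewed so far -> True/False
    for _ in range(max_rounds):
        model = train(positives)                           # (re)train
        candidates = [p for p in scan(model, install_base) if p not in labels]
        if not candidates:                                 # nothing new to review
            break
        for p in candidates:
            labels[p] = sme_label(p)                       # SME supplies the label
            if labels[p]:
                positives.append(p)
    rejected = [p for p, is_issue in labels.items() if not is_issue]
    return train(positives), positives, rejected


# Hypothetical data: two seed examples from support, six unlabeled install-base points.
seeds = [(1.0, 1.0), (1.2, 0.9)]
install_base = [(1.1, 1.1), (1.5, 1.3), (0.9, 1.4), (5.0, 5.0), (1.3, 0.7), (2.0, 2.0)]


def sme_label(point):
    """Hypothetical SME verdict: True when the point is a real instance of the issue."""
    return point[0] < 1.6 and point[1] < 1.6


model, confirmed, rejected = refine(seeds, install_base, sme_label)
print(len(confirmed), "confirmed;", len(rejected), "rejected by the SME")
```

Each round widens what the model recognizes, and the SME review keeps look-alikes that are not real instances of the issue out of the training set. The toy model learns only from the confirmed examples; a real classifier would also learn from the rejected ones.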

The resulting ML model is integrated into the Recommendations engine of HPE InfoSight, so we can predict forthcoming problems and alert our customers where necessary. This process is many orders of magnitude more efficient than having our SMEs manually comb through the install base after an issue occurs.

Combining our data science and technical support functions as part of this industrialized process is yet another attribute that truly distances us from our competition.

The end result for HPE is our ability to deliver autonomous IT operations to our customers, which frees them to focus on innovation: a true customer delight.

To learn more about HPE InfoSight, please check out The Gorilla Guide to AI-driven Operations with HPE InfoSight.

¹Source: Server & Storage Infrastructure Availability Survey 2018, IDC, December, 2018
