Leveraging AI and ML to solve infrastructure problems - Under-the-hood series part 1

Ronak_Chokshi · ‎10-26-2021

This is the first post in my “under-the-hood” blog series, which gives readers an in-depth view of how HPE approaches difficult infrastructure problems with AI and ML, and of how we make infrastructure invisible for customers using HPE InfoSight.

Every enterprise is looking to harvest and analyze data, using artificial intelligence (AI) and machine learning (ML) to its advantage. This gathering and analysis opens opportunities for creating new business models, developing entirely new data-driven products, and bringing new levels of efficiency across the enterprise.

But the journey isn’t without hurdles.

Extracting value from data in the real world is often a much more complex problem than most would think. The process of exploiting best practices in AI and implementing them in a way that benefits your business is usually quite sophisticated. It takes years of iterating: collecting relevant datasets, developing an increasingly large set of ML models, deploying them on the target, and monitoring their performance over time.

Then – rinse and repeat.

Industry-leading AIOps with a decade-long history

At HPE, we have spent the past decade hardening the HPE InfoSight platform to understand – intimately – the problems that our customers face every day.

HPE was an early adopter of advanced AI to address IT infrastructure-related problems. Over the years, we have built a streamlined process by which we learn how such problems occur. This has helped us to predict 86% of problems before they even occur, so we can alert our customers to the impending event to eliminate errors and improve uptime for them.

This has been made possible by the telemetry data collected over the years from our global installed base, matched with our ability to develop predictions using these datasets and to open automated support cases for our customers. This approach has become increasingly important to customers as infrastructure complexity compounds over the years, and the nature of problems – and their unexpected frequencies of occurrence – can make troubleshooting ever more challenging.

That’s when it helps to understand how we have executed on our promise of AIOps with HPE InfoSight. IT administrators don’t have to face disruptions or downtime anymore with HPE infrastructure.

Production AI relies on data and expertise

Deploying AI in production requires a hardened data science process that starts with a well-defined set of use-cases that need to be addressed. But – before I go into some of the use-cases that have kept us busy lately – let’s take a moment to first appreciate what makes it possible to address those use-cases. This process of extracting actionable insights is underpinned by two factors: datasets and domain expertise.

Datasets

Deploying AI doesn’t necessarily require collecting every possible dataset you can get from your infrastructure and applications, and dumping them into a data lake. Instead, we highly recommend starting with a clear problem definition, and then comparing it against the dataset that is relevant to that experiment. Bottom line: success comes down to designing the infrastructure systems with the right depth of instrumentation from day one.

Domain expertise

We have all seen – and perhaps tried to learn from – the number of AI tutorials out there. Academic research and open-source technologies are great starting points for AI projects. That said, if you start with some of the well-curated examples available in ML tutorials, you will very quickly find it difficult to deploy ML in your organization without the domain context. Having the relevant domain expertise is a crucial requirement for this process to be successful.

Real-life examples of infrastructure issues and our data science approach to solving them

Now, let’s review some real examples of challenges that our data science experts have faced over time, and more importantly, how they have successfully addressed them.

1.) Inability to measure performance issues. Oftentimes, we don’t have direct measurements of the infrastructure problem that we want – and need – to understand. For instance, it would be important to understand if your infrastructure will still be able to support the spikes that are expected two weeks from now – and what impact the new app workload could have on the available storage and compute resources.

The HPE approach to solving. Thanks to our globally connected installed base, we solved this issue by developing surrogate measurements and evaluating them against the best approximation of a direct measurement we could find. We also sought out additional means to improve the accuracy of our measurements, and triangulate them from multiple directions.
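
As a rough illustration of evaluating surrogate measurements against a reference, the minimal sketch below scores two candidate surrogates by Pearson correlation and mean absolute error. The metric names, the sample values, and the choice of scoring functions are all assumptions made for this example; they are not taken from HPE InfoSight.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def evaluate_surrogate(surrogate, reference):
    """Score a surrogate metric against the best available reference."""
    mae = sum(abs(s - r) for s, r in zip(surrogate, reference)) / len(reference)
    return {"pearson": pearson(surrogate, reference), "mae": mae}


# Hypothetical data: a reference latency (ms) and two candidate surrogates.
reference_latency_ms = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "surrogate_a": [1.1, 2.1, 2.9, 4.2, 4.9],
    "surrogate_b": [3.0, 1.0, 4.0, 2.0, 5.0],
}

scores = {
    name: evaluate_surrogate(values, reference_latency_ms)
    for name, values in candidates.items()
}
# Keep the candidate that tracks the reference most closely.
best = max(scores, key=lambda name: scores[name]["pearson"])
```

In practice one would probably combine several such scores, and check them on more than one reference, which is one way to read "triangulating from multiple directions".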

2.) Inadequate data at the peripheral limits of system performance. The data we have is concentrated in an area of parameter space separate from where the model will primarily be used and scrutinized. We want to quantify the capabilities of different hardware platforms, but – as with most enterprise systems – we rarely see hardware reach the extreme limits of its capabilities. And yet – that is the information that is most critical to understand for sizing and workload placement decisions.

The HPE approach to solving. Originally, we had to select from target environments to ensure that the model behavior, outside of the observed region, was a reasonable extrapolation. Now, tools like TensorFlow Lattice are available to tackle this problem in a more maintainable way. TensorFlow Lattice allows for monotonicity, concavity, convexity, and other related constraints to be applied to inputs where domain expertise can be applied.
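
To make the idea of a monotonicity constraint concrete, here is a minimal sketch. It does not use TensorFlow Lattice; it swaps in plain isotonic regression (the pool-adjacent-violators algorithm) written in standard Python, and it assumes a hypothetical relationship where latency should never decrease as load increases. The sample values are invented.

```python
def isotonic_fit(ys):
    """Least-squares non-decreasing fit to ys (pool-adjacent-violators).

    ys must already be ordered by the input variable (here: increasing load).
    """
    blocks = []  # each block is [sum_of_values, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical noisy latencies (ms), ordered by increasing load.
observed = [1.0, 1.2, 1.1, 1.5, 1.4, 2.0]
fitted = isotonic_fit(observed)
# The dips at positions 2 and 4 are pooled with their neighbors, so the
# fitted curve never decreases as load grows.
```

This only enforces the shape constraint on observed points. A lattice model goes further by applying such constraints to a learned function over several inputs, which is what makes its behavior outside the observed region easier to reason about.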

We incorporated these two use-cases in our recently announced Resource Planner capability. To learn more about this capability, check out the announcement blog.

3.) Detecting anomalous behavior in applications. The detection of changes in applications’ behavior is a use-case that spans most customer applications and has broad applicability. Significant behavior changes can point to a variety of important issues that require intervention, including downtime, surges in usage, bugs in newly deployed software, and even ransomware attacks. Detecting these behavior changes is particularly valuable when the underlying infrastructure can either be blamed or exonerated for the change.

The HPE approach to solving. Separating true anomalies from the periodic or even sporadic events is the real challenge in this use case. To address this, we took a rich representation of the historical patterns using a specific type of time series model that approximates the distribution of application-level data, taken over different periods of time, e.g. every hour, day or week.
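
The sketch below shows one simple way to separate periodic events from true anomalies. It is an assumption-laden stand-in for the time series model described above, not that model itself: it keeps a per-hour-of-day baseline (median and median absolute deviation) and flags values that fall far outside it. The function names, thresholds, and IOPS figures are hypothetical.

```python
from collections import defaultdict
from statistics import median


def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (median, mad)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    baseline = {}
    for hour, values in buckets.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        baseline[hour] = (med, mad)
    return baseline


def is_anomalous(baseline, hour, value, threshold=5.0, min_spread=1.0):
    """Flag a value that deviates from its hour's median by more than
    threshold times the spread (MAD, floored at min_spread)."""
    med, mad = baseline[hour]
    spread = max(mad, min_spread)
    return abs(value - med) > threshold * spread


# Hypothetical IOPS history over five days: a nightly backup spike at 02:00
# and steady daytime traffic at 14:00.
history = (
    [(2, v) for v in [5000, 5100, 4900, 5050, 4950]]
    + [(14, v) for v in [1000, 1020, 980, 1010, 990]]
)
baseline = build_baseline(history)

backup_spike = is_anomalous(baseline, 2, 5100)      # expected periodic event
afternoon_spike = is_anomalous(baseline, 14, 5100)  # same value, wrong time
```

The same reading of 5100 IOPS is normal during the backup window and anomalous in the afternoon, which is the distinction a single global threshold would miss.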

We then correlate this data with metrics from the underlying infrastructure components (storage volume, array and more), which helps us further qualify the anomalous behavior. If there is evidence that one or more of the components has an operational issue which resulted in the application-level anomaly, we can clearly pinpoint the problem area and offer a contextual recommendation to correct it.

There are no clear-cut rules for when models should be split or combined, so experience, domain expertise, and a healthy amount of experimentation are all required.

To experience AI-driven infrastructure and to learn more about AIOps, visit HPE InfoSight or log on to the HPE InfoSight portal.

Ronak Chokshi
Hewlett Packard Enterprise

twitter.com/HPE_Storage
linkedin.com/showcase/hpestorage/
hpe.com/storage
