Servers & Systems: The Right Compute
Leslie_Tung

AIOps in production: Discover a smarter way to manage HPC systems

Learn how HPE is using artificial intelligence and machine learning to develop advanced, non-threshold-based real-time analytics to reduce data center downtime via rapid and early automated anomaly detection.


Imagine this scenario: You are a system administrator running the day-to-day operations of a large high-performance computing (HPC) center. You are just about to sit down to drink your morning coffee when the receptionist calls to check if you approve access to the data center for HPE technicians. “Thanks. Just let them in,” you say. You knew that they would be coming. You received an email from HPE yesterday stating that CDU number P245 in aisle #17 was going to fail so they would be sending service folks today to take care of that. Back to your coffee then. This is just another service visit after all.

HPE might not offer this kind of service just yet (it is on our roadmap!), but we already offer our customers real-time anomaly detection for their hardware using artificial intelligence (AI). This means that you can take corrective action faster and prevent system failures.

The larger the system, the bigger the problem

Managing HPC systems has always been challenging. Administrators need to sift through loads of monitoring data and multiple dashboards to identify system issues. Traditional threshold-based monitoring leads to too many false alarms. Miss a critical alarm and you are held accountable for costly system downtime. As HPC systems grow larger and more complex, the amount of data that operations teams need to analyze is growing exponentially, and their job is getting much more difficult.

According to the Uptime Institute, nearly 50% of the data center operators it surveyed experienced a significant outage in the last couple of years.¹ 60% of survey respondents also believed that their most recent significant failure could have been mitigated or prevented. This is an area in which anomaly detection plays a large role.²

That is why HPE is now using AI to help customers simplify management of their IT operations, uncover issues, and react to them faster. Introducing AIOps.

AIOps on your system management dashboard

For customers who rely on HPE Performance Cluster Manager (HPCM) software to manage their clusters, we now offer, as a Technical Preview in HPCM 1.5, AIOps-based real-time anomaly detection on interface hardware such as cooling distribution units (CDUs) and cooling racks.

Cooling hardware is notoriously complex and difficult to manage via traditional threshold-based methods because these can produce too many meaningless alerts.

Instead, our AIOps models understand and analyze historical as well as current data, linking anomalies and observed patterns to relevant events via machine learning. Example dashboards are shown in Figures 1 and 2.
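
To make the contrast with threshold-based monitoring concrete, here is a minimal sketch in Python. It is not HPE's model, which this article does not describe. As a simple stand-in, it scores each reading by how far it deviates from recent history (a rolling z-score). The function name, the window sizes, and the valve-position readings are all made up for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_anomaly_scores(values, window=20, min_history=5):
    """Score each reading by its distance from recent history.

    Hypothetical stand-in for a learned model: the score is the absolute
    z-score of a reading against up to `window` previous readings.
    """
    history = deque(maxlen=window)
    scores = []
    for value in values:
        if len(history) < min_history:
            scores.append(0.0)  # not enough history to judge yet
        else:
            mu = mean(history)
            sigma = pstdev(history)
            # Floor sigma so a perfectly flat history does not divide by zero.
            scores.append(abs(value - mu) / max(sigma, 1e-6))
        history.append(value)
    return scores


# Made-up valve positions (percent open): steady near 40, then a sudden jump
# that still sits well below a fixed "alarm above 80" limit.
readings = [40.0, 40.5, 39.5, 40.2, 39.8, 40.1, 39.9, 40.3, 55.0]
scores = rolling_anomaly_scores(readings, window=8)

static_alarms = [v > 80.0 for v in readings]
print(any(static_alarms))   # False: the fixed limit never fires
print(scores[-1] > 3.0)     # True: the history-based score flags the jump
```

In this sketch the jump to 55 is unusual relative to the metric's own recent behavior, so it gets a high score even though a fixed limit of 80 would stay silent.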

HPE_HPCM-AIOps1.png

Figure 1: The AIOps Single Metric dashboard plots the data values of a single metric (blue line; CDU valve position in this case), the anomaly scores for the monitored metric (red line), and an anomaly threshold (yellow line). An alert is generated (and displayed on the system dashboard shown in Figure 2) when the anomaly score exceeds the anomaly threshold. The alert expires if no additional alerts are generated during a predefined period of time.

aiop_single_metric_1.png

Figure 2: The AIOps Alert Report dashboard in HPE Performance Cluster Manager displays notifications of anomalies for cooling hardware. This dashboard also displays a pie chart showing where in the system alerts come from. 
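
The alert behavior described in the Figure 1 caption (raise an alert when the anomaly score exceeds the threshold, expire it after a quiet period) can be sketched as a small state tracker. This is an illustrative Python sketch under stated assumptions, not HPCM code: the class name `AlertTracker`, its fields, the 300-second expiry window, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Tracks one metric's alert state (hypothetical names and values)."""

    threshold: float          # anomaly threshold (the yellow line)
    expiry_seconds: float     # quiet period after which the alert expires
    last_alert_time: Optional[float] = None

    def update(self, timestamp: float, score: float) -> bool:
        """Feed one anomaly score; return True while an alert is active."""
        if score > self.threshold:
            # A new alert is generated and restarts the expiry window.
            self.last_alert_time = timestamp
        elif (
            self.last_alert_time is not None
            and timestamp - self.last_alert_time >= self.expiry_seconds
        ):
            # No further alerts within the window: let the alert expire.
            self.last_alert_time = None
        return self.last_alert_time is not None


tracker = AlertTracker(threshold=3.0, expiry_seconds=300.0)
# (timestamp in seconds, anomaly score) pairs, made up for illustration.
samples = [(0, 0.5), (60, 4.2), (120, 1.0), (300, 0.8), (360, 0.7), (420, 0.6)]
states = [tracker.update(t, s) for t, s in samples]
print(states)  # [False, True, True, True, False, False]
```

The single score above the threshold at 60 seconds keeps the alert active until 300 seconds have passed without another one, which is when the tracker clears it.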

What's next?

Using this feature, system administrators have a better chance of getting ahead of hardware issues because they receive alerts before their cooling systems fail. In the near future, we plan to extend this capability to other elements of the system, including CPUs, GPUs, and memory.

Combined with HPE Pointnext Services, AIOps will enable HPE to more quickly resolve system issues with associated data analytics and predictions. In some cases, it will be about getting ahead of potential problems. In other cases, it will be about reducing the time to solution for critical system problems.

That’s not all.

AIOps by HPE: the big picture

HPE is using AI and machine learning (ML) to develop advanced, non-threshold-based real-time analytics to reduce data center downtime via rapid and early automated anomaly detection.

Inherent in AIOps anomaly detection capabilities is the ability to infer impending failures. Ongoing developments include adding more predictive and optimization capabilities that can be used to improve data center energy efficiency and sustainability via the prediction of the power usage effectiveness (PUE), as well as predictive scheduling of cooling for large jobs, plus the optimization of the water usage effectiveness (WUE) and carbon usage effectiveness (CUE). The effort encompasses both IT systems and the supporting facility infrastructure.
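
For readers unfamiliar with the three efficiency metrics named above, the short Python sketch below shows how each is commonly defined: as a ratio against the energy consumed by the IT equipment. It only illustrates the metrics themselves, not how HPE predicts or optimizes them. The function names and the annual figures are made up for illustration.

```python
def _ratio(numerator, it_equipment_kwh):
    """Divide by IT equipment energy, rejecting non-positive denominators."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return numerator / it_equipment_kwh


def pue(total_facility_kwh, it_equipment_kwh):
    """Power usage effectiveness: total facility energy per unit of IT energy."""
    return _ratio(total_facility_kwh, it_equipment_kwh)


def wue(site_water_liters, it_equipment_kwh):
    """Water usage effectiveness: liters of water per kWh of IT energy."""
    return _ratio(site_water_liters, it_equipment_kwh)


def cue(co2_kg, it_equipment_kwh):
    """Carbon usage effectiveness: kg of CO2 emitted per kWh of IT energy."""
    return _ratio(co2_kg, it_equipment_kwh)


# Made-up annual figures for a hypothetical data center.
it_kwh = 1_200_000
print(pue(1_500_000, it_kwh))   # 1.25 (dimensionless; 1.0 is the ideal)
print(wue(2_160_000, it_kwh))   # 1.8 liters per kWh
print(cue(600_000, it_kwh))     # 0.5 kg CO2 per kWh
```

Lower is better for all three. A PUE of 1.25 means that for every kWh delivered to the IT equipment, another 0.25 kWh goes to cooling, power distribution, and other facility overhead.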

AIOps for your data center

We understand that migration to AIOps will not happen overnight. Wherever you are on your AIOps journey, the expert AI consultants with HPE Pointnext are here to tailor the right solution to help your organization simplify management of your IT operations. Learn more about our operational support services. Or contact your HPE sales representative or authorized HPE Channel Partner.


Leslie Tung
Hewlett Packard Enterprise

twitter.com/hpe_hpc
linkedin.com/showcase/hpe-ai/
hpe.com/info/hpc


1 https://uptimeinstitute.com/data-center-outages-are-common-costly-and-preventable (2018-2020)

2 https://facilityexecutive.com/2020/08/survey-of-data-center-operators-increasing-complexities-outages/

About the Author

Leslie_Tung

HPC software helps solve the world’s most challenging problems. Leslie leads the HPC Software Product Management team at HPE responsible for managing a portfolio of HPE-engineered and third-party software for the HPE Cray EX Supercomputers and HPE Apollo HPC systems. The software portfolio includes system management software enabling system resiliency and DevOps software tuned for HPC and AI workloads.