Advancing Life & Work

AIOps for High-Performance Computing Data Centers and Beyond

Hewlett Packard Labs has made great progress over the past year in advancing the use of artificial intelligence (AI) and machine learning (ML) for IT operations and automation, also known as AIOps. Labs research advances AIOps in three directions: scaling to leadership-class machines of the exascale generation, extending operational data beyond IT, and expanding to the data center facilities edge. Labs has also developed new machine learning and deep learning algorithms that are built on an open software architecture, are hardware-agnostic, and are based on open source components.


AIOps for Advanced Anomaly Detection
It has become increasingly clear that AIOps improves efficiency, sustainability, and resiliency for high-performance computing across edge-to-supercomputer domains.

“We can take the learnings, and more specifically the models that we train on high-performance computing (HPC) systems, into enterprise data centers, to help anomaly detection for our customers,” says Paolo Faraboschi, HPE Fellow and Vice President, Hewlett Packard Labs.

“Many of the new workloads, for example AI training, look a lot more like HPC and they are very different from traditional cloud and enterprise: very complex, long-running applications, taking over many servers with heterogeneous components like GPUs and accelerators. Root causing anomalies can enable efficiency and cost optimization.”

With ever-increasing amounts of data generated at the edge, AIOps can be particularly effective.

“It’s quite common to deploy AIOps models at edge devices where data is being generated and collected,” says Sergey Serebryakov, Senior Research Engineer, Hewlett Packard Labs. “Our AIOps team has submitted a number of invention disclosures about concept drift detection and compensation. Concept drift detection is identifying when input data has changed enough so that ML models need to be retrained. Drift compensation is the ability of a system or ML model to work with data sources and sensors that have drifted from their calibrated levels.”
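To make the two terms concrete, here is a minimal sketch in Python. It is an illustration only, not the method described in the Labs invention disclosures. It assumes drift is detected by comparing a reference window of readings against a recent window with a two-sample Kolmogorov-Smirnov statistic, and that compensation is a simple median offset correction. The function names, the 1.36 coefficient (the usual large-sample value for roughly a 5% significance level), and the sensor readings are all hypothetical.

```python
import bisect
import math
import statistics


def ks_statistic(reference, recent):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref_sorted = sorted(reference)
    rec_sorted = sorted(recent)
    n, m = len(ref_sorted), len(rec_sorted)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so checking every sample point suffices.
    for value in ref_sorted + rec_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, value) / n
        cdf_rec = bisect.bisect_right(rec_sorted, value) / m
        max_gap = max(max_gap, abs(cdf_ref - cdf_rec))
    return max_gap


def drift_detected(reference, recent, coeff=1.36):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(recent)
    critical = coeff * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > critical


def compensate_offset(reference, recent):
    """Assumed compensation: shift recent readings so their median matches the reference."""
    offset = statistics.median(recent) - statistics.median(reference)
    return [value - offset for value in recent]


# Hypothetical sensor readings in tenths of a degree Celsius.
reference = [200 + (i % 10) for i in range(200)]
drifted = [value + 5 for value in reference]  # sensor now reads 0.5 degrees high

print(drift_detected(reference, reference))  # False
print(drift_detected(reference, drifted))    # True
print(drift_detected(reference, compensate_offset(reference, drifted)))  # False
```

In this sketch a detected drift would be the signal to retrain the model, while the offset correction lets an existing model keep working on readings from a sensor that has moved away from its calibrated level.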

“We can also expand the use of AIOps to other heavy edge systems in the HPC space,” Faraboschi says. “The post-exascale generation will be a continuum of compute capabilities from the supercomputing core to the experimental instrumentation edge. Extending the use of AI technology to improve and automate the operation will be key to improve the efficiency of the combined system.”


AIOps Improves Energy Efficiency for NREL’s Data Center
The AIOps research effort is a collaboration among the National Renewable Energy Laboratory (NREL), HPE’s High Performance Computing team (HPE HPC), and Labs. To give an update on their work, Sergey Serebryakov (Hewlett Packard Labs), Tahir Cader (HPE HPC), and David Sickinger (NREL) virtually attended the SC20 Conference in November 2020 and delivered their presentation “AIOps: Leveraging AI/ML for Data Center Resiliency and Energy Efficiency.”

“I think this is a great opportunity because there are a number of open source packages that provide ML algorithms for anomaly detection that work well on clean, academic-type data sets, but by working with the NREL, we have access to data center telemetry data and have an opportunity to test how they work out-of-the-box on the real data and propose improvements,” says Serebryakov.

HPE and NREL are using over five years of data, totaling over sixteen terabytes, collected from sensors in NREL’s supercomputers and its facility to train anomaly detection models that can help prevent issues before they occur.

“The availability of real data from such a complex data center as NREL’s is really exciting,” Serebryakov says. “We have access to the real-time data and their infrastructure, so I have an account in their data center and a role, and we have dedicated hardware there. We're not only developing these models, but we can go and deploy these models right away and see how they work in real production environments and get real-time feedback.”

“Our focus has been on facility metrics: coolant distribution units (CDU) and cooling rack controllers (CRC). Our advanced, real-time models enable early detection to respond and prevent catastrophic events, such as a shutdown. Using historical data, we can predict critical incidents and identify hardware-related events that may occur at NREL’s data center,” Serebryakov says. “We’re also developing a unique system that can train models at scale to process thousands of metrics, doesn’t require labeled data, and can evaluate models in an unsupervised fashion.”
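As a rough illustration of what detecting anomalies without labeled data can look like, the sketch below flags outliers in a single metric using a modified z-score built from the median and the median absolute deviation (MAD). This is a generic textbook technique chosen for the example, not the system HPE and NREL are building. The metric name, the readings, and the 3.5 threshold are hypothetical.

```python
import statistics


def robust_anomalies(values, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [
        index
        for index, value in enumerate(values)
        if 0.6745 * abs(value - median) / mad > threshold
    ]


# Hypothetical CDU supply temperature readings in degrees Celsius.
cdu_supply_temp_c = [30.1, 30.0, 29.9, 30.2, 30.0, 45.0, 30.1, 29.8]

print(robust_anomalies(cdu_supply_temp_c))  # [5]
```

Because the median and MAD are computed from the data itself, no reading has to be labeled as normal or abnormal in advance. Applying the same function to each metric independently is one simple way to cover many metrics, although it ignores correlations between them.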

About Sergey Serebryakov

Sergey Serebryakov is a Senior Research Engineer at Hewlett Packard Labs. His research interests include machine learning, as well as deep learning and its applications. Sergey received a Ph.D. from the Saint-Petersburg Institute of Informatics and Automation.

About Paolo Faraboschi
Paolo Faraboschi is an HPE Fellow and Vice President and directs the AI Research Lab at Hewlett Packard Labs. Paolo is an IEEE Fellow (2014) for "contributions to embedded processor architecture and system-on-chip technology," author of 55 granted patents, over 100 publications, and the book “Embedded Computing: a VLIW Approach.” Paolo received a Ph.D. in EECS from the University of Genoa, Italy.


Curt Hopkins
Hewlett Packard Enterprise

About the Author


Managing Editor, Hewlett Packard Labs