AIOps in production: Discover a smarter way to manage HPC systems
Learn how HPE is using artificial intelligence and machine learning to develop advanced, non-threshold-based real-time analytics to reduce data center downtime via rapid and early automated anomaly detection.
Imagine this scenario: You are a system administrator running the day-to-day operations of a large high-performance computing (HPC) center. You are just about to sit down to drink your morning coffee when the receptionist calls to check if you approve access to the data center for HPE technicians. “Thanks. Just let them in,” you say. You knew they were coming: yesterday you received an email from HPE stating that cooling distribution unit (CDU) number P245 in aisle #17 was going to fail, so service technicians would be sent today to take care of it. Back to your coffee then. This is just another service visit after all.
HPE might not offer this kind of service just yet (it is on our roadmap!), but we already offer our customers real-time anomaly detection for their hardware using artificial intelligence (AI). This means that you can take corrective action faster and prevent system failures.
The larger the system, the bigger the problem
Managing HPC systems has always been challenging. Administrators need to sift through loads of monitoring data and multiple dashboards to identify system issues. Traditional threshold-based monitoring leads to too many false alarms. Miss a critical alarm and you are held accountable for costly system downtime. As HPC systems grow larger and more complex, the amount of data that operations teams need to analyze to perform their work is growing exponentially, and their job is getting much more difficult.
According to the Uptime Institute, nearly 50% of all data centers it surveyed experienced a significant outage in the last couple of years.¹ 60% of survey respondents also believed that their most recent significant failure could have been mitigated or prevented. This is an area in which anomaly detection plays a large role.²
That is why HPE is now using AI to help customers simplify management of their IT operations, uncover issues, and react to them faster. Introducing AIOps.
AIOps on your system management dashboard
For customers who rely on HPE Performance Cluster Manager (HPCM) software to manage clusters, we now offer real-time anomaly detection using AIOps on interface hardware such as CDUs and cooling racks. The feature is available as a Technical Preview in HPCM 1.5.
Cooling hardware is notoriously complex and difficult to manage via traditional threshold-based methods because these can produce too many meaningless alerts.
Instead, our AIOps models understand and analyze historical as well as current data, linking anomalies and observed patterns to relevant events via machine learning. Example dashboards are shown in Figures 1 and 2.
Figure 1: The AIOps Single Metric dashboard plots the data values for a single metric (blue line; CDU valve position in this case), the anomaly scores for the monitored metric (red line), and an anomaly threshold (yellow line). An alert is generated (and displayed on the system dashboard shown in Figure 2) when the anomaly score exceeds the anomaly threshold. The alert expires if no additional alerts are generated during a predefined period of time.
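To make the contrast with fixed thresholds concrete, here is a minimal sketch of one simple way to score a reading against recent history. It is an illustration only and not the model used in HPCM: a rolling z-score stands in for the machine learning models described above, and the function name, window size, and sample valve readings are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_anomaly_scores(values, window=20, min_std=1e-6):
    """Score each reading by its distance from a rolling baseline.

    The score is the absolute z-score of the reading relative to the
    previous `window` readings. Readings that arrive before at least two
    earlier samples exist get a score of 0.0.
    """
    history = deque(maxlen=window)
    scores = []
    for value in values:
        if len(history) < 2:
            scores.append(0.0)
        else:
            baseline = mean(history)
            # Guard against a zero spread when recent readings are identical.
            spread = max(pstdev(history), min_std)
            scores.append(abs(value - baseline) / spread)
        history.append(value)
    return scores


if __name__ == "__main__":
    # Hypothetical valve-position readings (percent open): a slow drift
    # followed by one abrupt jump that stays below a fixed limit of 60.
    readings = [40.0 + 0.1 * i for i in range(30)] + [55.0, 43.1, 43.2]
    scores = rolling_anomaly_scores(readings, window=10)
    worst = max(range(len(scores)), key=scores.__getitem__)
    print(f"most anomalous sample: index {worst}, value {readings[worst]}")
```

In this made-up series, a fixed alarm limit of 60 would never fire, while the history-based score singles out the abrupt jump because it is far outside what the recent readings would suggest.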
Figure 2: The AIOps Alert Report dashboard in HPE Performance Cluster Manager displays notifications of anomalies for cooling hardware. This dashboard also displays a pie chart showing where in the system alerts come from.
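The alert behavior described in the Figure 1 caption (raise an alert when the anomaly score exceeds the threshold, expire it after a quiet period) can be sketched as a small state tracker. This is a hedged illustration, not HPCM code: the class name, event names, and the numeric values in the example are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Tracks one metric's alert state from a stream of anomaly scores."""

    threshold: float       # anomaly threshold (yellow line in Figure 1)
    expiry_seconds: float  # quiet period after which an alert expires
    last_alert_time: Optional[float] = None

    def update(self, timestamp: float, score: float) -> str:
        """Process one (timestamp, score) sample and return an event name."""
        # An alert lapses once the quiet period has passed with no new alert.
        expired = (
            self.last_alert_time is not None
            and timestamp - self.last_alert_time >= self.expiry_seconds
        )
        if expired:
            self.last_alert_time = None

        if score > self.threshold:
            event = "renewed" if self.last_alert_time is not None else "raised"
            self.last_alert_time = timestamp
            return event
        if expired:
            return "expired"
        return "active" if self.last_alert_time is not None else "ok"


if __name__ == "__main__":
    tracker = AlertTracker(threshold=3.0, expiry_seconds=60.0)
    samples = [(0, 1.0), (10, 4.2), (20, 3.5), (50, 0.5), (80, 0.4), (90, 0.3)]
    for t, s in samples:
        print(t, s, tracker.update(t, s))
```

Keeping only the time of the most recent alert is enough to implement expiry, because each new alert simply restarts the quiet period.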
What's next?
Using this feature, system administrators have a better chance of getting ahead of hardware issues by receiving alerts before their cooling systems fail. In the near future, we plan to extend this capability to other elements of the systems, including CPUs, GPUs, and memory.
Combined with HPE Pointnext Services, AIOps will enable HPE to more quickly resolve system issues with associated data analytics and predictions. In some cases, it will be about getting ahead of potential problems. In other cases, it will be about reducing the time to solution for critical system problems.
That’s not all.
AIOps by HPE: the big picture
HPE is using AI and machine learning (ML) to develop advanced, non-threshold-based real-time analytics to reduce data center downtime via rapid and early automated anomaly detection.
Inherent in AIOps anomaly detection capabilities is the ability to infer impending failures. Ongoing developments include adding more predictive and optimization capabilities that can be used to improve data center energy efficiency and sustainability via the prediction of the power usage effectiveness (PUE), as well as predictive scheduling of cooling for large jobs, plus the optimization of the water usage effectiveness (WUE) and carbon usage effectiveness (CUE). The effort encompasses both IT systems and the supporting facility infrastructure.
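For reference, the three efficiency metrics named above are conventionally defined (following The Green Grid) as ratios against the energy consumed by IT equipment. The formulas below restate those standard definitions only; they do not describe how the AIOps models predict or optimize them.

```latex
\begin{align*}
\mathrm{PUE} &= \frac{\text{total facility energy (kWh)}}{\text{IT equipment energy (kWh)}} \\[4pt]
\mathrm{WUE} &= \frac{\text{annual site water usage (L)}}{\text{IT equipment energy (kWh)}} \\[4pt]
\mathrm{CUE} &= \frac{\text{CO$_2$ emissions from total data center energy (kgCO$_2$e)}}{\text{IT equipment energy (kWh)}}
\end{align*}
```

A PUE of 1.0 would mean that all facility energy reaches the IT equipment, so lower values are better for all three metrics.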
AIOps for your data center
We understand that migration to AIOps will not happen overnight. Wherever you are on your AIOps journey, the expert AI consultants with HPE Pointnext are here to tailor the right solution to help your organization simplify management of your IT operations. Learn more about our operational support services. Or contact your HPE sales representative or authorized HPE Channel Partner.
Leslie Tung
Hewlett Packard Enterprise
twitter.com/hpe_hpc
linkedin.com/showcase/hpe-ai/
hpe.com/info/hpc
1 https://uptimeinstitute.com/data-center-outages-are-common-costly-and-preventable (2018-2020)
Leslie_Tung
HPC software helps solve the world’s most challenging problems. Leslie leads the HPC Software Product Management team at HPE responsible for managing a portfolio of HPE-engineered and third-party software for the HPE Cray EX Supercomputers and HPE Apollo HPC systems. The software portfolio includes system management software enabling system resiliency and DevOps software tuned for HPC and AI workloads.