A perspective on the use of machine learning in management of complex applications

mrohad ‎04-29-2013 07:50 AM - edited ‎05-01-2013 09:37 AM

Post written by Ira Cohen


Data centers are the backbone of 21st-century economies. They are used to serve, process and store information about everything that happens electronically, which includes most of what we do today, both as consumers and as business entities. Everything is managed by data centers.


A data center includes the physical hardware – racks of computers, all connected through a physical network, internally and externally, to other data centers. But it is not just the hardware – there are also the operating systems, middleware, applications and services that make up what we see as end users on our desktops, mobile phones or browsers. These are the layers of abstraction that make up the complex applications and services serving us in our daily lives.


Managing the complexity


As the world becomes more sophisticated, managing data centers becomes a more complex task. It is not just the scale of these data centers, growing from hundreds of machines to thousands and up to hundreds of thousands of physical machines, with millions of entities to manage. It is also the multiple layers of abstraction, and the complex interconnectivity, that make managing data centers difficult.


Management of data centers requires visibility into what is happening. This leads to different management tools that monitor the various components that make up the data centers. These management tools expose their measurements to operators whose role is to make sure that all the different components run continuously and smoothly. Ironically, even though these data centers represent the state-of-the-art in computing, they are mainly managed by human operators.


However, as the complexity and scale of data centers increase, the amount of data collected by management tools grows exponentially. With this growth, the need for automated tools that transform these mounds of data into actionable information becomes paramount. This is where machine learning comes into the picture.


The importance of machine learning


Machine learning is good at taking in data and providing some form of an answer, either numerical or categorical, that distills the data into concise information. For example, machine learning algorithms take an image as input and, in turn, output a categorization of whether there are faces in the image and where they are. In the context of managing complex applications in data centers, the inputs are all the different sources of monitoring data. The output is the detection, or prediction, of problems in the applications or their components, and, better yet, the root causes of and resolutions to those problems.
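As a concrete illustration of this input-to-output mapping (the metric name, window size and threshold here are hypothetical, invented for the sketch rather than taken from any particular product), one of the simplest forms of problem detection flags monitoring samples that deviate from a trailing baseline:

```python
# Minimal sketch: flag anomalies in a monitoring metric (e.g. CPU
# utilization samples) by comparing each new value against a rolling
# baseline of the trailing window's mean +/- n standard deviations.
from statistics import mean, stdev

def detect_anomalies(samples, window=20, n_sigmas=3.0):
    """Return indices of samples that deviate from the trailing baseline."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) > n_sigmas * sigma:
            anomalies.append(i)
    return anomalies

# A steady metric with one injected spike at index 25:
cpu = [50.0 + (i % 3) for i in range(40)]
cpu[25] = 95.0
print(detect_anomalies(cpu))  # → [25]
```

Real products use far more sophisticated models, but the shape is the same: raw measurements in, a concise judgment ("this component is misbehaving, here") out.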


While it sounds simple enough, the task of building and applying such machine learning tools is quite difficult. It has been the source of intense academic research and product development for the past decade, with even longer roots. What makes these tasks so difficult? Why do we not yet see fully automated data centers that require very little human intervention in their management, all powered by machine learning algorithms? There are a few probable reasons:


  • One reason rests with the way data centers are operated. In most companies, different silos manage different parts of the data center: separate groups are responsible for the machines, the network, the applications, the middleware and the services. Each group is responsible for its own part of the world. This historical partition led management software companies to create different management products catering to these different groups, and the data from the different products is not easy to combine into a complete picture of the state of the data center at all levels.


Besides the data being stored in different databases and collected in different formats, each group demands a different kind of output, catering to its own view of the data center, which may be irrelevant to, or in conflict with, the overall goal of maintaining the availability and performance of the applications and services running in the data center.


  • A second reason is that the underlying technologies that make up a data center keep changing at a relatively fast pace, which makes it hard to validate the results of machine learning algorithms. Perhaps the biggest changes of the last few years are the virtualization of machines and the availability of cloud services. While virtualization and cloud services simplify actions such as adding machines and capacity, they introduce yet another layer of abstraction into data center management. While this layer can serve to decouple physical infrastructure management from application management, it requires major changes to the way the data is collected and analyzed using machine learning.


  • Finally, and perhaps most difficult to overcome, is the issue of trust in machine learning-based methods for data center management. The people who manage data centers have little tolerance for errors: their jobs depend on the data centers running smoothly, because that keeps the business running smoothly, and the cost of mismanagement can be very high for the companies that depend on the data center. Therefore, any adoption of automated methods happens slowly, and “black box” methods are greeted with suspicion.

In fact, many operators want to understand what’s happening under the hood, get results that are clearly understood by humans, and have outputs they can control. Many machine learning methods cannot meet such requirements; in classification problems, for example, many classifiers provide no human-readable explanation of why a sample is assigned a certain label. Even those that can (e.g., decision trees) may be hard for humans to understand. The very strengths of machine learning methods – the ability to handle high-dimensional data, to create complex rules that are not human-readable, and to handle uncertainty – are the very reasons why they are slow to be adopted in data center management.
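To make the interpretability point concrete, here is a small sketch (the tree, metric names and labels are all invented for illustration, not learned from real data) of why decision trees are considered relatively readable: a learned tree can be unrolled into if/then rules that an operator can audit, one rule per leaf.

```python
# Hypothetical illustration: a decision tree over monitoring metrics,
# represented as nested (feature, threshold, left, right) tuples with
# string labels at the leaves, rendered as human-readable rules.
def render_rules(node, conditions=None):
    """Walk the tree and emit one if/then rule per leaf."""
    conditions = conditions or []
    if isinstance(node, str):  # leaf: a predicted label
        return [f"IF {' AND '.join(conditions)} THEN {node}"]
    feature, threshold, below, above = node
    rules = render_rules(below, conditions + [f"{feature} <= {threshold}"])
    rules += render_rules(above, conditions + [f"{feature} > {threshold}"])
    return rules

# A toy tree over two made-up monitoring metrics:
tree = ("cpu_util", 90,
        ("error_rate", 0.05, "healthy", "app_fault"),
        "saturated")

for rule in render_rules(tree):
    print(rule)
# IF cpu_util <= 90 AND error_rate <= 0.05 THEN healthy
# IF cpu_util <= 90 AND error_rate > 0.05 THEN app_fault
# IF cpu_util > 90 THEN saturated
```

Even so, as the post notes, a tree learned from real high-dimensional monitoring data may grow far too deep for this kind of reading to stay practical.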


The future of machine learning


Despite these challenges, there are clear advances toward utilizing machine learning in data center management. Besides academic research (conferences, and research groups such as the RAD Lab at Berkeley[1]), data analytics products geared towards data center management are emerging, creating very large new markets. Data analysis algorithms are now found in products used for software testing (correlations and clustering), change and configuration management (similarity algorithms), application performance management (multi-dimensional anomaly detection, correlations and clustering) and automation of control actions (classification and clustering), to name a few.
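As a minimal sketch of the "correlations" building block listed above (the metric names and values are hypothetical, chosen only to show the idea), Pearson correlation between two monitoring time series can suggest which metrics move together, and therefore which components are worth investigating jointly:

```python
# Pearson correlation between two metric series: values near 1 or -1
# indicate the series move together (or in opposition); near 0, unrelated.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Two made-up series: request latency grows as a queue backs up.
queue_depth = [1, 2, 4, 8, 16, 32]
latency_ms  = [10, 12, 15, 22, 40, 75]
print(round(pearson(queue_depth, latency_ms), 3))  # → 0.999
```

Production tools compute such statistics across thousands of metric pairs and then cluster the strongly correlated ones, but the primitive is this simple.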


Besides algorithms in new or existing products, we see an emergence of new companies built entirely on data analytics for data center management. For example, Splunk[2], which is all about analyzing log files for various data center management tasks, recently had its IPO labeled as that of the first Big Data company to go public[3].


To summarize this post, I do believe that the future is bright for utilizing machine learning in managing our compute infrastructure. While it may still take some time, and a mind shift on the part of those in charge of running the infrastructure, it will become virtually impossible to manage data centers without far more automation and data understanding, which machine learning can provide.

