
How controls, big data and automation work to deliver resilient IT


Big Data. It’s all over the place. The IT industry used to be somewhere outsiders peeked into with awe, but now everyone carries the equivalent of a whole cluster of early PCs in their pocket, and knows at least something about how to use it.


Now Big Data is something we see mentioned in TV advertising; it’s in the newspapers and of course on Wikipedia, whose page talks about its use in complex physics, biological and environmental research.


And it represents big opportunities too. Apparently the music industry is using analytics to help it see into the future. According to one article, “The explosion of data from sources like torrenting, music streaming sites and social media platforms has offered the music industry a huge opportunity to understand their fans and spot upcoming artists like never before. Music analytics is now worth an estimated £1.8 billion per year.”


What about resilience?

Clearly to gain any benefit from this kind of technology you have to have a reliable IT platform. It’s the poor cousin in terms of newspaper column inches (blog inches?) but the work of keeping data centres running is critical to any business today.


So how does the IT team manage to keep a reliable, resilient IT platform in the face of ever-growing data stores, more demanding business applications and ballooning IT user populations inside and increasingly outside the organisation? Well… it needs help.


And some of that help comes in the form of the “Big Data” that IT systems have been collecting for years. We’re now in a position where all those log files and mountains of system and application data can be used to identify and even predict IT failures.


Causes of failure

Let’s take a step back and look at the common causes of IT failure. There’s always going to be the possibility of hardware failure. This is something that’s covered through redundancy, hot-swappable components and automated processes that keep systems running.


The area of failure that’s less well managed is where people get involved.


  • Did someone make a stupid change?
  • Was a required change not done?
  • Was a change not done quickly enough?

Change is at the root of many IT failures because an apparently simple modification of a complex system can have wide-ranging side-effects.


Complex systems

In our world of very complex, interconnected systems, it’s difficult for anyone to foresee every impact of an action, let alone to carry out all the required actions at the right time.


In the UK, the Financial Conduct Authority (FCA) is probing banking IT failures and reviewing the systems of major lenders, and the regulator says IT resilience will be a top priority this year. With the penalties that can now be levied on financial institutions, this subject is high on the agenda in the finance industry.


But with active and vocal customer bases, IT resilience is also high on the agenda of many other industries. Where organisations provide apps to their customers, tolerance of system outages is now very low.


Delivering IT resilience

Controlling change

“If it ain’t broke, don’t fix it” is a good adage in most walks of life, not least in IT. However, we do like to meddle! As I’ve established, change is at the root of most resilience failures. So the first thing required is strict control over change.


With a strong change management process it becomes more likely that wider impacts can be spotted; that problems are identified before the change goes live; and that in the event of a problem in a live system, the change can be backed out quickly.


Achieving this kind of control requires some help in the form of automation. A system that automates the process of change completes all required steps rigorously, in the right order; it avoids human error; and it is fast.
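As a minimal sketch of what automated change control with back-out might look like, consider the following. Every name here is illustrative (this is not a real product’s API): each change step pairs an action with its reverse, and if any step fails, the steps already completed are undone in reverse order.

```python
def run_change(steps):
    """Execute change steps in order, backing out on failure.

    steps: list of (name, apply_fn, revert_fn) tuples.
    Returns (succeeded, log).
    """
    done, log = [], []
    for name, apply_fn, revert_fn in steps:
        try:
            apply_fn()
            done.append((name, revert_fn))
            log.append(f"applied: {name}")
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            # Back the change out: undo completed steps, newest first.
            for done_name, undo in reversed(done):
                undo()
                log.append(f"reverted: {done_name}")
            return False, log
    return True, log
```

Because every step records how to reverse itself before the next one runs, a failed change leaves the system as it was, which is exactly the fast back-out property described above.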


For example, patch management tools ensure that patches are applied quickly and accurately to all applicable systems in the right order. This means that any down-time is minimised while the whole process happens much faster and with less likelihood of error than with a manual process.
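The “right order” part can be sketched with a topological sort over patch prerequisites. The patch names and dependency map below are made up for illustration; a real patch tool would derive them from vendor metadata.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def patch_order(depends_on):
    """Return a safe application order for patches.

    depends_on: mapping of patch -> set of patches it requires.
    Prerequisites always appear before the patches that need them.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

Given `{"kb3": {"kb2"}, "kb2": {"kb1"}, "kb1": set()}`, the sorter guarantees kb1 comes before kb2, and kb2 before kb3, no matter how the input is written down, which is precisely the error a manual process tends to make.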



Hindsight is a wonderful thing, but foresight is much better. This is where the crystal ball of Big Data returns to the discussion.


Our computer systems and data centres have been collecting logs, statistics, usage information and so on for years. It is now possible to apply Big Data analytics techniques to this information, not only to help predict IT resource requirements, but also to identify when and how problems and failures might occur.
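As a toy illustration of the idea, here is a simple anomaly check over hourly error counts pulled from logs: flag any hour whose count sits more than k standard deviations above the mean of the preceding window. The window size and threshold are illustrative; real operations analytics tools use far richer models.

```python
from statistics import mean, stdev

def anomalies(counts, window=24, k=3.0):
    """Return indices of hours whose error count is anomalously high
    relative to the preceding `window` hours."""
    flagged = []
    for i in range(window, len(counts)):
        hist = counts[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and counts[i] > mu + k * sigma:
            flagged.append(i)
    return flagged
```

Even something this crude turns raw log volume into a short list of moments worth a human’s attention, which is the essence of the approach.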


This can be an automated process, or it could involve the very human ability to recognise patterns. Where information is displayed graphically, we humans are very good at spotting trends and anomalies.


Clearly if we know that something bad might happen using this “Operations Analytics” approach, it’s possible to properly plan a controlled change to avoid the problem. Big Data here is delivering “actionable insight” to provide a clear view of your IT infrastructure.



So we know there’s a problem – whether it’s capacity, high error rates or a patch that needs to be applied – how do we take action on this insight?


Automation is the best approach. It’s quick and accurate, and has the benefit that the steps can be reversed if necessary. Problem identification through Operations Analytics, combined with automated remediation, is the way forward for modern, complex systems.


At the moment, automation covers the more routine kinds of event – things like the degradation of performance in a virtual machine, or other events where a series of remediation steps can be pre-defined. This means that the more regular problems are handled very quickly, leaving the administrator to focus on the more complex issues that arise.


Take the example of the poorly-performing virtual machine, which raises an event that is captured by the monitor. This in turn triggers a series of steps that migrate the deteriorating virtual machine to a server with excess capacity. This HP paper covers the subject in more detail.
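The chain of events just described can be sketched as follows. The hosts, VM names and CPU threshold are all hypothetical; a real implementation would call the hypervisor’s migration API rather than updating a dictionary.

```python
CPU_THRESHOLD = 90  # percent; an illustrative trigger level

def remediate(event, placement, host_load):
    """React to a monitor event for an overloaded VM.

    event:     {"vm": name, "cpu": percent} from the monitor.
    placement: mapping of VM -> current host (mutated on migration).
    host_load: mapping of host -> current load percentage.
    Returns the new host, or None if no action was needed/possible.
    """
    if event["cpu"] < CPU_THRESHOLD:
        return None  # below threshold, nothing to do
    current = placement[event["vm"]]
    # Candidate targets: every host except the one the VM is on now.
    candidates = {h: load for h, load in host_load.items() if h != current}
    if not candidates:
        return None
    # Migrate to the host with the most spare capacity.
    target = min(candidates, key=candidates.get)
    placement[event["vm"]] = target
    return target
```

The monitor raises the event, the pre-defined steps pick a target and move the VM, and the administrator only hears about it afterwards – the routine case handled without human intervention.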


Your experience

  • What IT problems have you witnessed?
  • Do your users see problems before the support team?
  • How automated is your remediation process?

Next time I’ll be looking at how the wider base of app users can be harnessed to help keep systems running.


Further reading

You might also be interested in Big Data Analytics for IT Operations and Business Service Management.



Alastair Corbett leads HP’s UK&I Software Business Unit and has responsibility for its strategy, the promotion and selling of the IT Performance Suite and related services. Prior to this role, Alastair was responsible for defining the new sales strategy and go-to Market models for Worldwide Software Sales, and before that, he successfully led the Worldwide Services Operations team for HP Software. Alastair joined HP from Peregrine as a result of the acquisition in 2005, where he held the role of VP International Operations and was responsible for all Finance and Operations activities in EMEA and APJ. He also led the integration activity for EMEA, as well as leading the Sales Operations function.

