IT is the modern utility of the digital world

JudyGoldman

Guest post by Eric J. Bruno

Data center downtime can have devastating implications, including financial losses to the business and irrevocable damage to a company's reputation. According to Rick Schuknecht of the Uptime Institute, 73 percent of data center downtime is caused by manual IT errors resulting from human mistakes, poor maintenance practices, and poor operational governance.

Often, the stress and added pressure from one mistake lead to additional mistakes, compounding the situation. Some outages stem from failed equipment or networking issues, but most are caused by operator error. These manual errors usually come from skipped checklist steps or deviations from the documented steps, whether unintentional or an attempt at a shortcut.

Famous last words: "We'll do it later"

Proper IT automation is the savior of IT. "Let's just get this up and running" and "we'll automate it later" are among the famous last words from operations. Later rarely comes, and manual processes, once entrenched, rarely get automated. There are always too many new things added to the to-do list. Ultimately, automating up front will save you later. The steps to avoid costly manual mistakes begin with IT automation.

Additionally, you should implement ongoing training on important procedures, such as how to respond to system failure and downtime. For both automation and training, proper documentation is the first step. In fact, most of the top ten "to-do" items in this list from Data Center Frontier on maximizing reliability involve some form of documentation, which is easier to accomplish with the help of automation and orchestration. This includes operational process controls, training programs, formal change management, infrastructure management and monitoring, capacity planning, hardware and software lifecycle strategy, and geographic topology.

Running each procedure, from upgrades to new installations, with a single click removes the opportunity to slip up in an error-prone list of manual steps, and it raises the level of security. For instance, have automated scripts run as users for whom few staff have the passwords, so that in practice only the scripts can perform the procedures. Next, require staff to log in through a management console or the scripts themselves to perform the procedure, which limits execution to authorized staff members. Additionally, some automation frameworks tie into centralized administrative calendars and allow a procedure only if it was pre-authorized in a schedule.
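To make the idea of gated, pre-authorized procedures concrete, here is a minimal Python sketch. It is not taken from any particular automation framework: the procedure name, the operator list, the `ChangeWindow` type, and the calendar entries are all hypothetical, and a real deployment would pull them from a management console and a shared change calendar.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeWindow:
    """A pre-approved slot on a (hypothetical) administrative calendar."""
    procedure: str
    start: datetime
    end: datetime


# Hypothetical data: who may trigger which procedure, and when it is scheduled.
AUTHORIZED_OPERATORS = {"upgrade-db": {"alice", "bob"}}
APPROVED_WINDOWS = [
    ChangeWindow("upgrade-db", datetime(2024, 1, 6, 22, 0), datetime(2024, 1, 7, 2, 0)),
]


def may_run(procedure: str, operator: str, now: datetime) -> bool:
    """Return True only if the operator is authorized and a window is open."""
    if operator not in AUTHORIZED_OPERATORS.get(procedure, set()):
        return False
    return any(
        w.procedure == procedure and w.start <= now <= w.end
        for w in APPROVED_WINDOWS
    )


def run_procedure(procedure: str, operator: str, now: datetime) -> str:
    """Refuse to start unless both checks pass; the scripted steps go below."""
    if not may_run(procedure, operator, now):
        raise PermissionError(
            f"{operator} may not run {procedure} at {now:%Y-%m-%d %H:%M}"
        )
    # Placeholder: the real steps would run here under a service account.
    return f"{procedure} started by {operator}"
```

Calling `run_procedure("upgrade-db", "alice", datetime(2024, 1, 6, 23, 0))` succeeds because Alice is listed and the time falls inside the approved window. The same call from an unlisted user, or outside the window, raises `PermissionError` before any step runs.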

A personal security incident

It's in a cloud vendor's best interest to keep its data centers and servers as secure as possible. With security and reliability being critical components of their business (in terms of server and infrastructure uptime), cloud providers have a lot at stake if they fail in this area. In a personal security incident of my own, a colleague and I once gave a talk at a large conference using my colleague's laptop. He had set his browser to save his user ID and password for nearly every site he used, including our corporation's cloud administration interface. The entire crowd noticed this, including our management in the audience, and we were asked to prove that we hadn't put our organization and our customers at risk in the process.

The lesson here is that you're only as secure as the weakest link in your chain. Fortunately, my colleague's laptop was password protected and encrypted, and none of our cloud services experienced a breach. Had the situation been different, with someone taking advantage of this oversight to access customer-sensitive data, the incident could have had legal and financial implications for the company and my colleague. The real solution here was to implement two-factor authentication so that saved passwords alone were never enough to get into critical systems. Had we done that from the start, it would have saved us from the wrath of our management, along with public embarrassment.
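As an illustration of why a saved password alone stops being enough, the sketch below computes a time-based one-time password in the style of RFC 6238 using only the Python standard library, then requires it alongside the password check. The `login_allowed` function and its parameters are hypothetical; this is a teaching sketch under those assumptions, not a substitute for a vetted authentication product.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Derive a one-time code from a shared secret and the current time."""
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as in RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def login_allowed(password_ok: bool, submitted_code: str,
                  secret: bytes, unix_time: int) -> bool:
    """Hypothetical gate: a correct password is necessary but not sufficient."""
    expected = totp(secret, unix_time)
    # Constant-time comparison avoids leaking how many digits matched.
    codes_match = hmac.compare_digest(submitted_code.encode(), expected.encode())
    return password_ok and codes_match
```

With this gate in place, a browser that auto-fills the password still cannot complete the login, because the second factor changes every 30 seconds and lives on a separate device.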

The "100-year" storm

Another personal incident arose during Hurricane Sandy in New York. In this case, manual procedures and even some scripted semi-automated procedures proved to be almost useless during disaster recovery. When water flooded part of our data center and power was out in lower Manhattan for almost two weeks, we were forced to replicate our IT environment elsewhere. Although all of our critical software resources were properly housed at an off-site storage facility, the procedures to install and restore them on new servers failed us because they relied on older (now unavailable) hardcoded network addresses. And because the network infrastructure changed with minor updates over time that no one documented, forensically recreating the network and the restore scripts for the new servers delayed us further. Of course, given there was already tremendous pressure to get up and running, tired and harried IT folks had the tendency to compound things by making mistakes. In a fully automated environment, fatigue would be inconsequential because IT wouldn't be forced to manually restore the network and respond to the constant disruptions. The moral of this story is to use automation software that takes into account the restoration of infrastructure and network topology, in addition to server software and configurations.
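One way to avoid the hardcoded-address trap described above is to have restore scripts read addresses from a topology inventory that is versioned and updated with every network change. The Python sketch below assumes such an inventory exists as JSON; the host names, subnets, and function names are hypothetical and stand in for whatever your automation software provides.

```python
import ipaddress
import json

# Hypothetical inventory, kept in version control alongside the restore scripts.
TOPOLOGY_JSON = """
{
  "subnet": "10.20.0.0/24",
  "hosts": {"db01": "10.20.0.10", "app01": "10.20.0.20"}
}
"""


def load_topology(text: str) -> dict:
    """Parse the inventory and reject hosts outside the declared subnet."""
    topo = json.loads(text)
    subnet = ipaddress.ip_network(topo["subnet"])
    for name, addr in topo["hosts"].items():
        if ipaddress.ip_address(addr) not in subnet:
            raise ValueError(f"{name} ({addr}) is outside {subnet}")
    return topo


def remap(topo: dict, new_subnet: str) -> dict:
    """Move every host to the same offset within a replacement subnet."""
    old = ipaddress.ip_network(topo["subnet"])
    new = ipaddress.ip_network(new_subnet)
    if new.num_addresses < old.num_addresses:
        raise ValueError(f"{new} is too small to hold {old}")
    hosts = {}
    for name, addr in topo["hosts"].items():
        offset = int(ipaddress.ip_address(addr)) - int(old.network_address)
        hosts[name] = str(new.network_address + offset)
    return {"subnet": str(new), "hosts": hosts}


def restore_plan(topo: dict) -> list:
    """List restore steps using addresses from the inventory, never literals."""
    return [f"restore {name} at {addr}" for name, addr in sorted(topo["hosts"].items())]
```

At a recovery site, `restore_plan(remap(load_topology(TOPOLOGY_JSON), "192.168.50.0/24"))` produces the same steps against the new address range, with no script edits made under pressure.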

The cost of downtime

IT is the modern utility of the digital world. Just as other utilities (water and electrical suppliers) are regulated with penalties for outages, there are similar penalties for IT. Some systems have real monetary and human impact when down. Almost all others, even if not critical, will negatively affect a company's reputation, potentially resulting in real financial impact. In most cases, customers who turn to a competitor while your systems are down won't come back. The result can be job loss, perhaps even yours. Let this be motivation to avoid downtime through properly documented, automated IT procedures. Find the time today.

Judy-Anne Goldman
About the Author

JudyGoldman

My work with HPE's Enterprise.nxt team gives me a way to share my passion for emerging technology. I love connecting people to innovation, and sharing stories that help others engage with and understand the world around them. I'm a digital nomad, often found traveling with my micro companion KC, a 10-pound mini Dachshund.