The Cloud Experience Everywhere

From Crisis Management to the New Normal for Data Centers

What do cloud infrastructures and on-premises IT have in common? Data centers. Given their criticality, data centers are typically subject to extremely stringent crisis management policies. How do those policies apply currently, and what changes might apply for the new normal?

Data center critical facilities operators have faced natural disasters that impact entire regions, national grid power failures, and malicious and terrorist attacks. Previous disasters taught governments and global corporations how to react, through reactive business continuity plans, to the failure or unavailability of critical business systems and the facility systems that support them. Very few of those plans include contingencies for a disaster that immediately shuts down global societal movement and commerce at all levels. Data center teams are having to react to the unavailability of the direct and outsourced staff who continuously operate their facilities.

[Figure: Data center business continuity]

Regardless of the type of crisis, whether localized or global, data center operators have serious responsibilities to maintain business continuity for their facilities. Many of these facilities are essential, connected national infrastructures that communities and governments depend on to maintain everything from critical commerce to communication networks, government services, financial services and social networks. Because government, commerce, and society today rely completely on communication and information technology, it is in everyone’s interests to get these facilities and organizations ready.

Different countries have reacted in different ways to the notion that data center staff are “essential” in terms of “keeping the lights on” (see this article from DCD). At the same time, each of us depends on data centers as a user of data. As early as mid-March, data from one broadband provider was showing that home broadband consumption during business hours had risen by more than 41%, according to this Forbes article. And that’s before we get to peak viewing hours and the rise in streaming services. All of that data is in a data center somewhere.

A response roadmap

Response to crises requires a methodical framework. Facility owners typically follow industry standards such as Building Industry Consulting Service International (BICSI) and ANSI/TIA standards for data center design. These industry standards and best practices focus on the availability and survivability of the facility rather than on overall organizational business continuity, which includes staff availability.

HPE has identified a nine-step framework for organizations, which can also apply to the specifics of data center facility continuity policy – see this article by Rohit Dixit, SVP and GM for HPE Advisory & Professional Services: Nine steps to the new normal. The framework has two phases: Immediate Crisis Management and Bridge to the New Normal. Within this framework, HPE Critical Facilities experts have created a structured approach to provide response to the crisis in a context that addresses risks and actions in an order of criticality and priority.

[Figure: Crisis response framework]

The framework builds a criticality timeline that complements crisis management fundamentals for immediate response, pivoting to new conditions and long-term planning. The framework establishes improved controls and response, with the ultimate goals of reducing the impact on the business and speeding recovery to the new normal. The timeline breaks down the actions that the facility owner needs to perform into three tracks: immediate actions, near-future activities and long-term planning:

[Figure: Phases]

Immediate actions

Immediate response actions are enacted without delay, with an all-hands-on-deck, fire-fighting attitude. This includes alerting senior management to immediate needs such as staffing contingency, tools, and funding resources. You also need the ability to raise and resolve high-level escalations quickly in order to maintain minimum viable business continuity. This track must be completed in the first 1-2 weeks of the crisis or event.

In initiating the crisis response phase, consider the People First approach, where your people are your most important asset to help you execute a crisis response successfully, help transition your business towards continuity operations, and eventually enable you to return to normal operations. Fully protect key people as well as their families. You may need to triage your business priorities. Quickly assemble a “Crisis Management SWAT Team” to be led by experts from your staff who will bring their consulting skills and knowledge and their understanding of the little details that matter the most.

Adjust business practices by restricting physical access to the critical facilities spaces. You may want to limit entry to data centers to mission-critical staff only and bar everyone else. For the first 72 hours, consider a complete lockout to gain control of the environment and help your teams focus on the immediate response actions. Tell your customers clearly, as applicable, that the data center is locked out for any non-essential work, which will be postponed to a timeframe that works for your type of facility and capabilities.

To bring more stability to your operations, also ask your customers to check their own disaster recovery plans. Request authorization to move the latest critical backup tapes and supporting devices off-site, so that you can coordinate activities with the needs of your customers. Communication is key, as the primary goal is to continue business activities: to carry your business into the post-recovery stage and continue to serve your customers with the least damage.

Conduct deep cleaning procedures for the physical infrastructure and common space. In the case of a pandemic disaster, enforce mandatory rules for all individuals to wash hands upon arrival at the reception area of the data center. Provide protection – gloves and masks – for all staff and guests who access your facility. Preparedness planning means having such items on hand for future events.

To sustain business operations, spread the work over 24-hour periods by staggering shifts, which reduces the number of staff who are physically present at the data center at any given time. In planning work hours and shifts, consider that you may be impacted by curfews and stay-at-home orders. Because staff access to the site could be compromised, establish a skill-map of your resources. Use the skill-map to split employees into dedicated teams that work together, and don’t mix the teams. Initiate programs for team-member cross-coaching, using the skill-map as a guide to indicate availability.
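As a rough illustration of the skill-map idea, here is a minimal Python sketch. It deals staff into fixed teams that never mix, gives each team a staggered shift start, and reports which skills a team is missing so that cross-coaching can be targeted. The staff names, skills, team count and round-robin split are all assumptions made up for this example, not part of any HPE tool or method.

```python
# Hypothetical skill-map: staff member -> skills they can perform on-site.
# All names and skills are invented for illustration.
SKILL_MAP = {
    "ana": {"electrical", "cooling"},
    "ben": {"electrical", "fire_safety"},
    "chen": {"cooling", "generators"},
    "dee": {"generators", "fire_safety"},
    "eli": {"electrical", "generators"},
    "fay": {"cooling", "fire_safety"},
}

# Skills every team should cover on its own (also an assumption).
REQUIRED_SKILLS = {"electrical", "cooling", "generators", "fire_safety"}


def split_into_teams(skill_map, team_count):
    """Deal staff round-robin into fixed teams; each person is in exactly one team."""
    teams = [[] for _ in range(team_count)]
    for index, name in enumerate(sorted(skill_map)):
        teams[index % team_count].append(name)
    return teams


def shift_start_hours(team_count):
    """Spread shift starts evenly over 24 hours (integer hours, rounded down)."""
    return [team_index * (24 // team_count) for team_index in range(team_count)]


def missing_skills(team, skill_map, required):
    """Return the required skills that nobody in the team currently has."""
    covered = set().union(*(skill_map[name] for name in team))
    return required - covered


if __name__ == "__main__":
    teams = split_into_teams(SKILL_MAP, team_count=3)
    for team, start in zip(teams, shift_start_hours(3)):
        gaps = sorted(missing_skills(team, SKILL_MAP, REQUIRED_SKILLS))
        print(f"shift starts {start:02d}:00, team {team}, cross-coach on: {gaps}")
```

A real split would also weigh seniority, commute restrictions and curfews; the point here is only that a skill-map makes the gaps in each dedicated team visible before they matter.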

Near-future activities

Near-future activities are designed to help pivot organizations to address and adjust to the new conditions. This is achieved through observation and alignment, and in short order enables you to maintain reasonable business continuity. This track must be completed in the first 3-6 weeks of the crisis or event.

As you prepare to transition to the new conditions as the new normal, assess critical business needs by developing three scenarios: optimistic, neutral, and pessimistic. Ask the principal business owners what objectives will be most important in a week, month, quarter, etc. Based on that information, identify what needs to be managed to fulfill needs and support the data center function in each scenario:

  • Optimistic: unexpected events will slow down the impact of the disaster.
  • Neutral: the crisis continues at the same rate and same impacts.
  • Pessimistic: the crisis worsens, with a larger impact on the business, or lasts much longer than anticipated.

Given the impact on commerce during crisis events, consider the supply chain aspects of your operation. Start by assessing the spare parts supply chain, checking with your vendors on their own supply chain availability for all spare parts and consumables. It is important for your teams to make clear plans for their needs – with timelines – in order to communicate those to the vendors on an as-needed basis instead of over-ordering parts which may end up not being used.

With the immediate response actions in play, you can start transitioning to Bridge to the New Normal. Begin by observing and building a contingency plan for critical operations and maintenance resources. Continue to organize the transfer of expertise or training for knowledge overlap, in order to have at least 3 people able to do a given task.
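The "at least 3 people per task" target can be checked mechanically once qualifications are written down. The sketch below is one hedged way to do that in Python: it lists the tasks that fall short and by how many people, which gives a starting point for the training plan. The task names and staff data are hypothetical.

```python
MIN_QUALIFIED = 3  # from the text: at least 3 people able to do a given task


def training_gaps(required_tasks, qualifications, minimum=MIN_QUALIFIED):
    """Return {task: {"qualified": [...], "shortfall": n}} for under-covered tasks.

    qualifications maps each person to the set of tasks they can perform.
    Tasks with enough qualified people are left out of the result.
    """
    gaps = {}
    for task in required_tasks:
        qualified = sorted(
            person for person, tasks in qualifications.items() if task in tasks
        )
        if len(qualified) < minimum:
            gaps[task] = {
                "qualified": qualified,
                "shortfall": minimum - len(qualified),
            }
    return gaps


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    qualifications = {
        "ana": {"ups_bypass", "chiller_restart"},
        "ben": {"ups_bypass"},
        "chen": {"ups_bypass", "generator_test"},
        "dee": {"generator_test"},
    }
    required = ["ups_bypass", "chiller_restart", "generator_test"]
    for task, gap in training_gaps(required, qualifications).items():
        print(f"{task}: train {gap['shortfall']} more (now: {gap['qualified']})")
```

Passing the list of required tasks separately, rather than deriving it from the staff data, means a task that nobody can perform still shows up as a gap.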

As you align to the new conditions, consider rescheduling non-critical deliveries to your critical facilities sites until you learn more about the crisis and are able to have plans and actions in place to respond if deliveries negatively impact your facility or your ability to conduct business.

Look again at the business priorities you identified earlier; validate and streamline them against your overall business objectives. Again, consider the health and safety of staff, customers and others, and consider all elements in terms of creating a balance with business objectives, goals and needs.

Start analyzing and designing new processes and procedures to address any lack of preventive maintenance. Consider the scenarios and impacts for all of your equipment and systems if they were not to receive preventive maintenance for at least 3-6 months. Consult with your staff and any services vendors on their ability to perform such operations and maintenance, as well as the potential impact of the lack of such maintenance, and provision critical parts as needed.

To continue adjusting to the new conditions, conduct a test of the disaster recovery plan. This can take the form of a live switchover/switchback test during non-peak (after-business) hours, as applicable. Additionally, consider using electronic forms built on structured database software to help organize, plan, execute and track maintenance using Methods of Procedure (MOPs) and Standard Operating Procedures (SOPs).

Long-term planning

Long-term planning is intended to consider how to manage change, apply lessons learned to design new business processes, transform the business operation, and optimize overall business continuity. This track must be completed once the crisis mode has normalized in order to maintain ongoing business continuity. Depending on the type of crisis, this could be within the 3 to 6 months expected for the crisis or event to subside.

Again, depending on the type of crisis, site infrastructure can also be vulnerable. It is prudent to check the site power generation and consider replenishing the fuel supply for 4-6 weeks of operation in the case of regional power failures. You will also want to treat existing fuel storage to maintain high quality. Continue to work with your fuel supplier to arrange deliveries.
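To make the 4-6 week fuel target concrete, here is a small back-of-the-envelope sketch in Python. It assumes a constant generator burn rate and continuous running, which is a simplification; the 40,000-liter stock and 100 liters-per-hour burn rate are invented figures, and tank capacity limits are not modeled.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def runtime_days(usable_liters, burn_liters_per_hour):
    """Days of continuous generator operation the current fuel stock supports."""
    if burn_liters_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return usable_liters / (burn_liters_per_hour * HOURS_PER_DAY)


def liters_to_order(usable_liters, burn_liters_per_hour, target_weeks):
    """Extra fuel needed to reach the target weeks of continuous operation."""
    if burn_liters_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    needed = burn_liters_per_hour * HOURS_PER_DAY * DAYS_PER_WEEK * target_weeks
    return max(0, needed - usable_liters)


if __name__ == "__main__":
    # Hypothetical site: 40,000 L usable fuel, generators burning 100 L/h at load.
    stock, burn = 40_000, 100
    print(f"current runway: {runtime_days(stock, burn):.1f} days")
    for weeks in (4, 6):
        print(f"to cover {weeks} weeks, order {liters_to_order(stock, burn, weeks)} L")
```

In practice the burn rate varies with load, so it is worth running the numbers against both typical and peak consumption before talking to the fuel supplier.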

With the conditions of the new normal likely setting in within this time frame, shift your focus to staying flexible so that you can respond to changing needs.

Finally, become involved! Optimize your business and operation, and ultimately your brand, by contributing to regulatory and standards enhancements. You can start by building your own standards, or you can influence those of international data center organizations, such as the Uptime Institute’s site availability guidelines, the BICSI-002 and ANSI/TIA-942 standards for data center design, and ASHRAE recommendations for IT systems cooling and efficiency, as well as those of colocation and hyperscale data center operators. You can help to create a specific non-government track for data center resiliency so that businesses can be prepared for the next crisis.

Recommendations

The data center critical facilities community has always considered the reliability and availability of facility support systems, but has given less consideration to the human element. While planners have always considered the availability of skilled staff to plan, build and operate the facility, operational practices and industry standards haven’t adequately considered disaster modes where essential staff are unable, unavailable or not allowed to work on-site to perform critical tasks.

For long-term planning, beyond management systems such as data center infrastructure management (DCIM) and Computerized Maintenance Management System (CMMS), data center operators should consider new, creative ways to build higher availability into their staffing and operations models. Once the knowledge-sharing and knowledge-transfer initiatives are implemented, they can consider new methods to perform physical on-site operations and maintenance (O&M) management tasks by leveraging electronic tools. These could be tools such as:

  • Building Information Modeling (BIM) for complete electronic modeling of the facility
  • Virtual Reality/Augmented Reality (VR/AR) to help troubleshoot complex systems and fill any skill gap created by staff unavailability, by enabling collaboration between expert and less-skilled staff
  • Facility system Digital Twins to replicate physical systems into a digital form that performs scenario modeling simulation
  • System failure predictive analytics in order to enhance scheduled preventive maintenance into predictive targeted maintenance.

Beyond providing prescriptive actions, the nine-step framework suggested by HPE Pointnext Services has at its core a desire to bring beneficial learnings from the crisis into longer-term digital transformation strategic planning. As far as that applies to data center owners and operators, they can learn from the crisis response to build and enhance existing and new tools, knowledge, procedures, and reserves of various inventories, as well as increasing business agility during a crisis to preserve the best ability to communicate with and serve their customers and ensure satisfaction.

Historically, major local or global disasters have directly impacted the behavior of businesses and facility operators in ways such as adjusting risk acceptance, facility designs and operational procedures. We will be seeing further adjustments to how organizations source their critical facilities and IT systems for their essential operations. Some trends we’ll see include an increase in organizations relinquishing their own critical facilities and co-locating their IT systems with professional data center facility providers, along with increased waves of application migrations to cloud providers. New trends for facility designs that consider global resiliency of the essential infrastructure will get further attention from both private and public entities, with likely new regulations for essential critical facilities infrastructure.

Given the amount of data center offshoring that takes place, one realization from this particular crisis ought to be that data centers, as the data workhorses of global economic and communications activity, should perhaps be classified as being in the national and international public interest, with a formal “essential” classification.

It is time to start getting ready for the new normal – to be part of shaping it rather than reacting to it.

Stay safe,

Omar

Featured article: Nine steps to the new normal via HPE's Enterprise.nxt

Omar Elissa
Hewlett Packard Enterprise

twitter.com/HPE_Pointnext
linkedin.com/showcase/hpe-technology-services/
hpe.com/pointnext

About the Author

Omar_Elissa

I am a Worldwide Practice Principal, Technology Infrastructure, with HPE Pointnext Advisory Services, Data Center Facilities. I'm a 23-year veteran of the critical facilities industry, with a focus on delivering engineering services as a project lead.