NetworkExperts

Modern telemetry and the dawn of AIOps-based data center automation: Part 1

This blog is part one of a two-part series. The blog is authored by Jim Capobianco, Principal Product Manager at HPE Aruba Networking, with Scott Stevens, Field CTO for AMD Pensando, contributing. Read part 2 here.

Fundamentals of data center automation

The industry is on the cusp of achieving the long-promised goals of data center automation. Artificial intelligence for IT operations (AIOps) engines are coming to market and rapidly growing in capabilities. A prerequisite to data center network automation is high-resolution, near-real-time telemetry, the fuel of AI engines. This telemetry, working in a closed loop with network elements, enforces and automates proper network operations and can advance data center network automation, simplification, and cost reduction.

Part 1 focuses on data center network architecture evolution and distributed services switching architecture.

Part 2 focuses on the newer telemetry paradigm and fulfillment of AIOps-enabled DCN automation. Read part 2 here.

4th generation data center network distributed services architecture

Early data center networks were straightforward, offering a logical, hierarchical progression of L2 and L3 switching in a tiered structure that fit the needs of the time (see figure 1). 3rd generation data center networks aimed to accommodate the massive growth in both physical and VM/containerized networked elements. While 3rd generation spine-leaf switching fabrics are robust and scalable, they are complex to manage and inefficient at accommodating network services.

Figure 1: Data center market trends

The proliferation of virtualized workloads has resulted in ~70% of network traffic moving east-west (between server racks in the data center). Network services (e.g., segmentation, security/firewall, encryption, NAT) are “bolted on” to the fabric rather than designed into the architecture. These services are typically implemented in end-of-row racks of equipment and are a major cause of east-west traffic.[1]

East-west exchanges necessitate “tromboning” to the spine layer, deepening network inefficiencies and raising costs because network devices must accommodate higher traffic loads. Hyperscale providers (such as Amazon Web Services, Google Cloud, and Microsoft Azure) solved 3rd generation data center network limitations by building custom network services (e.g., security), policy engines, AI analytics engines, and near-real-time telemetry capability with closed-loop feedback systems directly into the network fabric to automate and secure their platforms.

Technology and market trends now allow similar enhancements to be made cost-effectively in enterprise data center environments as well (see figure 2).

Figure 2: Generations of data center network architectures

At HPE Aruba Networking, along with our DPU technology partner AMD Pensando, we refer to this 4th generation as the dawn of the distributed services architecture. This is a true paradigm shift from 3rd generation data center switching architectures and is unique in that it:

  • Integrates network services operations directly into the switching fabric (for policy enforcement) and
  • Provides significant enhancement in telemetry required to fuel AIOps.

Both points are required to facilitate automation of data center operations.

Traditional telemetry mechanisms are not up to the task

Since the beginning of networking itself, network devices (rather than agents or dedicated telemetry appliances) have been the primary sources of telemetry data.

There are many different types of telemetry, and these can be measured along different dimensions. Network telemetry data types include metrics, events, logs, and traces (MELT). Dimension categories include packet level, flow level, security level, and application level. Each dimension can carry any of the MELT telemetry types.
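
One way to picture the relationship between telemetry types and dimensions is as two independent labels on every telemetry record. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes the common expansion of MELT as metrics, events, logs, and traces.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class TelemetryType(Enum):
    # Assumed MELT expansion (metrics, events, logs, traces).
    METRIC = "metric"
    EVENT = "event"
    LOG = "log"
    TRACE = "trace"


class Dimension(Enum):
    PACKET = "packet"
    FLOW = "flow"
    SECURITY = "security"
    APPLICATION = "application"


@dataclass(frozen=True)
class TelemetryRecord:
    """Hypothetical record: every sample carries one type and one dimension."""
    kind: TelemetryType
    dimension: Dimension
    source: str   # device that emitted the record
    payload: str  # free-form content, for illustration only


# Every dimension can carry every type, giving a 4 x 4 grid of combinations.
combinations = list(product(Dimension, TelemetryType))

example = TelemetryRecord(
    kind=TelemetryType.METRIC,
    dimension=Dimension.FLOW,
    source="leaf-switch-1",  # hypothetical device name
    payload="bytes=18342 packets=27",
)
```

A collector built along these lines could filter or route records by either label, for example keeping only flow-level metrics for one analysis and security-level events for another.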

Although collection methods are evolving, the most common remains a simple “request/response” for specific data. SNMP is a request/response protocol that is still widely used for telemetry data, but it has several limitations:

  • Slow (vs. new alternatives)
  • Not real time
  • Not flow-based
  • Not a streaming protocol
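
The practical effect of these limitations can be shown with a small simulation. This is a minimal sketch, not SNMP or any real streaming protocol: the utilization signal, the 30-second polling interval, and the function names are all assumptions chosen for illustration. It compares a collector that asks for a value at fixed intervals with a device that pushes an update whenever the value changes.

```python
from typing import Callable, List, Tuple

Sample = Tuple[int, int]  # (time in seconds, link utilization in percent)


def link_utilization(t: int) -> int:
    """Hypothetical signal: steady 20% load with a 3-second burst at t=31..33."""
    return 95 if 31 <= t <= 33 else 20


def poll(signal: Callable[[int], int], duration: int, interval: int) -> List[Sample]:
    """Request/response: the collector asks once every `interval` seconds."""
    return [(t, signal(t)) for t in range(0, duration, interval)]


def stream(signal: Callable[[int], int], duration: int) -> List[Sample]:
    """Streaming: the device pushes a sample whenever the value changes."""
    updates: List[Sample] = []
    last = None
    for t in range(duration):
        value = signal(t)
        if value != last:
            updates.append((t, value))
            last = value
    return updates


polled = poll(link_utilization, duration=60, interval=30)
streamed = stream(link_utilization, duration=60)

print("polled peak:  ", max(v for _, v in polled))
print("streamed peak:", max(v for _, v in streamed))
```

In this toy run the polled view never sees the burst because it falls between two requests, while the streamed view records both its start and its end. Shortening the polling interval narrows the gap but raises load on both the device and the collector.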

Most flow-level MELT telemetry is sFlow data, which provides only a “sampling” of the flow. Sampled telemetry takes snapshots in time rather than capturing all the telemetry data, and is often very low resolution (as little as 1 in every 8,000 flows, or 0.0125% of all traffic).[2] This is analogous to taking a few pictures at random intervals of your child’s sporting event, rather than video of the entire event. With random photos, it is unlikely you will capture your child’s plays.
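
The arithmetic behind that figure, and what it implies for short-lived events, can be worked through in a few lines. This is a rough sketch: the function names are hypothetical, and it assumes each flow is sampled independently with probability 1 in 8,000, which is a simplification of how real samplers behave.

```python
import math


def sampled_percent(one_in_n: int) -> float:
    """Percentage of flows observed when 1 in every `one_in_n` is sampled."""
    return 100.0 / one_in_n


def prob_all_missed(flows_in_event: int, one_in_n: int) -> float:
    """Chance that none of the flows in an event is sampled, assuming each
    flow is sampled independently with probability 1/one_in_n."""
    return (1.0 - 1.0 / one_in_n) ** flows_in_event


def flows_for_even_odds(one_in_n: int) -> int:
    """Smallest event size (in flows) with at least a 50% chance of one sample."""
    return math.ceil(math.log(0.5) / math.log(1.0 - 1.0 / one_in_n))


coverage = sampled_percent(8000)       # 0.0125 percent of flows
miss_100 = prob_all_missed(100, 8000)  # about 0.988
even_odds = flows_for_even_odds(8000)  # on the order of 5,500 flows

print(f"coverage: {coverage}%")
print(f"chance a 100-flow event is missed entirely: {miss_100:.3f}")
print(f"flows needed for a 50% chance of one sample: {even_odds}")
```

Under these assumptions, an event involving 100 flows goes completely unobserved almost 99% of the time, which is the numerical version of the random-photo analogy.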

These sampled methods were adopted more because of limits in processing power than limits in telemetry collector storage capacity. Traditional network processing units (NPUs) simply do not have the resources needed to capture and transmit enough telemetry to achieve robust network automation.

A bolted-on alternative for generating telemetry data is to deploy telemetry software agents or dedicated probe devices. While this method can provide some additional telemetry, it adds more systems to manage and is costly to deploy and operate. Additionally, these tools are often not deployed in the right location to detect transient network issues, which leaves them ineffective until they happen to capture an occurrence of the issue.

No matter what the telemetry device is, many traditional telemetry sources provide very low resolution. Feeding AI analytics engines with insufficient data can generate inaccurate responses, often referred to as “hallucinations.” Hallucinations result in automation systems that cannot be trusted and thus are not valuable.

Conclusions of Part 1

AIOps engines hold the promise of data center automation. The fuel for these engines is advanced telemetry, because traditional telemetry methods are simply inadequate. An HPE Aruba Networking distributed services architecture provides advanced real-time telemetry and closed-loop policy enforcement to meet the requirements of data center automation. Read part 2 of this blog series, which discusses the new telemetry paradigm and closed-loop AIOps for data center automation.

Sources:

[1] ESG White Paper: Creating a Distributed Services Architecture in Existing Data Center Environments  

[2] HPE Aruba Networking Blogs, “Get rich data center telemetry with DPU-powered switches,” by Scott Stevens, Field CTO, AMD Pensando 
