Pinpoint I/O Spike

 
Steven Clementi
Honored Contributor

Aside from looking at the per-VLUN stats (srstatvlun), is there another way to determine which volume is taking the heaviest I/O load and could be causing a spike in latency?

Scenario: Customer has two 3PAR 8440s in a Peer Persistence configuration. The sites are on one campus, so the arrays are within several thousand feet of each other. SAN switches on both sides use RCFC links.

At a specific time each day, the arrays see an uptick in I/O and latency severe enough that the total host port queue length hits 1500, sometimes higher, and then (usually) subsides quickly. When this happens, some of the remote copy groups cannot keep up and lose sync, applications go offline, and issues are felt across a good portion of the environment.

-End Scenario

I've been trying to pinpoint a specific volume or host port that could lead me to the rogue server or virtual machine causing the issues, but I'm also wondering whether something could be happening on the array itself, since the issue occurs like clockwork, every day at or around the same time.

Any thoughts or suggestions?

Steven Clementi
HP Master ASE, Storage, Servers, and Clustering
MCSE (NT 4.0, W2K, W2K3)
VCP (ESX2, Vi3, vSphere4, vSphere5, vSphere 6.x)
RHCE
NPP3 (Nutanix Platform Professional)
1 REPLY
rmay_bk
Valued Contributor

Re: Pinpoint I/O Spike

At the 3PAR CLI, "showtask -all" will indicate any tasks that might have run at the time of your activity spike.
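
If the task list is long, the comparison against the spike time can be scripted. This is only a rough sketch: it assumes you have copied task names and their start and finish times out of the showtask output by hand, and the task names, dates, and the `tasks_near_spike` helper are all made up for illustration.

```python
from datetime import datetime, timedelta

def tasks_near_spike(tasks, spike_start, spike_end, margin_minutes=10):
    """Return names of tasks whose run interval overlaps the spike window.

    tasks is a list of (name, start, finish) tuples. The window is padded
    by margin_minutes on both sides to allow for clock skew and ramp-up.
    """
    margin = timedelta(minutes=margin_minutes)
    lo, hi = spike_start - margin, spike_end + margin
    return [name for name, start, finish in tasks
            if start <= hi and finish >= lo]

# Hypothetical entries, transcribed manually (not parsed from real output).
tasks = [
    ("nightly_snapshot_job", datetime(2024, 1, 15, 1, 0), datetime(2024, 1, 15, 1, 5)),
    ("afternoon_tuning_job", datetime(2024, 1, 15, 13, 55), datetime(2024, 1, 15, 14, 20)),
]

# Hypothetical spike window taken from the performance charts.
spike_start = datetime(2024, 1, 15, 14, 0)
spike_end = datetime(2024, 1, 15, 14, 10)

suspects = tasks_near_spike(tasks, spike_start, spike_end)
print(suspects)
```

A task that shows up here on several consecutive days is worth a closer look; one that overlaps only once is probably coincidence.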

Within the SSMC console, under Reports, there's a built-in template called "Exported Volumes - Compare by Performance" that can be configured to show, for example, a top 10 listing of volumes by IOPS and/or bandwidth.
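
One caveat with a plain top 10 is that a volume that is busy all day will lead the list without being the cause of the spike. If you can export per-volume samples, something like the following sketch ranks volumes by how much their IOPS rise inside the spike window compared with the rest of the day. The sample layout (volume name, minute of day, IOPS), the volume names, and the `rank_by_spike` helper are assumptions for illustration, not an SSMC export format.

```python
from collections import defaultdict
from statistics import mean

def rank_by_spike(samples, window_start, window_end, top_n=10):
    """Rank volumes by the rise in mean IOPS inside the spike window.

    samples: iterable of (volume, minute_of_day, iops).
    Returns (volume, window_mean, baseline_mean, delta) tuples,
    largest delta first.
    """
    inside = defaultdict(list)
    outside = defaultdict(list)
    for vol, minute, iops in samples:
        bucket = inside if window_start <= minute < window_end else outside
        bucket[vol].append(iops)

    ranked = []
    for vol, vals in inside.items():
        base_vals = outside.get(vol)
        base = mean(base_vals) if base_vals else 0.0
        ranked.append((vol, mean(vals), base, mean(vals) - base))
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked[:top_n]

# Hypothetical samples: (volume, minute of day, IOPS).
samples = [
    ("vol_sql", 830, 500), ("vol_sql", 840, 9000), ("vol_sql", 845, 11000),
    ("vol_file", 830, 3000), ("vol_file", 840, 3200), ("vol_file", 845, 3400),
]

# Spike window assumed to run from 14:00 to 14:10 (minutes 840 to 850).
ranked = rank_by_spike(samples, 840, 850)
for vol, in_win, base, delta in ranked:
    print(f"{vol}: window={in_win} baseline={base} delta={delta}")
```

Here vol_file has the higher baseline, but vol_sql shows the jump, so it sorts first. The same ranking works for bandwidth or queue length if you swap the metric.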

Also, if the workload in question is VMware, vCenter's performance charting can provide many clues as well.

If your environment is like mine, then you've got IT folks who aren't thinking about the consequences of starting many workload jobs or tasks all at the same time. We even have a mass overnight reboot of RHEL VMs that creates a 3-4 minute IOPS storm. There are band-aids such as volume IOPS limits, SIOC (in the case of VMware), and Priority Optimization on the array (if licensed). You could schedule an Adaptive Optimization job that focuses on the time window encompassing your "storage storm." But a proper resolution is going to require education.