Operating System - Tru64 Unix

Re: Collect I/O stats report

 
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

The system is running Tru64 V5.1B PK4 on a 12 CPU 24GB memory GS1280.

The storage sits on an EMC DMX box, with the data SRDF'ed to a remote site.

We have run similar stats collections on other systems that have storage on the same DMX, but they all look fine.

We also failed the system over to the remote site, but we get the same problem.

You are right, it sounds all wrong. I'm starting to doubt my stats collectors (iostat, collect and monitor).
Han Pilmeyer
Esteemed Contributor

Re: Collect I/O stats report

I wouldn't trust the queue depth from monitor. That program hasn't really been maintained in this millennium, and there have been changes around the device statistics since then.

I just happened to verify, for the BL24 (PK3) release and newer, that the storage device statistics reported by collect are correct.

Is it possible that the SRDF link is "stalled" when you see those high I/O queues?

I'm not sure that I could match the colors in the graph to the statistics. Could you perhaps present cfilt output for a similar event?

What does the DDR entry for the EMC device look like? I assume EMC configured that correctly for you, right?
Aco Blazeski
Regular Advisor

Re: Collect I/O stats report

Hi to everyone,

A possible cause for this behaviour is the synchronization between the two DMXes, that is, SRDF.
Check whether the synchronization between the two DMXes is synchronous or asynchronous (see the example below).
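
If Solutions Enabler (SYMCLI) is installed on the host, something along these lines should show the RDF mode and pair state. The device group name below is only a placeholder, not your actual configuration:

# symdg list                  (list the device groups defined on this host)
# symrdf -g your_dg query     (the query output reports the SRDF mode and pair state)

The same information is of course also visible from EMC Control Center.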

We also have a GS80 connected to a Symmetrix box with SRDF to a remote site, and when we turn synchronization on we see poor disk performance on the disks that are synchronizing.

The other systems on the DMX at your site that look fine may simply not be synchronizing with the remote site.

You could also turn synchronization between the DMX boxes off completely for a while and then check performance again.

Another troubleshooting step would be to look at disk usage on the DMXes through the EMC Control Center software rather than from the server side (i.e. monitor, iostat, collect...).

Hope this will help
Regards
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

Hi

These are brilliant ideas. I know, 'cause I tried them as well :-)

Even with SRDF completely out of the picture (split), the problem still occurs.

DDR entries are correct (verified that).

The EMC engineers are about to start a trace on the FAs that this system is connected to. I'll also be collecting stats with iostat, monitor and collect at 1-second intervals, and I'll post some graphs soon.
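
For what it's worth, the collection will be roughly along these lines (the collect flags are from memory and the output file name is just an example):

# iostat 1                            (per-disk transfer stats every second)
# collect -i 1 -f /var/tmp/col.1sec   (record subsystem stats at a 1-second interval to a data file)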

Thanks for your help so far.
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

While we're waiting for EMC to analyze their trace info, here are some results.

I wrote a little script that sends a single I/O to each disk and then measures the time it takes for the I/O to complete. In this short collection run there was at least one I/O that took 8 seconds to complete. These delays sometimes reach 20 seconds and, within about 30 minutes, show up on each disk independently. This particular I/O was issued at exactly 13:38:37 and completed 8 seconds later. Now look at the graphs...
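
For reference, the probe is essentially along these lines (a rough sketch; the device names are only placeholders for the actual EMC disks):

#!/bin/ksh
# Read one 512-byte block from each raw disk device and report how long it takes.
for dsk in /dev/rdisk/dsk10c /dev/rdisk/dsk11c /dev/rdisk/dsk12c
do
    echo "`date '+%H:%M:%S'` issuing one read to $dsk"
    time dd if=$dsk of=/dev/null bs=512 count=1
done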

There isn't much happening on the disk, but a queue length of 36 pops out of nowhere, the service times shoot up and the I/O takes forever to complete.

Oh the humanity!
Han Pilmeyer
Esteemed Contributor

Re: Collect I/O stats report

Can you post the results of a "hwmgr -show fibr -adapt"?
Ivan Ferreira
Honored Contributor

Re: Collect I/O stats report

Just to make sure: aren't you seeing any swapping/paging activity?

Is the swap area OUT of the SAN? Swap devices should be located on local disks.
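
For example, a couple of standard Tru64 commands should quickly show whether the box is paging and where swap lives:

# vmstat 5 5     (watch the page-in/page-out columns during the slow intervals)
# swapon -s      (list the swap devices and their utilization)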
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

Paging/swapping activity is negligible.

Adapters:

# hwmgr -show fibr -adapt

ADAPTER LINK LINK FABRIC SCSI CARD
HWID: NAME STATE TYPE STATE BUS MODEL
--------------------------------------------------------------------------------
786: emx7 up point-to-point attached scsi11 FCA-2384

Revisions: driver 2.14 firmware 1.90A4
FC Address: 0x6a0070
TARGET: -1
WWPN/WWNN: 1000-0000-c93e-60ae 2000-0000-c93e-60ae

ADAPTER LINK LINK FABRIC SCSI CARD
HWID: NAME STATE TYPE STATE BUS MODEL
--------------------------------------------------------------------------------
51: emx0 up point-to-point attached scsi3 FCA-2384

Revisions: driver 2.14 firmware 1.90A4
FC Address: 0x650071
TARGET: -1
WWPN/WWNN: 1000-0000-c93e-ca10 2000-0000-c93e-ca10

ADAPTER LINK LINK FABRIC SCSI CARD
HWID: NAME STATE TYPE STATE BUS MODEL
--------------------------------------------------------------------------------
928: emx9 up point-to-point attached scsi12 FCA-2384

Revisions: driver 2.14 firmware 1.90A4
FC Address: 0x21300
TARGET: -1
WWPN/WWNN: 1000-0000-c93e-615a 2000-0000-c93e-615a

ADAPTER LINK LINK FABRIC SCSI CARD
HWID: NAME STATE TYPE STATE BUS MODEL
--------------------------------------------------------------------------------
955: emx11 down scsi13 FCA-2354

Revisions: driver 2.14 firmware 3.92A2
FC Address: 0x0
TARGET: -1
WWPN/WWNN: 1000-0000-c931-4bb4 2000-0000-c931-4bb4

ADAPTER LINK LINK FABRIC SCSI CARD
HWID: NAME STATE TYPE STATE BUS MODEL
--------------------------------------------------------------------------------
960: emx13 up point-to-point attached scsi14 FCA-2384

Revisions: driver 2.14 firmware 1.90A4
FC Address: 0x6b0002
TARGET: -1
WWPN/WWNN: 1000-0000-c93e-61c2 2000-0000-c93e-61c2
Han Pilmeyer
Esteemed Contributor

Re: Collect I/O stats report

There's a known firmware issue with most of the HBAs you use (FCA-2384). Based on those reports I don't think it is the problem you describe, but you may want to upgrade the firmware nevertheless.

http://h20000.www2.hp.com/bizsupport/TechSupport/DriverDownload.jsp?pnameOID=341798&locale=en_US&taskId=135&prodTypeId=12169&prodSeriesId=341796&swEnvOID=1048

There is NO issue at all with having swap on SAN storage.
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

The system is scheduled for a firmware and patch upgrade in the very near future.

Besides collect and monitor, is there another way to get disk queue information?