
Re: Disk Queue Length

 
Wim Van den Wyngaert
Honored Contributor

Re: Disk Queue Length

Volker,

It is 7.3, so no /ALL.

The DIOLM is indeed high (4096).

But how do you tell that the HSG80 is overloaded? If throughput stays high, I don't have a problem with long queues. But I would like to know why they occur and who is causing them.

Wim

Volker Halle
Honored Contributor

Re: Disk Queue Length

Wim,

The only indication of an 'overloaded' HSG80 that I know of is seen in the FC counters: QF seen = 'Queue Full seen' or Seq Tmo = sequential timeouts. You need to check the counters on all your FC paths from the node to the HSG80:

SDA> FC SHOW FGA0
SDA> FC STDT
SDA> FC SHOW FGB0
SDA> FC STDT
...

If IOs to the disk/path/HSG80 are temporarily stalled, the queue length will stay high but no IOs will be processed; they will all be pending.
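
To make the distinction concrete, here is a minimal Python sketch of how one might label monitoring samples as "busy" (long queue, IOs still completing) versus "stalled" (long queue, nothing completing). The function name, the sample format and the threshold of 10 are all assumptions chosen for illustration; they do not come from any OpenVMS tool.

```python
def classify_sample(queue_length, ios_completed, queue_threshold=10):
    """Label one monitoring interval for a disk.

    queue_length    -- average queue length seen in the interval
    ios_completed   -- number of IOs that completed in the interval
    queue_threshold -- hypothetical cut-off for calling a queue "long"
    """
    if queue_length < queue_threshold:
        return "normal"
    if ios_completed == 0:
        # Long queue and no completions: everything is pending.
        return "stalled"
    # Long queue but IOs are still being processed.
    return "busy"


# Hypothetical (queue_length, ios_completed) samples.
samples = [(2, 150), (45, 900), (60, 0)]
labels = [classify_sample(q, n) for q, n in samples]
print(labels)
```

A long queue with healthy completions points at plain heavy load; a long queue with no completions is the case where the QF seen and Seq Tmo counters are worth checking.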

Volker.
Tom O'Toole
Respected Contributor

Re: Disk Queue Length

Have you run VTDPY on each controller in the pair to determine whether one or the other runs out of CPU or some other resource? Maybe switching paths, so that this unit is served by a controller whose other units are mostly idle at that time, will improve recovery from this peak.
Can you imagine if we used PCs to manage our enterprise systems? ... oops.
Robert Gezelter
Honored Contributor

Re: Disk Queue Length

Wim,

A side note, not particularly on topic.

Taken to its extreme, a short queue (queue length = 1) is not necessarily a good thing. For example, seek optimization algorithms cannot optimize without a queue to analyze. If the HSG has a problem with long queues, it is a BUG. The driver and the controller should jointly ensure that the queue length does not cause controller problems.
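
As a rough illustration of the seek optimization point, the following Python sketch compares total head movement when a queue of requests is served in arrival order versus in a single up-then-down sweep. It is a toy single-disk model with made-up cylinder numbers, not a description of what the HSG80 or the OpenVMS driver actually does.

```python
def total_seek_distance(start, requests):
    """Sum of head movements when serving requests in the given order."""
    position = start
    distance = 0
    for cylinder in requests:
        distance += abs(cylinder - position)
        position = cylinder
    return distance


def elevator_order(start, requests):
    """Reorder a queue: sweep upward from the head position, then back down."""
    upward = sorted(c for c in requests if c >= start)
    downward = sorted((c for c in requests if c < start), reverse=True)
    return upward + downward


# Hypothetical pending queue (cylinder numbers) and head position.
pending = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53

fifo = total_seek_distance(head, pending)
swept = total_seek_distance(head, elevator_order(head, pending))
print(fifo, swept)
```

With this queue, arrival order costs 640 cylinders of movement and the sweep costs 299. With a queue of one request there is nothing to reorder, so both figures are identical.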

Apart from exhausting non-paged dynamic memory on the host OpenVMS system, and the performance problems caused by a disproportionate queue length on one particular device, there should be no problems with long queues.

Of course, the individual performance of a particular process is a different question. Most performance analyses aim to maximize the performance of the system as a whole, not that of any single process.

Using DIOLM and other account quotas to manage the workload on the HSG is crude and imprecise at best, and self-defeating at worst. It is true that for a particular configuration, increasing quotas beyond a certain point yields dramatically decreasing benefits, but that is an entirely different issue.

I hope that the above is helpful.

- Bob Gezelter, http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: Disk Queue Length

Did some thinking at home. Maybe it is just that normally all write IO goes into the cache and is processed afterwards. But if the cache is full, writes may start to queue. So it would simply be a longer interval of heavy load than usual. I will check on Wednesday with the real data.
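
The idea can be sketched with a small Python simulation of a write-back cache that drains at a fixed rate. The capacity, drain rate and arrival numbers are invented for illustration and say nothing about real HSG80 behaviour.

```python
def simulate_cache(arrivals, cache_capacity, drain_rate):
    """Return the host-side queue length after each tick.

    arrivals       -- number of writes issued by the host in each tick
    cache_capacity -- writes the controller cache can hold (assumed)
    drain_rate     -- writes flushed from cache to disk per tick (assumed)
    """
    cached = 0   # writes currently held in the cache
    queued = 0   # writes waiting because the cache is full
    history = []
    for incoming in arrivals:
        # The controller flushes part of the cache to disk.
        cached = max(0, cached - drain_rate)
        # Waiting and new writes compete for the free cache space.
        waiting = queued + incoming
        accepted = min(waiting, cache_capacity - cached)
        cached += accepted
        queued = waiting - accepted
        history.append(queued)
    return history


short_burst = simulate_cache([50, 50, 0, 0], cache_capacity=100, drain_rate=20)
long_burst = simulate_cache([50, 50, 50, 50], cache_capacity=100, drain_rate=20)
print(short_burst, long_burst)
```

In this model the short burst is absorbed completely (the queue stays at 0), while the same write rate sustained for four ticks fills the cache and a queue starts to build in the third tick.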

Tom: yes, running VTDPY could give some extra data, but I don't have permission to come in at 6:00.

Wim@home
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Disk Queue Length

It seems that I "forgot" this case.

I now have a similar case. FAL was doing non-virtual IO with a throughput of 2 - 6 MByte per second. This went on for 1 hour, so several GB in total. But the application guys don't know why.

Any ideas?

The cluster was rebooted and the problem is gone. But I would like to know what it was.

Wim
Hein van den Heuvel
Honored Contributor

Re: Disk Queue Length

Wim, is that really VMS 7.3 or VMS 7.3.x?

What Volker is suspecting but maybe not articulating strongly enough is that there is a serious performance problem once you hit a QFULL condition (credits) on the fibre channel driver. One spike, and you'll slow down after that.
The recovery for that is way too gentle/slow.
This is addressed by VMS732_FIBRE_SCSI-V0700.
The fact that it gets better after reboot kind of confirms this.
The potential alternative to this is supposedly:
SDA> FC SET WTID/WWID=wwid-number/QFTIMED=1


About the 6 am / hourly spikes possibly related to Sybase... Could Sybase be doing something like what Oracle calls a checkpoint? A great many IOs in one go to sync memory with disks?
Is there a knob in Sybase to limit this? Like a 'max-write-io' perhaps?

Now about Backup. Running the backup process with a DIOLM of many thousands is 'old school'. It goes well beyond the point of diminishing returns. Please check the current process quota recommendation, or just set it back to 100 and try.
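
The diminishing returns can be illustrated with a deliberately simple Python model in which throughput grows with the number of outstanding IOs until the device limit is reached. The 5 ms service time and the 4000 IO/s ceiling are assumptions picked for the example, not measurements of any HSG80.

```python
def throughput(outstanding_ios, service_time_s, max_iops):
    """Idealized IOs per second for a given number of outstanding IOs.

    Each outstanding IO completes in service_time_s seconds until the
    device ceiling max_iops is hit; past that point extra outstanding
    IOs only lengthen the queue.
    """
    return min(outstanding_ios / service_time_s, max_iops)


# Hypothetical device: 5 ms per IO, saturating at 4000 IO/s.
for limit in (10, 20, 100, 4096):
    print(limit, throughput(limit, service_time_s=0.005, max_iops=4000))
```

In this model the device saturates at about 20 outstanding IOs, so a limit of 100 and a limit of 4096 deliver the same throughput; the larger value only allows a longer queue.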

With kind regards,

Hein.



Wim Van den Wyngaert
Honored Contributor

Re: Disk Queue Length

Hein,

It is 7.3, patched up to and including the patches of 12-mar-2003.

Note that the reboot solved the problem of FAL doing a lot of non-virtual QIOs. But why?

The queue length problem is still present and, in my opinion, is caused by high activity (controller saturated). At that moment a backup is running, along with some big FTP transfers and some smaller Sybase dumps. The controller is also used by another cluster that is doing backups combined with heavy DWH activity.

I still have the peaks in non-virtual QIO, but currently not when the queue length is high.

And Sybase has no checkpoint activity.
Wim