Operating System - OpenVMS

High disk IO and low hit ratio

 
Grzegorz Pawlowski
New Member

High disk IO and low hit ratio

Recently on one cluster (3x ES40) we had some memory issues which caused crashes.
The memory banks were replaced, but now we face some high I/O and CPU issues.

I found that one disk with an Oracle Rdb database has higher I/O than it is supposed to have.
On the twin cluster the I/O rate is around 30; here it is around 210.
The disks are shadow sets on a SCSI HSZ80.

I've also noticed that the memory hit ratio is 64% on the bad cluster, whereas on the good one we have 98%.

Also, in the Rdb statistics we see a lot of direct reads and writes.

Please help me: what can I check and what should I do?
My two clusters are identical in HW and SW, but with the same load there is a 2.5x difference in CPU usage because of this I/O and memory behaviour.
abrsvc
Respected Contributor

Re: High disk IO and low hit ratio

A true answer may require additional information, but to start:

1) Can you please describe the actual hardware configuration, including controllers?

2) What version of VMS, etc.?

You indicate that both clusters are identical. Is this true of the SYSGEN parameter setup as well? (Other than node-specific names and addresses, etc.)

Thanks,
Dan
Hoff
Honored Contributor

Re: High disk IO and low hit ratio

This reeks of a physical memory downgrade; of the removal of part of the memory after those "memory banks were replaced".
Grzegorz Pawlowski
New Member

Re: High disk IO and low hit ratio

1)
Controller:
HSZ80 ZG94710176 Software V83Z-0, Hardware E04
NODE_ID = 0000-0000-0000-0000
The controller cache is good and most of the disks run from this storage, but we have a problem with only one of them.

Disks are COMPAQ BD009122C6


2) What version of VMS, etc.?

The SYSGEN parameter setup is identical as well.
Actually it is almost an "off the shelf" product, so the installation is copy-and-paste based.

VMS V7.3-2

I ran a line-by-line comparison of the system parameters and the database configuration.
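Roughly like this, a minimal sketch using SYSGEN and DIFFERENCES (the listing file names are just placeholders):

$ ! On each node, dump all SYSGEN parameters to a text file
$ MCR SYSGEN
SYSGEN> SET/OUTPUT=FR51_PARAMS.LIS
SYSGEN> SHOW/ALL
SYSGEN> EXIT
$ ! Repeat on the other node, then compare the two listings
$ DIFFERENCES FR51_PARAMS.LIS FR61_PARAMS.LIS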

FR51:SYSTEM> show mem /cache
System Memory Resources on 9-DEC-2010 15:49:23.51

Extended File Cache (Time of last reset: 1-DEC-2010 05:16:27.56)
Allocated (GBytes) 2.46 Maximum size (GBytes) 4.00
Free (GBytes) 0.00 Minimum size (GBytes) 0.00
In use (GBytes) 2.46 Percentage Read I/Os 37%
Read hit rate 62% Write hit rate 0%
Read I/O count 42800824 Write I/O count 72721312
Read hit count 26682795 Write hit count 0
Reads bypassing cache 45 Writes bypassing cache 0
Files cached open 654 Files cached closed 99
Vols in Full XFC mode 0 Vols in VIOC Compatible mode 24
Vols in No Caching mode 0 Vols in Perm. No Caching mode 0

Write Bitmap (WBM) Memory Summary
Local bitmap count: 48 Local bitmap memory usage (MB) 1.78
Master bitmap count: 24 Master bitmap memory usage (KB) 912.00

FR61:SMSC> show mem /cache
System Memory Resources on 9-DEC-2010 15:49:30.96

Extended File Cache (Time of last reset: 19-AUG-2010 08:41:17.07)
Allocated (GBytes) 2.49 Maximum size (GBytes) 4.00
Free (GBytes) 0.00 Minimum size (GBytes) 0.00
In use (GBytes) 2.49 Percentage Read I/Os 34%
Read hit rate 96% Write hit rate 0%
Read I/O count 279294453 Write I/O count 538704240
Read hit count 270892352 Write hit count 0
Reads bypassing cache 416 Writes bypassing cache 1313869
Files cached open 662 Files cached closed 100
Vols in Full XFC mode 0 Vols in VIOC Compatible mode 24
Vols in No Caching mode 0 Vols in Perm. No Caching mode 0

Write Bitmap (WBM) Memory Summary
Local bitmap count: 48 Local bitmap memory usage (MB) 1.78
Master bitmap count: 24 Master bitmap memory usage (KB) 912.00


Grzegorz Pawlowski
New Member

Re: High disk IO and low hit ratio

Hoff,

The first bank was replaced as it was rejected by the system at the startup test after the crash.
The second was replaced as there were some "correctable parity errors".

HP support stated that now everything should be OK with the memory.

Do you think a clusterwide reboot could help with this issue?
The DB was the only thing that was not restarted after the memory exchange.
abrsvc
Respected Contributor

Re: High disk IO and low hit ratio

My first impression is that you are not comparing correctly. One machine has been up for a longer period, and that alone will skew the stats a bit. The only fair test is to note the read and write stats (counts only) before each test and again when the tests complete. Compare the hard counts and the cache hits between those two points. That will at least give you a more accurate comparison. While not 100% accurate, it will be a better comparison than the overall stats.

Start there and see what the true difference is. Please report that here.
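Something along these lines (a rough sketch; the file names are placeholders):

$ ! Snapshot the cache counters before the test window
$ SHOW MEMORY /CACHE /OUTPUT=CACHE_BEFORE.LIS
$ ! ... run the test, or wait for the measurement interval ...
$ SHOW MEMORY /CACHE /OUTPUT=CACHE_AFTER.LIS
$ ! Compare the two snapshots and work out the deltas of the
$ ! read/write I/O counts and hit counts
$ DIFFERENCES CACHE_BEFORE.LIS CACHE_AFTER.LIS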

Thanks,
Dan
Hoff
Honored Contributor

Re: High disk IO and low hit ratio

>HP support stated that now everything should be ok with memory.

"Trust, but verify."

I'd confirm the quantity of viable physical memory within the server is the same both before and after the repairs, and I'd also confirm that the hardware caches are not shut off.
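For the memory quantity, something as simple as this on both nodes, compared against whatever pre-repair records you have:

$ ! Physical memory usage (total / in use / free) on this node
$ SHOW MEMORY /PHYSICAL_PAGES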

Last I checked, Rdb didn't use XFC.

This pairing:

>Reads bypassing cache 45 Writes bypassing cache 0

>Reads bypassing cache 416 Writes bypassing cache 1313869

is interesting. That reeks of database activity (as databases tend to have their own internal and tailored I/O caching, as do various other I/O heavy applications; generic caches aren't as effective), or possibly a whole pile of corner-case I/O requests (INITIALIZE /ERASE, etc) aimed at the storage.

>The first bank was replaced as it was rejected by the system at the startup test after the crash.
>The second was replaced as there were some "correctable parity errors".

Some parity errors are expected. Piles of parity errors are a problem, particularly when they're occurring within the same chip. Uncorrectable memory errors are a bigger problem.

>Do you think that clusterwide reboot can help with this issue?

That wouldn't be my immediate approach.

>DB was the only thing that was not rebooted after memory exchange.

Do you have a history of performance data here; does the current performance diverge in any dimension from historical norms? You mention a 2.5x factor in CPU performance.

A system that's well-tuned will generally be CPU bound (in user mode!), so I might go look at the modes, and at the box that's not running CPU bound, looking for differences.

Applications that are not CPU bound can be I/O bound, or can be hitting memory stalls, for instance, and you might be able to relieve those bottlenecks and get the box back to being CPU-bound.
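To compare the modes, something along these lines on both clusters:

$ ! Average percentage of time spent in each processor mode
$ MONITOR MODES /AVERAGE
$ ! A large difference in kernel, interrupt or MP synchronization time
$ ! between the clusters points away from a pure user-mode load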

I'm not sure what you mean by "On the twin cluster the I/O rate is around 30; here it is around 210." I tend to look for disk I/O queue depths, rather than rates. Rates might or might not be a problem, but queue depths are an indication of having reached a bandwidth limit.
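For the queue depths, something like:

$ ! Average I/O request queue length per disk; compare both clusters
$ MONITOR DISK /ITEM=QUEUE_LENGTH /AVERAGE
$ ! I/O operation rate per disk, for reference
$ MONITOR DISK /ITEM=OPERATION_RATE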

Is it possible that your host is simply carrying more of the activity here, having had network connections and such fail over during the repairs? (Put another way, is the aggregate CPU load following your historical norms, and is currently just skewed more onto one box?)

FWIW, multiple hosts sharing and coordinating storage in a cluster will always run with more overhead and with somewhat lower performance than will one node, up until the capacity of that node is exceeded. This is due to contention. If anything, you get the best performance by loading each host to saturation, and by splitting up the resources and hardware and data and applications to try to avoid contention among the hosts sharing the load.

Longer term, recognize that your gear here is old and slow, and it might be time to look at an upgrade. A couple of low-end Itanium boxes with a multi-host SCSI shelf (the MSA30-MI, if that's still around) will completely dust this Alpha configuration.
Grzegorz Pawlowski
New Member

Re: High disk IO and low hit ratio

>>HP support stated that now everything should be ok with memory.

>"Trust, but verify."

>I'd confirm the quantity of viable physical memory within the server is the same both before and after the repairs, and I'd also confirm that the hardware caches are not shut off.

Memory quantity is the same as before the exchange.
Where do I check this hardware cache?

> ...That reeks of database activity (as databases tend to have their own internal and tailored I/O caching...

I've noticed that on the bad system most of the DB reads and writes are on the root file. On the good system the reads are mostly on the snapshot (.SNP) files.

>Do you have a history of performance data here; does the current performance diverge in any dimension from historical norms? You mention a 2.5x factor in CPU performance.

I'll dig through the documentation and old health checks, but according to the specs this system is supposed to handle more than it can now.


>I'm not sure what you mean by "On the twin cluster"

By this I mean a perfect copy of software, hardware and system settings. Both clusters handle transactions and have the same content in the database.
They sit behind a load balancer which shares the traffic equally.
Even checking all the application counters, I see the load is even.
Oracle Rdb is on this one disk with the high I/O rate.
I've compared the queues, and on the bad one the average is 0.3 and on the other it is 0.00.

> Longer term, recognize that your gear here is old and slow, and it might be time to look at an upgrade. A couple of low-end Itanium boxes with a multi-host SCSI shelf (the MSA30-MI, if that's still around) will completely dust this Alpha configuration.

If it were my HW I would have changed it a long time ago. Unfortunately it is the customer's call and money. :)

PS. If you have any commands that would be helpful in finding the bottleneck, they would be useful.
P Muralidhar Kini
Honored Contributor

Re: High disk IO and low hit ratio

>> >Reads bypassing cache 45 Writes bypassing cache 0
>> >Reads bypassing cache 416 Writes bypassing cache 1313869
>> ...
>> or possibly a whole pile of corner-case I/O requests (INITIALIZE /ERASE,
>> etc) aimed at the storage.
Yes, that's right.
These counts correspond to XFC's read-around and write-around I/O counts.

When the preconditions for XFC to cache an I/O are not met, the I/O skips the XFC cache. XFC counts such an I/O as a read-around (for read I/Os) or a write-around (for write I/Os), but it still keeps track of how many there were. The preconditions fail when, for example, caching is disabled on the file, caching is disabled on the I/O, the I/O block size is greater than VCC_MAX_IO_SIZE, and so on.
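A rough sketch of how some of these preconditions can be checked for the database files (the file specification below is only a placeholder):

$ ! The per-file caching attribute shows up in a full directory listing
$ DIRECTORY /FULL DISK$RDB:[DB]*.RDB;0
$ ! Per-volume and per-file XFC detail
$ SHOW MEMORY /CACHE /FULL
$ ! Largest I/O size (in blocks) that XFC will cache
$ MCR SYSGEN SHOW VCC_MAX_IO_SIZE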

>> Last I checked, Rdb didn't use XFC.
By this, do you mean
XFC is disabled on the system and hence would not come in to the picture.
OR
XFC is enabled. But Rdb uses its own cache hence and most of the requests
would get satisfied from its cache itself and may be a small number of its
IO's might go through XFC.

Regards,
Murali
Let There Be Rock - AC/DC
Hoff
Honored Contributor

Re: High disk IO and low hit ratio

PMK:

http://download.oracle.com/otndocs/products/rdb/pdf/rdbtf05_buffering.pdf

Tools such as memory management and caching are generic, and work adequately well for most applications. Higher-load applications can and variously will implement their own memory management and caching, as the generic mechanisms aren't able to cache I/O as effectively. Rdb and its caching requirements have been a moving target; check the Rdb documentation for specific recommendations on whether you want XFC caching enabled or not. In various configurations, disabling host caching has been the recommendation. (And if that's disabled, you'll see stuff going past the caches.)
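If the Rdb documentation does recommend keeping the database files out of the host cache, the usual knobs look roughly like this (file and volume names here are placeholders; check the Rdb documentation before touching a production database):

$ ! Mark an individual file as not to be cached by XFC
$ SET FILE /CACHING_ATTRIBUTE=NO_CACHING DISK$RDB:[DB]DATABASE.RDB;0
$ ! Or disable caching for the whole volume when it is mounted
$ MOUNT /SYSTEM /NOCACHE ddcu: volume-label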

For some Rdb activities, such as the RUJ and AIJ, having caching is somewhere between futile and wasteful:

http://download.oracle.com/otndocs/products/rdb/pdf/forums_2006/rdbtf06rs_18_sortedidxbperf.pdf

GP:

> Where do I check this hardware cache?

I look for indications of cache errors in the error log.
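Roughly (the date here is only an example; use the repair date, and note that on V7.3-2 DECevent/DIAGNOSE may be needed to fully translate some entries):

$ ! Error log entries logged since the memory repairs
$ ANALYZE /ERROR_LOG /SINCE=1-DEC-2010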

> I've compared the queues, and on the bad one the average is 0.3 and on the other it is 0.00.

Time to figure out what part of Rdb is tossing out I/O. Is there, for instance, some log file that's gone and gotten overly busy?
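On the Rdb side, something like this (DB_ROOT is only a placeholder for however you reference the database root; ddcu: is the busy disk):

$ ! Live Rdb statistics; watch the file I/O screens for the hot files
$ RMU/SHOW STATISTICS DB_ROOT
$ ! See which files (root, .RUJ, .AIJ, .SNP) are open on the busy disk
$ SHOW DEVICE /FILES ddcu: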

>If it were my HW I would have changed it a long time ago. Unfortunately it is the customer's call and money. :)
>PS. If you have any commands that would be helpful in finding the bottleneck, they would be useful.

Talk to your manager and sort out your escalation process, as well as whatever plans might be appropriate for getting off of boat-anchor hardware.