
Terrible undo bottleneck on Oracle 10g RAC.

 
ricardor_1
Frequent Advisor


Greetings,

I've just migrated from 9i to a 10g R2 Oracle RAC setup with two rp4440s running HP-UX 11.11. We've been experiencing very high I/O contention on the undo disks.

We have 6 LUNs shared between the two nodes, and we are using ASM. These undo areas are creating a huge bottleneck on our HDS AMS500 storage. The disks are fairly well distributed across the RAID groups, but these areas are getting too many I/O operations even when the database load is low. Glance shows two of the disks at 100% utilization at all times. There's no apparent storage problem, since we've also done volume migration on the HDS to equalize the load. We use HDLM for multipathing and load balancing.
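For reference, this is roughly how we've been watching the disks (the interval and count are arbitrary choices, not anything special):

    # Sample disk activity every 5 seconds, 12 times; high %busy together
    # with growing avque/avwait on the undo LUNs points at saturation
    sar -d 5 12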

I'm attaching a tarball with some sar output, vmstat output, and the kernel parameters. The system was patched with the June 2007 bundle, updated with the superseding patches through November 2007.

We've also enabled asyncdsk with max_async_ports=1024 and aio_max_ops=2048, and the dba group has the MLOCK privilege.
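For completeness, the async I/O setup looks roughly like this (a sketch of the usual HP-UX 11.11 steps; major number 101 is the common asyncdsk major, and the owner/group names may differ on your system):

    # Create the async driver device file and hand it to Oracle
    /usr/sbin/mknod /dev/async c 101 0x0
    chown oracle:dba /dev/async
    chmod 660 /dev/async

    # Grant the dba group the MLOCK privilege, and persist it across reboots
    /usr/sbin/setprivgrp dba MLOCK
    echo "dba MLOCK" >> /etc/privgroup

    # Kernel tunables (a kernel rebuild and reboot may be required on 11.11)
    /usr/sbin/kmtune -s max_async_ports=1024
    /usr/sbin/kmtune -s aio_max_ops=2048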
Steven E. Protter
Exalted Contributor

Re: Terrible undo bottleneck on Oracle 10g RAC.

Shalom,

Though you didn't say it explicitly, I'm assuming the problematic disk is c8t2d4.

It's got a huge wait time and seems buried in I/O.

Oracle recommends that the underlying storage for redo, data, and index be RAID 1 or RAID 10 for a write-heavy OLTP database.

I'm wondering if you are at that level on the storage.

The next thing I'd look at would be system SCSI and I/O patches, and Oracle patches.
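A quick way to see what's installed (standard swlist usage; the grep filter is just an example):

    # Installed bundles and patches
    /usr/sbin/swlist -l bundle
    /usr/sbin/swlist -l patch | grep PHKL    # kernel patches, for instance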

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
ricardor_1
Frequent Advisor

Re: Terrible undo bottleneck on Oracle 10g RAC.

Steven,

You are right. Thinking it over, you shed some light on this: we are using RAID-5. Previously we had the same RAID-5 setup, but we used SGeRAC with LVM to support RAC. The lvols were created distributed over huge VGs containing many disks.

We had to migrate that environment without being able to stress-test it. It was also a database split at the same time, so we did not know how the load was distributed between the two databases.

As we cannot format our storage or add new disk enclosures to it, we'll split the LUNs over several RAID-5 groups and add them to the ASM DGs, hopefully improving the aggregate IOPS capacity.

Also, the DBAs will update to R4, since we saw some traces of the AIO queues filling up (we'll also raise aio_max_ops to 4096).
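The tunable change itself is a one-liner (same caveat as above about a kernel rebuild and reboot on 11.11):

    /usr/sbin/kmtune -s aio_max_ops=4096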

Thank you very much. If anybody has further suggestions, they are welcome.
TwoProc
Honored Contributor

Re: Terrible undo bottleneck on Oracle 10g RAC.

Having lots of undo requests has a lot to do with busy activity, and it isn't just an I/O problem in and of itself. Has your team run some statspack reports to determine what the biggest I/O consumer is? You may find some intersection between the highest I/O processes in the report and the highest I/O consumers of the undo space.
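For anyone following along, a statspack run is roughly this (it assumes the perfstat schema is installed; the password is a placeholder):

    # Take a snapshot before and after a busy window, then build the report;
    # spreport.sql prompts for the begin/end snap IDs and an output file name
    sqlplus perfstat/yourpassword <<'EOF'
    EXECUTE statspack.snap;
    EOF
    sqlplus perfstat/yourpassword '@?/rdbms/admin/spreport.sql'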

Also, the undo tablespace is pretty much like any other tablespace in regards to I/O. That is, it can be cached to reduce the hit. How big is your buffer cache? Do you have the buffer cache advisor turned on?
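Checking the advisor takes a minute (v$db_cache_advice is populated when db_cache_advice=ON; an 8K block size is assumed here):

    sqlplus -s '/ as sysdba' <<'EOF'
    -- Estimated physical-read factor at each candidate buffer cache size
    SELECT size_for_estimate AS cache_mb,
           estd_physical_read_factor
    FROM   v$db_cache_advice
    WHERE  name = 'DEFAULT'
      AND  block_size = 8192
      AND  advice_status = 'ON';
    EOF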

Also, I see that you're on RAID 5 for this. You really should be on RAID 0+1 for that, or if not, then just RAID 1. Ditto for temp, redo logs, and archive logs.

Lastly, it's very simple to spread that I/O around for the undo tablespace. If you've got more disk, just move some of the files around to other, less hot file areas. Or better yet, if you still have unused disk, create several new lvols at the proper RAID level, create your undo tablespace there, use multiple files, and interleave them across the new mount points, as sketched below.
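A sketch of that last suggestion (the tablespace name, paths and sizes are made up; with ASM you would point the datafiles at a diskgroup instead):

    sqlplus -s '/ as sysdba' <<'EOF'
    -- New undo tablespace with datafiles interleaved across mount points
    CREATE UNDO TABLESPACE undotbs2
      DATAFILE '/u05/oradata/PROD/undotbs2_01.dbf' SIZE 2G,
               '/u06/oradata/PROD/undotbs2_02.dbf' SIZE 2G,
               '/u07/oradata/PROD/undotbs2_03.dbf' SIZE 2G;
    -- Point the instance at it
    ALTER SYSTEM SET undo_tablespace = 'UNDOTBS2';
    EOF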

Your answer may be in all three.

A bit more off track, and less likely, but I'll at least bring them up for discussion...

-> 6 LUNs - how many disks are represented by that number? Would that be 4 Open-Ls across a single HDS 4-pack each, meaning 24 disks? If so, that's a pretty good distribution across a number of disks. Are those disks using controllers in rotating order across the HDS? How much write cache do you have on the HDS?

Also, what is the SCSI queue depth on the LUNs? The default is 8, and you may have to increase it. "sar -d" will give you average queue waits. Maybe this needs to be bumped.
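The queue-depth check and bump look like this on 11.11 (the device name is just an example; scsi_max_qdepth is the system-wide default, 8 out of the box):

    # Show current mode parameters, then raise the queue depth for one LUN
    /usr/sbin/scsictl -a /dev/rdsk/c8t2d4
    /usr/sbin/scsictl -m queue_depth=16 /dev/rdsk/c8t2d4

    # Or raise the system-wide default
    /usr/sbin/kmtune -s scsi_max_qdepth=16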
We are the people our parents warned us about --Jimmy Buffett
ricardor_1
Frequent Advisor

Re: Terrible undo bottleneck on Oracle 10g RAC.

> Having lots of undo requests has a lot to do with busy activity, and it isn't just an I/O problem in and of itself. Has your team run some statspack reports to determine what the biggest I/O consumer is? You may find some intersection between the highest I/O processes in the report and the highest I/O consumers of the undo space.

We already knew that. It was a job which wrote several million rows before committing. It's disabled now, yet we had that very I/O contention even when it was not running. We had some undo pages which needed to be taken offline when we added more disks to two new DGs, where we created new undo areas.

> Also, the undo tablespace is pretty much like any other tablespace in regards to I/O. That is, it can be cached to reduce the hit. How big is your buffer cache? Do you have the buffer cache advisor turned on?

I cannot tell you the exact number, but I assure you it's big enough. We had the very same database running on 9i with a very small memory footprint. Currently, we have 7.5 GB of RAM allocated to Oracle in total.
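For the record, the exact numbers are easy to pull on 10g (v$sgainfo breaks the SGA down, buffer cache included):

    sqlplus -s '/ as sysdba' <<'EOF'
    SELECT name, ROUND(bytes/1024/1024) AS mb FROM v$sgainfo;
    EOF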

> Also, I see that you're on RAID 5 for this. You really should be on RAID 0+1 for that, or if not, then just RAID 1. Ditto for temp, redo logs, and archive logs.

> Lastly, it's very simple to spread that I/O around for the undo tablespace. If you've got more disk, just move some of the files around to other, less hot file areas. Or better yet, if you still have unused disk, create several new lvols at the proper RAID level, create your undo tablespace there, use multiple files, and interleave them across the new mount points.

> Your answer may be in all three.

We added 14 new disks divided between two DGs. We now have two undo areas, one in each of the disk groups. They were balanced across all the RAID groups we had available, so we ended up with a kind of RAID 0+5 by using striped DGs.
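The RAC side of that is just pointing each instance at its own undo tablespace (the instance and tablespace names below are examples):

    sqlplus -s '/ as sysdba' <<'EOF'
    -- Each RAC instance gets its own undo tablespace (requires an spfile)
    ALTER SYSTEM SET undo_tablespace = 'UNDOTBS1' SCOPE=BOTH SID='prod1';
    ALTER SYSTEM SET undo_tablespace = 'UNDOTBS2' SCOPE=BOTH SID='prod2';
    EOF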

> A bit more off track, and less likely, but I'll at least bring them up for discussion...

> -> 6 LUNs - how many disks are represented by that number? Would that be 4 Open-Ls across a single HDS 4-pack each, meaning 24 disks? If so, that's a pretty good distribution across a number of disks. Are those disks using controllers in rotating order across the HDS? How much write cache do you have on the HDS?

We had 3 LUNs (each representing 5 physical drives in a 4+1 RAID-5 setup) in each of the two DGs, with two undo areas, one per DG. It happens that only one of them was being heavily demanded, so we were actually down to just 3 LUNs (about 600 IOPS each)...

> Also, what is the SCSI queue depth on the LUNs? The default is 8, and you may have to increase it. "sar -d" will give you average queue waits. Maybe this needs to be bumped.

Queue depths are very low now that we distributed the undo areas.

Thank you!