Performance assistance sought

BoyeDav · ‎01-22-2007

First, let me say I'm by no means an OpenVMS guru.

We have an HP rx3600 with dual Itanium 2 (dual-core) processors and 24GB of RAM, running OpenVMS 8.3. The drive subsystem is eight 73GB 15K RPM SAS drives. The system drive consists of two of the drives with hardware mirroring. The remainder of the drives are using RAID server for RAID 0+1.

Of the 24GB of RAM, we're reserving 16GB for DECram, so we actually have 8GB to work with. We'll start using the DECram once we have everything else where we want it.

We switched over to this server last week after having been on an Alpha DS20E for five years. The Alpha has 2GB of RAM and an external array of 9.1GB SCSI disks.

We're NOT in a cluster.

Given the horsepower of the new server, we should be seeing noticeably faster performance, but that's not the case. For example, we back up our system and data disks locally, then FTP the save sets to another server, where they go to tape. The local disk-to-disk backups used to take 2.5 hours for 35GB on our old server. They now take nearly six hours for 44GB on the new server. We're using BACKUP, and I basically re-used the backup script from the old server. Not only that, but our users are reporting that their jobs are taking noticeably longer than they used to.
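To put those two backup figures on the same scale, here is a rough throughput calculation. It is only a sketch: it assumes 1 GB = 1024 MB, treats "nearly six hours" as exactly 6, and `throughput_mb_per_s` is a name made up for illustration.

```python
def throughput_mb_per_s(gigabytes, hours):
    """Average transfer rate in MB/s, assuming 1 GB = 1024 MB."""
    return gigabytes * 1024 / (hours * 3600)

old = throughput_mb_per_s(35, 2.5)  # Alpha DS20E: 35GB in 2.5 hours
new = throughput_mb_per_s(44, 6.0)  # rx3600: 44GB in about 6 hours (assumed)

print(f"old: {old:.2f} MB/s, new: {new:.2f} MB/s, ratio: {new / old:.2f}")
```

On these numbers the new server is moving roughly half as much data per second as the old one did.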

I know AUTOGEN was run as part of the initial install. Beyond that, we ran AUTOGEN again a couple of days after the system was put into production. After migrating SYSUAF to the new server, we bumped up the quotas on all users to be at least the 8.3 defaults.

All indications are that the system is idle and has horsepower to spare, so I can't understand why it's less responsive than the old system. It seems like a tuning issue rather than a hardware limitation.

I'm attaching the output of several monitor sessions as well as our SYSMAN parameters.

Any guidance or help you can provide is appreciated!

Hein van den Heuvel · ‎01-22-2007

My knee-jerk reaction is that something is hammering the system disk DKA3.
Is there a log going crazy (image accounting? auditing?)

The default file extent is not set in SYSGEN.
Check with SHOW RMS. Similar to the Alpha?
High-water marking change?

How is the XFC cache behaving? Could it have gotten confused by seeing all this memory, and then see it being taken away for DECram?
You are not using the DECram for now, right?
I guess you did not use it on the Alpha either with the limited memory available.
Try ditching that for a while?

Finally, how did the application get ported?
Are ALIGNMENT FAULTS under control (a low rate of 10 - 20 /sec, or zero)?
Check with $MONITOR ALIGN

Finally, one monitor picture I am missing is MONI MODE (my favourite :-).

Hope this helps some,

Hein van den Heuvel
HvdH Performance Consulting

Wim Van den Wyngaert · ‎01-22-2007

As Hein said, I suspect RMS settings.

Check sylogin.com for differences between the Alpha and the Itanium system. Or even the login of the user doing the backups.

If not, post show proc/all and/or the logout info of the backup process (batch?).

Wim

Hoff · ‎01-22-2007

I'd run various series of tests here, tweaking a setting on (or off) and looking for the bottlenecks.

The first limit you are reporting here is apparently involving the I/O bandwidth. It appears saturated.

The RAID volumes are definitely backed up; the Integrity rx3600 series server is tossing I/O significantly faster than the storage can deal with it. (This is based on the queue length listings shown in the monitor report. All I/O is waiting. Sometimes massively waiting. The DPA11 device is in really, really, really bad shape.)

General: an I/O queue length of 0.5 means that half of all I/O is waiting. I see one queue -- if those values from MONITOR can be trusted here -- with a queue length of 1.99. That means each I/O is waiting for two other I/Os ahead of it. A continuous rate of 0.5 is ugly. A continuous rate of 1.99 is really ugly. Yes, you can also see transient I/O spikes, and you can usually ignore those.
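One way to turn a queue length into something more tangible is Little's law (mean queue length = arrival rate times mean time in the system), which holds for steady-state averages. The sketch below is only an illustration: the 100 I/Os per second rate is invented rather than taken from the attached MONITOR output, and whether the resulting time includes service time depends on whether the reported queue length counts the request currently in service.

```python
def mean_response_time_ms(queue_length, io_per_second):
    """Little's law: L = lambda * W, so W = L / lambda (returned in ms)."""
    if io_per_second <= 0:
        raise ValueError("I/O rate must be positive")
    return queue_length / io_per_second * 1000.0

# Queue lengths are the ones quoted above; the I/O rate is hypothetical.
for queue_length in (0.5, 1.99):
    ms = mean_response_time_ms(queue_length, io_per_second=100.0)
    print(f"queue length {queue_length}: about {ms:.1f} ms per I/O")
```

At the same I/O rate, a sustained queue length of 1.99 implies roughly four times the per-I/O time of a queue length of 0.5.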

Tools such as the Freeware DFU or the DFO defragger can be used to detect the level of fragmentation at the volume level. If RMS indexed files are in use, I'd look at tuning and file conversions and cleanup. These can help reduce the I/O load; defragmentation and file conversion can reduce the need for I/O.

I'd also look around for processes generating massive I/O. Run-away, looping or otherwise.

Also look at the size of the I/O cache. If something has managed to disable the cache -- like that honking big DECram reservation setting out there -- then VMS has no choice but to toss I/O at the disk. There are SHOW MEMORY commands and SDA XFC extensions and such for evaluating the efficiency of the XFC cache; commands such as SHOW MEMORY /CACHE /FULL.

I don't see large volumes of reads, but I do see stuff writing to the disk, and looking up and opening files. Do you expect to have a heavy write I/O load as compared with read? (That's not what I usually expect to find on most systems. Are there looping or run-away processes lurking somewhere? Something here is generating direct I/O.)

Find out what process(es) or function(s) are causing that volume of I/O. If it's normal, either work into the applications to reduce it or work into the hardware to increase the I/O bandwidth out to the spindles. Or both. Or spread the I/O load across (more) spindles.

See if there is a volume shadowing operation running.

Snoop around and see if you can figure out what's running the lock manager.

See if the auditing and accounting logs show something odd -- or show volumes of something unexpected, like process failures.

The OpenVMS V8.3 quota defaults are just defaults. Applications can require more. Sometimes far more. For instance, there is a set of process quotas for the BACKUP operator username over in the system manager's manual. The quota ratios over there are tuned for better BACKUP performance.

If you are reading and writing from the same spindles during a BACKUP, you might well be exceeding the capacity of the bus or controller or spindles, too. DLT tapes are very fast devices, and going direct to tape can expedite operations. If you go disk-to-disk, do look at what I/O paths are involved.

FWIW: OpenVMS I64 memory usage, image sizes and other such are generally somewhere between two and four times the equivalent usage and sizes found with OpenVMS Alpha.

Look at the disk structures, too. Did you reinitialize the disks for OpenVMS I64 and reload the contents, or were the disk settings from the 9GB disks brought across from the OpenVMS Alpha system? As a rule of thumb, disk cluster sizes should generally be 16 to 32, or some multiple of 16. This tends to keep transfers aligned, and there are I/O controllers around that are quite sensitive to this. (This recommendation from the OpenVMS I/O engineering team.)
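As a small illustration of the multiple-of-16 rule of thumb, here is a sketch that checks a cluster size and rounds it up to the next multiple of 16. The function names and the sample cluster sizes are made up for illustration; the actual cluster size of a volume has to be read from the volume itself.

```python
def cluster_size_is_aligned(cluster_size, multiple=16):
    """True when the cluster size (in blocks) is a positive multiple of `multiple`."""
    return cluster_size > 0 and cluster_size % multiple == 0

def round_up_to_multiple(cluster_size, multiple=16):
    """Smallest multiple of `multiple` that is >= cluster_size."""
    return -(-cluster_size // multiple) * multiple

for size in (9, 16, 32, 50):  # hypothetical cluster sizes, in blocks
    print(size, cluster_size_is_aligned(size), round_up_to_multiple(size))
```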

There's a whole lot of information in the tuning and performance manual, if that's of interest. That's a systematic approach toward tuning, and can help you both with some of the details and with the general tuning process you can approach this with. At its simplest (given I'm typing this reply into a slot about a third the size of a VT terminal display) seek details of the problem and a potential trigger for the problem, tweak something related, test it, and either leave it tweaked or untweak it, depending on the results of the testing.

Grab a copy of the T4 tools stuff, and set up to run MONITOR over time. MONITOR has a recording option, and it can be handy and useful to have longer-scale data around what the system is doing, and when. T4 ties into this, too, and allows easier reporting.

I'd start running a continuous MONITOR with recording, and recording at every 1 or 5 or 15 minutes or other such value, depending on the available disk storage and the granularity of the typical system activity. I might well choose to run two passes, one when necessary and at a shorter interval (1 min?) when tuning, and another parallel pass at 15 minute intervals, for long-term trends and tuning and resource tracking.

And a general side note: whilst working at this all, I'd probably choose to move the disks out of allocation class one and keep everything out of allocation class two where you can; these classes should be reserved for potential/eventual use of Fibre Channel disk and tape. This adjustment is not required, and can simply be made as the opportunity presents itself.

As a second general note: apply current ECOs for OpenVMS I64 V8.3.

And my apologies in advance: this is a big topic area, and this is a very small text box, and I'm aiming at some immediate hot-spots, and spots which may or may not be the trigger. As for next steps, watch for the replies here. Also consider that there are vendors around that can assist with monitoring and with system tuning -- and full disclosure: HoffmanLabs is one -- and these can include HP services and various HP partners. And start running the monitoring; start collecting data.

Stephen Hoffman
HoffmanLabs

Volker Halle · ‎01-22-2007

Performance questions are always very hard to answer in a forum like this and always present fodder for lots of speculation.

Which kind of operation was running when you took the MONITOR data?

Backup of the system disk to DPA11: with high-water marking enabled?

Volker.

BoyeDav · ‎01-22-2007

Here's a couple more monitor outputs based on the feedback so far.

I'm going to do some digging based on what everyone has said and see what I can find out. Thanks!

GuentherF · ‎01-22-2007

"...using BACKUP, and I basically re-used the backup script for the old server. Not only that, but our users are reporting that their jobs are taking noticeably longer..."

Would be useful to see the backup command line used in the script. Also the account quota/limits of the process/user running the script.

What is running in those user jobs? Any chance to compare the summary output at the end of the logfile with one from the Alpha system?

/Guenther

BoyeDav · ‎01-22-2007

We have a huge number of alignment faults:

Exec AFault rate average of 40
User Fault rate average of 684

Robert Gezelter · ‎01-22-2007

BoyeDav,

I will not restate Hein and Hoff's comments, which I second.

I will, however, ask about the settings on the target volume for the BACKUP save set. What is the FILE EXTENT size set to? Also, what are the settings on the account that is running the BACKUP?

I have many times seen impressive performance spans for BACKUP operations.

I would also agree with the comment that it is a challenge to diagnose these online. Much more data is often needed (Disclaimer: We also do consulting work in this area).

- Bob Gezelter, http://www.rlgsc.com

Hoff · ‎01-22-2007

Do locate from whence the alignment faults arise; for whom the alignment faults toll.

Alignment faults are certainly a serious application performance issue and they're not cheap, but I'd not tend to expect them to be triggering a large-scale I/O issue. (There's a MONITOR item for alignment faults in V8.3, too: MONITOR ALIGN. Also see the available SDA extensions for this, the FLT extension, and see the Debugger command SET BREAK /ALIGN.)

The alignment faults I've seen have hammered on the application performance. Not on the disks.

As for what Bob G. mentioned, there's a suggestion around to invoke a SET RMS command sometime before the BACKUP command, and you can see that step improve BACKUP performance. See the section on "What can I do to improve BACKUP performance?" over in the OpenVMS FAQ; the master FAQ distribution is available over at http://www.hoffmanlabs.com/vmsfaq/

I should toss a section on alignment faults into the FAQ, too.

For another discussion of BACKUP performance here in ITRC, see http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=917271
