Performance assistance sought

BoyeDav · ‎01-22-2007

First, let me say I'm by no means an OpenVMS guru.

We have an HP rx3600, dual Itanium2 (dual core), 24GB of RAM running OpenVMS 8.3. The drive sub system is (8) 73GB 15K RPM SAS drives. The system drive consists of two of the drives with hardware mirroring. The remainder of the drives are using RAID server for raid 0+1.

Of the 24GB of RAM, we're reserving 16GB of it for DECRAM, so we actually have 8GB to work with. We'll start utilizing the DECRAM once we have everything else where we want it.

We switched over to this server last week after having been on an Alpha DS20E for five years. The alpha has 2GB of RAM and an external array of 9.1GB SCSI disks.

We're NOT in a cluster.

Given the horsepower of the new server, we should be seeing noticeably faster performance, but that's not the case. For example, we back up our system and data disks locally, then FTP them to another server before they go to tape. The local disk-to-disk backups used to take 2.5 hours to back up 35GB on our old server. They now take nearly six hours for 44GB on the new server. We're using BACKUP, and I basically re-used the backup script for the old server. Not only that, but our users are reporting that their jobs are taking noticeably longer than they used to.
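
For a like-for-like comparison, the two backup runs quoted above can be reduced to average throughput. This is a minimal sketch in Python using only the figures from this post; the function name is illustrative, and "nearly six hours" is taken as 6.0, so the new-server figure is approximate.

```python
def throughput_gb_per_hour(size_gb: float, hours: float) -> float:
    """Return average backup throughput in GB per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return size_gb / hours


# Figures quoted in the post above.
old_rate = throughput_gb_per_hour(35, 2.5)  # Alpha DS20E
new_rate = throughput_gb_per_hour(44, 6.0)  # rx3600, "nearly six hours" taken as 6.0

print(f"old: {old_rate:.1f} GB/h, new: {new_rate:.1f} GB/h")
print(f"new server runs at about {new_rate / old_rate:.0%} of the old rate")
```

Normalising by size matters here because the new backup is also larger (44GB versus 35GB), so the raw elapsed times overstate the slowdown somewhat.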

I know autogen was done as part of the initial install. Beyond that, we ran another autogen a couple days after the system was put into production. After migrating SYSUAF to the new server, we bumped up the quotas on all users to be at least the 8.3 defaults.

All indications seem to be that the system is idle and has horsepower to spare, so I can't understand why it's less responsive than the old system. Seems like a tuning issue rather than a hardware limitation issue.

I'm attaching the output of several monitor sessions as well as our SYSMAN parameters.

Any guidance or help you can provide is appreciated!

Hein van den Heuvel · ‎01-22-2007

My kneejerk reaction is something is hammering the system disk DKA3.
Is there a log going crazy (image accounting? auditing?)

The default file extent is not set in SYSGEN.
Check with SHOW RMS. Similar to alpha?
High-water marking change?

How is the XFC cache behaving? Could it have gotten confused by seeing all this memory, and then see it being taken away for DECram?
You are not using the DECram for now, right?
I guess you did not use it on the Alpha either with the limited memory available.
Try ditching that for a while?

Finally, how did the application get ported?
Are ALIGNMENT FAULTS under control (low 10 - 20 /sec or zero rate)?
Check out with $MONITOR ALIGN

Finally, one monitor picture I am missing is MONI MODE (my favourite :-).

Hope this helps some,

Hein van den Heuvel
HvdH Performance Consulting

Wim Van den Wyngaert · ‎01-22-2007

As Hein said, I suspect RMS settings.

Check sylogin.com for differences between alpha and intel. Or even the login of the user doing the backups.

If not, post show proc/all and/or the logout info of the backup process (batch ?).

Wim

Hoff · ‎01-22-2007

I'd run various series of tests here, tweaking a setting on (or off) and looking for the bottlenecks.

The first limit you are reporting here is apparently involving the I/O bandwidth. It appears saturated.

The RAID volumes are definitely backed up; the Integrity rx3600 series server is tossing I/O significantly faster than the storage can deal with it. (This based on the queue length listings shown in the monitor report. All I/O is waiting. Sometimes massively waiting. The DPA11 device is in really, really, really bad shape.)

General: an I/O queue length of 0.5 means that half of all I/O is waiting. I see one queue -- if those values from MONITOR can be trusted here -- with a queue length of 1.99. That means each I/O is waiting for two other I/Os ahead of it. A continuous rate of 0.5 is ugly. A continuous rate of 1.99 is really ugly. Yes, you can also see transient I/O spikes, and you can usually ignore those.

Tools such as the Freeware DFU or the DFO defragger can be used to detect the level of fragmentation at the volume level. If RMS indexed files are in use, I'd look at tuning and file conversions and cleanup. These can help reduce the I/O load; defragmentation and file conversion can reduce the need for I/O.

I'd also look around for processes generating massive I/O. Run-away, looping or otherwise.

Also look at the size of the I/O cache. If something has managed to disable the cache -- like that honking big DECram reservation setting out there -- then VMS has no choice but to toss I/O at the disk. There are SHOW MEMORY commands and SDA XFC extensions and such for evaluating the efficiency of the XFC cache; commands such as SHOW MEMORY /CACHE /FULL.

I don't see large volumes of reads, but I do see stuff writing to the disk, and looking up and opening files. Do you expect to have a heavy write I/O load as compared with read? (That's not what I usually expect to find on most systems. Are there looping or run-away processes lurking somewhere? Something here is generating direct I/O.)

Find out what process(es) or function(s) are causing that volume of I/O. If it's normal, either work into the applications to reduce it or work into the hardware to increase the I/O bandwidth out to the spindles. Or both. Or spread the I/O load across (more) spindles.

See if there is a volume shadowing operation running.

Snoop around and see if you can figure out what's running the lock manager.

See if the auditing and accounting logs show something odd -- or show volumes of something unexpected, like process failures.

The OpenVMS V8.3 quota defaults are just defaults. Applications can require more. Sometimes far more. For instance, there is a set of process quotas for the BACKUP operator username over in the system manager's manual. The quota ratios over there are tuned for better BACKUP performance.

If you are reading and writing from the same spindles during a BACKUP, you might well be exceeding the capacity of the bus, the controller, or the spindles, too. DLT tapes are very fast devices, and going direct to tape can expedite operations. If you go disk-to-disk, do look at what I/O paths are involved.

FWIW: OpenVMS I64 memory usage, image sizes and other such are generally somewhere between two and four times the equivalent usage and sizes found with OpenVMS Alpha.

Look at the disk structures, too. Did you reinitialize the disks for OpenVMS I64 and reload the contents, or were the disk settings from the 9GB disks brought across from the OpenVMS Alpha system? As a rule of thumb, disk cluster sizes should generally be 16 to 32, or some multiple of 16. This tends to keep transfers aligned, and there are I/O controllers around that are quite sensitive to this. (This recommendation from the OpenVMS I/O engineering team.)

There's a whole lot of information in the tuning and performance manual, if that's of interest. That's a systematic approach toward tuning, and can help you both with some of the details and with the general tuning process you can approach this with. At its simplest (given I'm typing this reply into a slot about a third the size of a VT terminal display) seek details of the problem and a potential trigger for the problem, tweak something related, test it, and either leave it tweaked or untweak it, depending on the results of the testing.

Grab a copy of the T4 tools stuff, and set up to run MONITOR over time. MONITOR has a recording option, and it can be handy and useful to have longer-scale data around what the system is doing, and when. T4 ties into this, too, and allows easier reporting.

I'd start running a continuous MONITOR with recording, and recording at every 1 or 5 or 15 minutes or other such value, depending on the available disk storage and the granularity of the typical system activity. I might well choose to run two passes, one when necessary and at a shorter interval (1 min?) when tuning, and another parallel pass at 15 minute intervals, for long-term trends and tuning and resource tracking.

And a general side note: whilst working at this all, I'd probably choose to move the disks out of allocation class one and keep everything out of allocation class two where you can; these classes should be reserved for potential/eventual use of Fibre Channel disk and tape. This adjustment is not required; simply make it as the opportunity presents itself.

As a second general note: apply current ECOs for OpenVMS I64 V8.3.

And my apologies in advance: this is a big topic area, and this is a very small text box, and I'm aiming at some immediate hot-spots, and spots which may or may not be the trigger. As for next steps, watch for the replies here. Also consider that there are vendors around that can assist with monitoring and with system tuning -- and full disclosure: HoffmanLabs is one -- and these can include HP services and various HP partners. And start running the monitoring; start collecting data.

Stephen Hoffman
HoffmanLabs

Volker Halle · ‎01-22-2007

Performance questions are always very hard to answer in a forum like this and always present fodder for lots of speculation.

Which kind of operation was running when you took the MONITOR data ?

Backup of the system disk to DPA11: with high-water marking enabled ?

Volker.

BoyeDav · ‎01-22-2007

Here's a couple more monitor outputs based on the feedback so far.

I'm going to do some digging based on what everyone has said and see what I can find out. Thanks!

GuentherF · ‎01-22-2007

"...using BACKUP, and I basically re-used the backup script for the old server. Not only that, but our users are reporting that their jobs are taking noticeably longer..."

Would be useful to see the backup command line used in the script. Also the account quota/limits of the process/user running the script.

What is running in those user jobs? Any chance to compare the summary output at the end of the logfile with one from the Alpha system?

/Guenther

BoyeDav · ‎01-22-2007

We have a huge number of alignment faults:

Exec AFault rate average of 40
User Fault rate average of 684

Robert Gezelter · ‎01-22-2007

BoyeDav,

I will not restate Hein and Hoff's comments, which I second.

I will, however, ask about the settings on the target volume for the BACKUP save set. What are the settings on FILE EXTENT size? Also, what are the settings on the account that is running the BACKUP?

I have many times seen impressive performance spans for BACKUP operations.

I would also agree with the comment that it is a challenge to diagnose these online. Much more data is often needed (Disclaimer: We also do consulting work in this area).

- Bob Gezelter, http://www.rlgsc.com

Hoff · ‎01-22-2007

Do locate from whence the alignment faults arise; for whom the alignment faults toll.

Alignment faults are certainly a serious application performance issue and they're not cheap, but I'd not tend to expect them to be triggering a large-scale I/O issue. (There's a MONITOR item for alignment faults in V8.3, too: MONITOR ALIGN. Also see the available SDA extensions for this, the FLT extension, and see the Debugger command SET BREAK /ALIGN.)

The alignment faults I've seen have hammered on the application performance. Not on the disks.

As for what Bob G. mentioned, there's a suggestion around to invoke a SET RMS command sometime before the BACKUP command, and you can see that step improve BACKUP performance. See the section on "What can I do to improve BACKUP performance?" over in the OpenVMS FAQ; the master FAQ distribution is available over at http://www.hoffmanlabs.com/vmsfaq/

I should toss a section on alignment faults into the FAQ, too.

For another discussion of BACKUP performance here in ITRC, see http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=917271

Jan van den Ende · ‎01-22-2007

BoyeDave,

from your Forum Profile:

I have assigned points to 3 of 26 responses to my questions.

Maybe you can find some time to do some assigning?

http://forums1.itrc.hp.com/service/forums/helptips.do?#33

Mind, I do NOT say you necessarily need to give lots of points. It is fully up to _YOU_ to decide how many. If you consider an answer is not deserving any points, you can also assign 0 ( = zero ) points, and then that answer will no longer be counted as unassigned.
Consider, that every poster took at least the trouble of posting for you!

To easily find your streams with unassigned points, click your own name somewhere.
This will bring up your profile.
Near the bottom of that page, under the caption "My Question(s)" you will find "questions or topics with unassigned points " Clicking that will give all, and only, your questions that still have unassigned postings.

Thanks on behalf of your Forum colleagues.

PS. - nothing personal in this. I try to post it to everyone with this kind of assignment ratio in this forum. If you have received a posting like this before - please do not take offence - none is intended!

PPS. - Zero points for this.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

BoyeDav · ‎01-22-2007

I'll try to be better about assigning points.

Thanks for all the suggestions. We have a support contract with HP, so that's an option I might go with.

In the meantime, I have some digging to do to understand a lot of what everyone is saying. I need to understand alignment faults and how to determine file extent size, etc.

Robert Gezelter · ‎01-22-2007

BoyeDav,

As to the RMS settings, the command to show the current settings is SHOW RMS (the command to change the settings is SET RMS). Both commands are documented in the HELP text.

As an example, I generally do the following line in my LOGIN.COM:
$ SET RMS/EXTENT=100000

on the account that I am expecting to write large files from. It is also a good idea in that case to check the growing file from another process to ensure that the file is being extended using the RMS value, and the program does not have its own, hardcoded opinion of how much to extend the file by.

- Bob Gezelter, http://www.rlgsc.com

Karl Rohwedder · ‎01-22-2007

< As an example, I generally do the following
< line in my LOGIN.COM:
< $ SET RMS/EXTENT=100000

Pls. note that the maximum extent is 65535:

$ set rms/ext=100000
%SET-E-VALERR, specified value is out of legal range
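
The rejected value can be checked against the limit quoted above, which is 2**16 - 1. This is a minimal sketch in Python, assuming the extent quantity is counted in 512-byte disk blocks; the function names are illustrative and not part of any VMS API.

```python
BLOCK_BYTES = 512          # bytes in one OpenVMS disk block
MAX_EXTEND_BLOCKS = 65535  # upper limit quoted in the thread (2**16 - 1)


def is_valid_extend(blocks: int) -> bool:
    """Return True if the extent quantity is within the legal range."""
    return 0 <= blocks <= MAX_EXTEND_BLOCKS


def extend_bytes(blocks: int) -> int:
    """Return the size in bytes of one file extension of `blocks` blocks."""
    if not is_valid_extend(blocks):
        raise ValueError(f"{blocks} is out of the legal range 0..{MAX_EXTEND_BLOCKS}")
    return blocks * BLOCK_BYTES


for blocks in (100000, 65535):
    if is_valid_extend(blocks):
        print(f"{blocks} blocks -> {extend_bytes(blocks) / 2**20:.1f} MiB per extension")
    else:
        print(f"{blocks} blocks -> out of legal range")
```

Under that assumption the largest legal extent works out to just under 32 MiB per extension.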

regards Kalle

Colin Butcher · ‎01-22-2007

As someone else pointed out - performance issues are difficult (if not impossible) to assess without actually seeing the system.

Data from collection and visualisation tools such as T4 can help a lot, but it's a case of understanding your workload and seeing all the other things which might be going on that may well be relevant.

I'd encourage you to get some short-term expert help on-site - then work with them and learn from them during the investigation. It's usually a much faster and more effective way to solve problems.

Hope that helps. Cheers, Colin.

Entia non sunt multiplicanda praeter necessitatem (Occam's razor).

BoyeDav · ‎01-23-2007

Here is my SHOW RMS output. It's the same as on the old server.

BoyeDav · ‎01-23-2007

...and here's what I get showing RMS_EXTEND_SIZE from SYSMAN:

current: 0
min: 0
default: 0
max: 65535

Also the same as on the old server.

Robert Gezelter · ‎01-23-2007

Karl,

Indeed. The hazard of posting very late at night after a long day. I did not see the extra '0'.

My apologies for the typographical error.

- Bob Gezelter, http://www.rlgsc.com

Volker Halle · ‎01-23-2007

You did not yet explicitly confirm the workload running on your system at the time when you've been taking the MONITOR samples.

Is my guess true, that you were running BACKUP/IMAGE SYS$SYSDEVICE: DPA11:saveset/SAVE ?

This would explain the DKA3: IO load and the DPA11: IO queue. But if your workload was completely different, these conclusions would just be wrong.

Don't worry about these Alignment Faults yet. Your CPUs are mostly idle and Alignment Faults won't cause disk-IOs. 'Huge numbers' of alignment faults look different...

Volker.

BoyeDav · ‎01-24-2007

Sorry for the slow response. I've been trying to absorb everything and educate myself a little on all this. To answer some of the questions:

The MONITOR results were done during the day under typical workloads. We use an application called POISE, which I'd call a database application. I used the BACKUP times as an example, but users overall are seeing slowness.

For backups, we're using BACKUP/IGNORE=INTERLOCK since we're not backing up the whole volumes. Our system analysts will sometimes archive database copies locally for a couple weeks after major changes, and we don't back up those archives nightly. Including those would add about 15% to our backup size. Would the performance benefits of /IMAGE be worth it?

I've added $SET RMS/EXTEND=50000 to the beginning of my backup script. I've added this to SYLOGIN.COM:

$ set rms/seq/block=127/buff=8
$ set rms/ind/buf=20
$ set rms/extent=4096

Also, I've been informed that the SAS1068 controller that shipped with our rx3600 has no onboard cache. Not sure to what extent that makes a difference. I come from an x86 background, and this is the first time I've dealt with an "enterprise" controller that didn't have some type of cache.

BoyeDav · ‎01-24-2007

With the above settings, I can do a backup of our system in about 8 minutes, and that's during the day with all of our users doing their stuff. Last night, it took 75 minutes with hardly anyone on the system. I've tested this both interactively and running as a batch. Cool!

Hein van den Heuvel · ‎01-24-2007

It's good to see it's coming together.

>> I've added $SET RMS/EXTEND=50000 to the beginning of my backup script.

Fine

>> I've added this to SYLOGIN.COM:
>> $ set rms/seq/block=127/buff=8
>> $ set rms/ind/buf=20
>> $ set rms/extent=4096

Not so fine, but better than what you have.
You don't want to execute that over and over.
Why not SET RMS/SYS for all, or better still, do the SYSGEN work to have this permanently set?

Also, the sequential file tuning is a bit aggressive. You are telling the system to allocate 512 KB for each and every sequential file opened, for each and every process. That's a significant jump from the 32 KB original default.
Also, 127, while the max, is 'odd'.
I would suggest /block=64/buf=3 as an intermediate step system wide.
Individual, well understood, processes may like /bloc=112/buf=8 ! (112 = 7*16)
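
The memory figures behind these suggestions follow from multiblock count times buffer count times block size. This is a minimal sketch in Python, assuming 512-byte disk blocks; the helper name is illustrative, and each result is the buffer space for one open sequential file.

```python
BLOCK_BYTES = 512  # bytes in one OpenVMS disk block


def rms_buffer_kib(multiblock_count: int, buffer_count: int) -> float:
    """Return buffer space in KiB for one open sequential file."""
    return multiblock_count * buffer_count * BLOCK_BYTES / 1024


# (multiblock count, buffer count) pairs named in this thread.
settings = {
    "SYLOGIN.COM as posted (/block=127/buff=8)": (127, 8),
    "suggested system-wide (/block=64/buf=3)": (64, 3),
    "suggested per-process (/bloc=112/buf=8)": (112, 8),
}

for label, (blocks, buffers) in settings.items():
    print(f"{label}: {rms_buffer_kib(blocks, buffers):.0f} KiB per open file")
```

The first setting comes to 508 KiB, which is the roughly 512 KB per file mentioned above; the system-wide suggestion is about a fifth of that.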

Regards,
Hein van den Heuvel
HvdH Performance Consulting

BoyeDav · ‎01-24-2007

Thanks again for the guidance. I'm still a little nervous poking around SYSGEN, but I can implement the other suggestions easily enough.

BoyeDav · ‎01-24-2007

IO Queue request rates are showing a constant 0.0 for average, min, and max. Things seem to be looking healthier.
