Operating System - OpenVMS

Andreas Fassl
Frequent Advisor

Weird performance problem

Dear all,

this one is killing me.
For a customer I did some performance tuning. To be safe, I first configured a reference system on a smaller machine to make sure there would be no unexpected failures. Everything was fine, and now I've got an additional performance problem I can't pin down.

The first week the results were good, but now they have response time problems.
The biggest problem is that it is a fully desupported configuration, and there is no chance/budget to upgrade (so this isn't the solution):
- VMS 7.1 (with the recent patches, but not 7.1-1h2 or later)
- Oracle 7.3.2.3
- UCX 4.1 ECO 10 (this is the last ECO)

Maybe some kind soul can give me hints/tips.

First I suspected the UCX patch upgrade I did. But PING results are very good (< 1 ms), and so are the TNSPINGs (< 80 ms).
What I checked:
- LSNRCTL startup times are very long. I checked everything I found after an intensive search on METALINK, such as:
- UCX parameters (large and small buffers, device sockets): all OK.
- TCPIP problems (DNS resolution): none, and the interfaces report no errors.
- SDU etc. values are correct (probably the reason for the now very good TNSPING values).
- Trace files activated, but no real errors found.

What I observe:
Logging in via TCPIP and BEQ is very slow.
Description:
Sqlplus user
... fast response
password: xxxx
20 seconds wait, then accepted.
I don't think that I have a TCPIP/SQL*Net problem.
The same login on the reference system takes only 3 seconds.

Alert-log, sqlnet.trc, etc. don't report anything unusual.

To speed up the database further I recommended a memory upgrade. This was installed (an additional 512 MB kit, refurbished, since no original parts are available).
Prod.System: AS 800 5/333 (640 MB)
Ref.System: AS 1000 4/266 (256 MB)

The rest, especially the disk layout, is a close 1:1 clone of the production environment.

My current suspects:
- Problem within the memory (but I can't find any proof for that)
- Miscalculated system parameters (last Friday the system had problems with the resident programs for Oracle: the global page table was full. I fixed this and it has caused no problems since.)
- The customer told me about occasional problems with the (!) token ring card, interface wt0 (I have never seen any myself in all this time).

Attached are the results of the Oracle RDA (Remote Diagnostic Assistant). I had to do some "patching" to convince the DCL program to run in this environment.

Any hints/pointers greatly appreciated.

Regards

Andreas
Andreas Fassl
Frequent Advisor

Re: Weird performance problem

Did some more analysis.

Another suspect - the disks:

I've got a small tool to test read access times:

ORASRV::SYSTEM $ r access
Device to test: dka300
Seek range in MB (0 for full disk): 10
Single or double buffering (s/d): d
Double buffered average access time is 4.6 ms
Test has completed after 3375 random reads.

ORASRV::SYSTEM $ r access
Device to test: dka200
Seek range in MB (0 for full disk): 10
Single or double buffering (s/d): d
Double buffered average access time is 6.6 ms
Test has completed after 1500 random reads.

ORASRV::SYSTEM $ r access
Device to test: dka100
Seek range in MB (0 for full disk): 10
Single or double buffering (s/d): d
Double buffered average access time is 7.5 ms
Test has completed after 1500 random reads.

ORASRV::SYSTEM $ r access
Device to test: dka0
Seek range in MB (0 for full disk): 10
Single or double buffering (s/d): d
Interrupt

The last test on DKA0 (where the OS and Oracle live together) didn't complete after several minutes. So I interrupted the command.
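For readers without such a tool, the idea behind the measurement can be sketched in a few lines of Python. This is not the "access" program used above (its source is not shown in the thread); it is a minimal single-buffered sketch, and the name average_access_ms and its parameters are made up for illustration. It assumes the target is a regular file, and the operating system's file cache will make the numbers look better than the raw disk unless the file is much larger than the cache.

```python
import os
import random
import time

BLOCK_SIZE = 512  # one disk block per read


def average_access_ms(path, seek_range_mb=0, reads=1500, seed=None):
    """Time `reads` random single-block reads from `path`; return mean ms per read.

    seek_range_mb=0 means "use the whole file", mirroring the prompt of the
    tool shown above. Hypothetical helper, not the original tool.
    """
    size = os.path.getsize(path)
    limit = size if seek_range_mb == 0 else min(size, seek_range_mb * 1024 * 1024)
    blocks = limit // BLOCK_SIZE
    if blocks == 0:
        raise ValueError("target is smaller than one block")
    rng = random.Random(seed)
    # buffering=0 disables Python's own read-ahead; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        for _ in range(reads):
            f.seek(rng.randrange(blocks) * BLOCK_SIZE)
            f.read(BLOCK_SIZE)
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / reads
```

Calling average_access_ms on a large file on each disk in turn, with the same seek range and read count, gives figures that can be compared between disks, which is all the test above relies on.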

The disk report doesn't show any errors.

Disk ORASRV$DKA0:, device type DEC RZ1CB-CS, is online, mounted, file-oriented
device, shareable, available to cluster, error logging is enabled, device is
busy.

Error count                    0    Operations completed           11143527
Owner process                 ""    Owner UIC                      [SYSTEM]
Owner process ID        00000000    Dev Prot            S:RWPL,O:RWPL,G:R,W
Reference count              625    Default buffer size                 512
Total blocks             8380080    Sectors per track                   113
Total cylinders             3708    Tracks per cylinder                  20

Volume label         "AXPVMSSYS"    Relative volume number                0
Cluster size                   9    Transaction count                   574
Free blocks              2956563    Maximum files allowed            419004
Extend quantity                5    Mount count                           1
Mount status              System    Cache name      "_ORASRV$DKA0:XQPCACHE"
Extent cache size             64    Maximum blocks in extent cache   295656
File ID cache size            64    Blocks currently in extent cache  14544
Quota cache size               0    Maximum buffers in FCP cache        184
Volume owner UIC           [1,1]    Vol Prot    S:RWCD,O:RWCD,G:RWCD,W:RWCD

Volume Status: subject to mount verification, protected subsystems enabled,
write-through caching enabled.

Hein van den Heuvel
Honored Contributor

Re: Weird performance problem

Hmm, I think you are onto something with that DKA0 response time. That's your system disk, no? Can you move all Oracle files away from there (like your first control file)?
I would think there is no point in any other analysis until that is resolved. You might try some more speed checks with a simple search/stat or convert/stat, making sure the file is too big to fit in your (small) VCC cache; just search the pagefile or something like that.

There are a few minor odd things, and the tuning is aggressive with the reserved memory and such: ** Reserved memory size = 402653184 greater than created SGA size = 371589120 **
and:
Memory Reservations (pages):     Reserved    In Use    Type
Main Memory (640.00Mb) 81920 13551 65203 3166
ORA_REPORT_SGA                      49152     45361    Allocated
ORA_REPORT_SGA                         48        45    Page Table
Total (384 Mb reserved)             49200     45406

Oracle is right... those 30M are wasted, and are almost 10% of the SGA and 5% of the whole system.
There is still plenty of free memory, but the system was not under load, was it?
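The arithmetic behind those percentages can be checked with a short Python sketch. The byte and page counts are copied from the report quoted above; the 8192-byte page size is an assumption, although it agrees with the report (81920 pages for 640 MB, 49152 pages for 384 MB).

```python
PAGE_BYTES = 8192  # assumed page size; consistent with 81920 pages = 640 MB

reserved_bytes = 402653184   # "Reserved memory size" from the report
sga_bytes = 371589120        # "created SGA size" from the report
main_memory_pages = 81920    # Main Memory (640.00Mb)

wasted_bytes = reserved_bytes - sga_bytes
main_memory_bytes = main_memory_pages * PAGE_BYTES

print(f"reserved pages : {reserved_bytes // PAGE_BYTES}")
print(f"wasted         : {wasted_bytes} bytes = {wasted_bytes / 2**20:.1f} MB")
print(f"share of SGA   : {100 * wasted_bytes / sga_bytes:.1f} %")
print(f"share of memory: {100 * wasted_bytes / main_memory_bytes:.1f} %")
```

With these inputs the unused reservation comes to roughly 29.6 MB, about 8.4% of the SGA and about 4.6% of main memory, which matches the rounded figures above.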


ORA-600 [17114]
see note: Note:34782.1
"KGH Bad magic number in header"
Oracle has detected that the magic number in a memory chunk header has been
overwritten.

This is a heap (in memory) corruption and there is no underlying data
corruption.

The error may occur in the one of the process specific heaps
(the Call heap, PGA heap, or session heap) or in the shared heap (SGA)."

>>> AlphaServer 800 5/333
If you do consider, or are forced into, an Oracle upgrade, then you cannot go very far without also having to upgrade the CPU. This is an EV5, not an EV5.6, so it is not suitable for the latest Oracle. Also... that's a rather dated Alpha with modest memory. It's going to be hard to make it look really good compared to a more modern (Alpha) system.


fwiw,
Hein.

Robert Gezelter
Honored Contributor

Re: Weird performance problem

Andreas,

On the DKA0 timing issue, I would check whether this is an activity issue or a hardware problem. In some cases, small configuration differences can cause dramatically different IO rates, which can produce something similar to what you describe.

I would check the Cumulative IO count on the disk when you are experiencing the pause. If it is continuing to otherwise process normally (and in some cases, the SHOW command itself may be impacted -- after all, paging/image loading is also from this disk), then I would be suspicious that a difference between the configurations is causing a higher io/paging rate, and you are seeing a contention problem.

You could also verify this by running MONITOR DISK from another workstation and checking the results.
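The check on the cumulative IO count boils down to taking two samples of the counter and dividing the difference by the time between them. A minimal Python sketch of that calculation follows; the function name io_rate is made up, and the second sample value and the 10-second interval are hypothetical (only the first value, the "Operations completed" figure, comes from the report above).

```python
def io_rate(ops_before, ops_after, interval_seconds):
    """Return I/O operations per second between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if ops_after < ops_before:
        raise ValueError("counter went backwards (reset or wrong device?)")
    return (ops_after - ops_before) / interval_seconds


# Hypothetical second sample, taken 10 seconds after the first.
print(io_rate(11143527, 11144227, 10.0))  # prints 70.0
```

A rate that stays high during the pause points to contention on a busy disk; a rate that drops to near zero while requests are waiting would point more toward the hardware.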

I have also seen a variety of similar behaviors induced by tuning.

- Bob Gezelter, http://www.rlgsc.com
Ian Miller.
Honored Contributor

Re: Weird performance problem

You said no problems with DNS name resolution - are incoming telnet logins logged in operator.log with the node name of the remote node?

Can you see which step in the login process is taking a long time? Is it the password validation, or something after that (something in SYLOGIN perhaps)?
____________________
Purely Personal Opinion
Andreas Fassl
Frequent Advisor

Re: Weird performance problem

Hi,

thanks for the first responses.

Perhaps I wasn't explicit enough:
- A hardware or software upgrade isn't an option at the moment (neither VMS nor Oracle).
- The clone system (an older CPU generation with a lower clock frequency) is running fine.

@Ian:
Login is no problem. I checked the startup of a SQL session with the old "set watch file/class=all" trick. It is not a network configuration problem. Everything runs fine, just with different (slower) execution times on the AS800.
@Robert:
The problem IS dka0. But I can't find any hint pointing to hardware problems.
Question: Is 70 IO/s more than a DEC RZ1CB-CS can handle?
@Hein:
Yes, I was thinking about relocating the Oracle home directory, but this should be a last resort. And again, the cloned system is running fine.
The memory wasting hint: I'm aware of this, but it is very difficult to calculate the correct size of the individual SGA parameters. I'm using an Excel sheet to do the calculations, but the larger the available memory, the larger the potential loss.
The ORA-600 error you mentioned happens there once a day. Don't ask me why, probably an old bug in 7.3 (and absolutely no chance of getting a patch).

Just for information about the background:
- I set up this system 8 years ago (I had really forgotten this when I accepted the task). A typical launch-and-forget system (a big advantage of OpenVMS). Additional database software was developed in the meantime, but the developer died last year. No sources at the moment, so no chance to tune on the application side.
I'm taking care of this system because we want to offer a follow-up system (VMS based :-) ).
The happier the customer is after the tuning, the bigger the chance to convince them to buy a new one from us. :-)
Andreas Fassl
Frequent Advisor

Re: Weird performance problem

Probably found the solution. For some reason, the trace level setting in sqlnet.ora, in conjunction with some other activity, saturated the system's I/O. Thanks anyway.
The customer is happy; the tuning gave a 2x speedup. Some queries now take a couple of seconds instead of 1 hour. :-)
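For anyone chasing a similar problem, a quick way to spot leftover tracing is to scan sqlnet.ora for TRACE_LEVEL_* parameters that are not switched off. The Python sketch below assumes the usual "NAME = value" layout with "#" comments; the function name and the sample fragment are hypothetical and not taken from the customer system.

```python
import re

# Matches lines such as "TRACE_LEVEL_CLIENT = ADMIN".
TRACE_LINE = re.compile(r"^\s*(TRACE_LEVEL_\w+)\s*=\s*(\w+)", re.IGNORECASE)


def active_trace_settings(sqlnet_text):
    """Return {parameter: value} for TRACE_LEVEL_* entries that are not OFF or 0."""
    active = {}
    for line in sqlnet_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = TRACE_LINE.match(line)
        if match and match.group(2).upper() not in ("OFF", "0"):
            active[match.group(1).upper()] = match.group(2).upper()
    return active


sample = """
# hypothetical sqlnet.ora fragment
TRACE_LEVEL_CLIENT = ADMIN
TRACE_LEVEL_SERVER = OFF
"""
print(active_trace_settings(sample))  # prints {'TRACE_LEVEL_CLIENT': 'ADMIN'}
```

An empty result means no trace level is active in that file; anything else is worth switching off before measuring again.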

regards

Andreas
Andreas Fassl
Frequent Advisor

Re: Weird performance problem

read the last reply. :-)