High load

Consty · ‎11-19-2007

I am facing very slow response from a Tru64 system with 4 CPUs. The load is always high, around 100% CPU. I suspect new Oracle software that was installed one day before; the problem started after the system was rebooted.
I would like to know where the problem comes from. I am not used to Tru64 and I need help.
How can I proceed to tune the system?
Thanks for your help
Consty

Rob Leadbeater · ‎11-19-2007

Hi Consty,

Can you provide a bit more detail please.

What Oracle software has been loaded ?
Exactly what hardware have you got ?

You say the problem started after a reboot. When was the system last restarted prior to this ? Could any other changes have happened in that time that only take effect on reboot ?

Hope this helps,

Regards,

Rob

P.S. I've asked a moderator to move your question to the Tru64 forums where it will get more attention.

Consty · ‎11-19-2007

Thanks for your answer.

What Oracle software has been loaded ?
-JRE 1.4.2
-WGET 1.1.10
-Oracle Enterprise Manager Agent 10.2.0.2.0

Exactly what hardware have you got ?
-2 x Alpha ES45
-1 x MSA1000
-Tru64 5.1.B
-TruCluster
-Oracle 9.2.0.5

You say the problem started after a reboot. When was the system last restarted prior to this ? Could any other changes have happened in that time that only take effect on reboot ?

-The software was installed on Thursday morning. The system was first rebooted at noon, and the users noticed long response times. On Friday the system was restarted again, and since then the chart has shown high load.

Regards

Ivan Ferreira · ‎11-19-2007

>>> the problem started after the systeme was rebooted.

High load in a cluster environment can be caused by the wrong CFS owners. Probably, when you rebooted the server, control of all cluster file systems was taken by the remaining node.

You have to ensure that each node serves the file systems it accesses most. You can check and change this with the cfsmgr command.

To check who is the owner:
cfsmgr /oradata

To relocate:
cfsmgr -a server= /oradata
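
To check several file systems in one pass, the owner query above can be wrapped in a small loop. This is only a sketch: the function name and the mount points in the example are placeholders, and it assumes nothing about cfsmgr beyond the "cfsmgr <mount point>" form already shown.

```sh
# Print the CFS server information for each mount point given as an argument.
# Relies only on the "cfsmgr <mount point>" form shown above.
check_cfs_owners() {
    for fs in "$@"; do
        echo "== $fs"
        cfsmgr "$fs"
    done
}

# Example (mount points are placeholders, substitute your own):
#   check_cfs_owners /oradata /oraindex
```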

Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?

Srikanth Arunachalam · ‎11-19-2007

Hi Consty,

Oracle 10g on Tru64 can contribute to performance issues for several reasons.

(1) If ASSM is set, turn it off. Oracle advises that there is a problem with ASSM and that it has to be switched off.

(2) What do the AWR reports say? You will find the AWR report scripts in $ORACLE_HOME/rdbms/admin/awrrpt.* Run them and get the statistics in HTML.

(3) Check what contributes most to the load using Enterprise Manager. Oracle's Enterprise Manager is itself not light on resources, so get the graphical performance statistics from Enterprise Manager and then turn it off.

Please send me the AWR and Enterprise Manager reports and I will be able to give you more input.

Thanks,
Srikanth

Rob Leadbeater · ‎11-20-2007

But the database version is 9.2.0.5 not 10g.

The only part of 10g that has been loaded is the Enterprise Manager Agent.

Personally I would avoid anything Oracle 10g related on Tru64...

Cheers,

Rob

Vladimir Fabecic · ‎11-20-2007

First do what Ivan suggested.
Or just reboot the other member (while this one is up).
By the way, how about details such as which processes are at the top, memory utilization, I/O load, etc.?
Ivan may be right, but there may also be other issues.

In vino veritas, in VMS cluster

Srikanth Arunachalam · ‎11-20-2007

hi,

Sorry, I confused the Enterprise Manager version with the Oracle database version. In that case, I would like to have the Statspack report, please. It is very easy to install: installation is just creating a user and running a script found in $ORACLE_HOME/rdbms/admin.

Let me know if you want more information on this.

Thanks,
Srikanth

Consty · ‎11-20-2007

Thanks again.

Ivan's suggestion: the cfsmgr output is OK for all the file systems.

Vladimir's suggestion: the top process is "kernel_idle" and the memory looks correct (no swap). Anyway, I am going to check it again.

What else should I check on the system (which outputs do you need) before looking at Oracle?

Regards
Consty

Consty · ‎11-20-2007

Hi Srikanth, all,
You'll find here attached the statspack report.
Thanks
Regards
Consty

Ivan Ferreira · ‎11-20-2007

Please post the output of:

collect -scpm -om -S -n 10

Collect this information for some time, compress and attach the file.
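
One way to do the capture and compression in a single step is sketched below. The collect command line is exactly the one requested above; the function name and output file name are placeholders, and it assumes gzip is installed (compress would do the same job if it is not).

```sh
# Run the collect command requested above, save its text output, and compress it.
# The default output file name is a placeholder.
capture_collect() {
    out=${1:-collect_$(hostname).txt}
    collect -scpm -om -S -n 10 > "$out" &&
    gzip -f "$out" &&
    echo "$out.gz"    # name of the file to attach
}

# Example:
#   capture_collect collect_nodeB.txt
```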

Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?

Vladimir Fabecic · ‎11-21-2007

[kernel idle] is a catch-all in the Tru64 UNIX kernel. It gets all of the internal overhead kernel threads, including things like sync-ing the disks, environmental monitoring, some aspects of disk I/O, some memory management overhead, and so on. Basically, it's a "catch all" for the things the kernel is doing on behalf of the system as a whole that can't be blamed on any specific user "job" or process.
So many things can cause your problem.
I/O would be my first guess.
So please send the output Ivan asked for.
Did you reboot the other machine?

In vino veritas, in VMS cluster

Srikanth Arunachalam · ‎11-21-2007

Hi Consty,

Look at the "load profile" in the statspack,

(1) You have a high hard parse rate (7.85 per second).

(2) The number of executes (187 per second) and transactions is also very large.

There is heavy load on the system. I would think about increasing the shared pool size to give Oracle a chance to keep more execution plans in memory. If your shared pool is small, Oracle has to devise the execution plan again for your statements, and hence more time is spent.

Look at the "Instance Efficiency Percentages (Target 100%)"

(1) I am not pleased with the Library Hit ratio of 95.96 (I would expect it to be higher).

A low Library Hit ratio can indicate a shared pool that is too small or, just as likely, an application that does not make correct use of bind variables.

(2) The Soft Parse % is also low (93.96); it is expected to be nearly 100.

The Soft Parse % is one of the most important ratios (if not the only important one) in the database. For a typical OLTP system, it should be as near to 100% as possible.

So, take a look at your application, make good use of bind variables, and increase the shared pool size.

Please let me know how much physical memory you have, and send another Statspack report taken during heavy load and one during light load.
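
For reference, Statspack derives Soft Parse % as 100 * (1 - hard parses / total parses). The sketch below applies that formula with awk; the function names are made up for illustration, and the only inputs are the two figures quoted above.

```sh
# Soft Parse % = 100 * (1 - hard parses / total parses)
soft_parse_pct() {          # args: hard_parses total_parses (same unit)
    awk -v h="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * (1 - h / t) }'
}

# Total parse rate implied by a hard parse rate and a Soft Parse %
implied_total_parses() {    # args: hard_parses soft_parse_pct
    awk -v h="$1" -v s="$2" 'BEGIN { printf "%.1f\n", h / (1 - s / 100) }'
}

# With the figures quoted from the report:
#   implied_total_parses 7.85 93.96
```

With 7.85 hard parses per second and a Soft Parse % of 93.96, the implied total is roughly 130 parses per second, so about one parse in seventeen is a hard parse.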

Thanks,
Srikanth

Srikanth Arunachalam · ‎11-21-2007

Hi Consty,

More findings from your Statspack report. Refer to the "Top 5 Timed Events" section.

(1) The CPU Time is very large (1368/s)

CPU time is not really a wait event (hence, the new name), but rather the sum of the CPU used by this session, or the amount of CPU time used during the snapshot window. In a heavily loaded system, if the CPU time event is the biggest event, that could point to some CPU-intensive processing (for example, forcing the use of an index when a full scan should have been used), which could be the cause of the bottleneck.

(2) "db file sequential read" is also very large (2,334/s), with a high number of waits (138,824).

db file sequential read - This wait event is posted for single-block reads, most often index lookups and table access by rowid. Waits on TEMP space (direct loads, parallel DML (PDML) such as parallel updates) are reported as direct path events instead; for those, tuning the PGA_AGGREGATE_TARGET parameter may reduce the waits.

(3) "db file scattered read" -> waits of 138,824 and time of 2,334/s.

This generally happens during a full scan of a table. You can use the Statspack report to help identify the query in question and fix it.

Thanks,
Srikanth

Consty · ‎11-21-2007

Thanks so much to all of you,

Ivan, you'll find attached the output you asked for, for server A. I am going to send the one for server B in the next message.

Regards

Consty

Consty · ‎11-21-2007

Following up on my previous message:
Correction, the previous file was for server B (active node); here is the one for server A.
Thanks
Regards

Consty

Consty · ‎11-21-2007

Ivan,
This is the text version of the "collect" output. I think it's simpler that way.
Thanks and Regards
Consty

Consty · ‎11-21-2007

Hi Vladimir,
Yes, the second machine was rebooted many times.
Regards
Consty

Ivan Ferreira · ‎11-21-2007

Obviously, on nodeA, ecallprog running under ngominf is taking all the CPU. Is this normal? What does this program do?

And nodeB has too much CPU used in system time. This is not normal; I have seen this behaviour when too much traffic goes between the nodes via the interconnect, or when the system is paging/swapping. In your case it seems that the system is not paging.

What is the output of drdmgr for all your data disks? Do both nodes have direct I/O to the disks?

Is your application trying to access "cross" database information frequently?

Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?

Consty · ‎11-21-2007

Hi Ivan,

I'll give you more information when I go back to the site. In the meantime: nodeB is the active node while nodeA is passive. Both servers access the same MSA disk bay directly. You have spotted the very problem, i.e. nodeB has too much CPU used in system time.
That is what I wanted to describe in my original message: "I am facing a problem of very slow response from system Tru64,4 CPU. The load is always high around 100%Cpu"
I do not know what the problem is. I suspect the installation the DBA did one day before the problem occurred.

Thanks
Consty

Rob Leadbeater · ‎11-21-2007

The only application installed was the management agents for 10gR2.

These aren't essential to the database running, so as a first port of call, uninstall them and/or disable them and see if things improve...

Cheers,

Rob

Consty · ‎11-22-2007

Hi all,

Rob,
We disabled everything but saw no improvement. I don't know if the kernel has changed.

Ivan,
-ecallprog is a program processing mobile telephone calls.
-Yes, the programs are accessing the database information frequently
-Output of drdmgr attached

Thanks
Regards
Consty

Hein van den Heuvel · ‎11-22-2007

Consty,

Thanks for the collect output in text format as well as the Statspack report.

There is heavy Oracle load, but not excessive it seems. Oracle, and its usage can probably be improved
- review /recode the select count(*) queries
- double the SGA buffer space, as it can use it, and the memory is there.

But that would not have changed with a reboot, nor would it cause the high system time. A common cause for this is paging and swapping but that does not seem to be an issue here.

Be sure to scan the boot records (UERF -R? /var/adm/messages?) for errors during the boot. Did you keep a (virtual) console log? Maybe some sysconfigtab setting was edited erroneously and did not take effect?
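
A quick first pass over the messages file could look like the sketch below. The function name and the keyword list are assumptions, not anything Tru64-specific; adjust the pattern to what your system actually logs.

```sh
# List lines (with line numbers) in a log file that mention common trouble keywords.
# The keyword list is an assumption; extend it as needed.
scan_boot_log() {
    grep -i -n -E 'error|fail|panic|warning' "${1:-/var/adm/messages}"
}

# Example:
#   scan_boot_log /var/adm/messages | more
```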

I would recommend diving into that, using tools to see exactly where the system time is used.

For example, kprofile:

http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/MAN/MAN1/0658____.HTM

Or better still, DCPI:
http://h30097.www3.hp.com/dcpi

I would probably also use a 'truss' (SYS V extensions CD), or (s)trace to get a system call trace for one of those 'sl' processes.

Finally, my WAG is that something is amiss with the network settings.

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting

Rob Leadbeater · ‎11-22-2007

Hi Consty,

> I don't know if the kernel has changed.

Take a look at the timestamp on /vmunix to see when the kernel was built. Note that on a cluster this should be a CDSL (context-dependent symbolic link) to the boot partition, so you'll have to follow that link to get to the actual vmunix file...
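
As a rough sketch, the two ls calls below show the link first and then the file it resolves to, whose timestamp is the one that matters. The function name is made up; -L is the standard ls option for following symbolic links.

```sh
# Show the link itself, then the file it finally resolves to (with its timestamp).
kernel_build_time() {
    k=${1:-/vmunix}
    ls -l "$k"     # the symbolic link, if it is one
    ls -lL "$k"    # -L follows the link to the real kernel file
}

# Example:
#   kernel_build_time /vmunix
```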

Cheers,

Rob

Consty · ‎11-22-2007

Thanks,
In addition, do you advise me to restore the system in case we do not find anything? I know other Unix systems but not Tru64. How do I restore a system on a TruCluster?
Regards
Consty
