Operating System - HP-UX

Timothy Czarnik
Esteemed Contributor

How to get further info on a process that is hanging a cpu

Hey all,

Two days ago we started getting an Oracle process that suddenly grabs 100% of a cpu and holds onto it. This is hanging the Oracle database and not allowing further Oracle access. We can't connect to Oracle at all until we kill this process at which time the database becomes available again.

I know that there is a ton to look at as far as SQL and things like that. I'm interested to know what other deep, dark info I can pull from the system about a specific process. Nothing is being logged to any system logs about this as it's an Oracle spawned process.

Any hints on getting further information about a process given its PID?

Thanks in advance!

Tim
Hey! Who turned out the lights!
7 REPLIES 7
Robert-Jan Goossens
Honored Contributor

Re: How to get further info on a process that is hanging a cpu

HI Tim,

Try running tusc (or truss) against the process to see which system calls it is making.

http://hpux.connect.org.uk/hppd/hpux/Sysadmin/tusc-7.7/

Regards,
Robert-Jan
harry d brown jr
Honored Contributor

Re: How to get further info on a process that is hanging a cpu


Do you have glance/measureware?

lsof ?

Is this an application written in house?

live free or die
harry d brown jr
Live Free or Die
Timothy Czarnik
Esteemed Contributor

Re: How to get further info on a process that is hanging a cpu

We do not have Glance or Measureware. I'm not even sure what lsof is to be blatantly honest.
Hey! Who turned out the lights!
harry d brown jr
Honored Contributor

Re: How to get further info on a process that is hanging a cpu


it lists open files, sockets, ... for a process

http://hpux.ee.ualberta.ca/hppd/hpux/Sysadmin/lsof-4.74/

live free or die
harry d brown jr
Live Free or Die
Jack C. Mahaffey
Super Advisor

Re: How to get further info on a process that is hanging a cpu

If you do an oracle trace you should get an idea of which sql is causing the problem.

For example, assume PID causing problems is 2311.

Start a sqlplus session and execute the following:

>select spid, addr, serial#, terminal from v$process where spid = '2311';

Make a note of the value for ADDR. It is a raw (hex) value, so quote it when you use it. Assume the value returned is 3234.

Execute the following:
>select * from v$session where PADDR = '3234';

Record the values for SID and SERIAL#.

Assume SID is 322 and SERIAL# is 31223.


You are now ready to turn on tracing.

Execute the following:

>execute sys.dbms_system.set_sql_trace_in_session (322, 31223, TRUE);


A trace file will now be created in your Oracle admin/udump directory. I think the filename will contain the PID of the Oracle process.

Wait a few minutes and then turn off tracing.

Here's sql for turning off tracing.

>execute sys.dbms_system.set_sql_trace_in_session (322, 31223, FALSE);


Now run tkprof to see the sql that executed.

ex:
% tkprof sys=no explain=sys/@ sort='(prsela,exeela,fchela)' print=10

For more information on tracing see the oracle web site.

When you look at the oracle trace report file you should notice some sql statements that are heavy hitters. The next step would be to trace the sql to the application.
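The two lookup queries above can also be combined into a single join, so the ADDR value never has to be copied by hand. The following is only a sketch: the function name session_lookup_sql is made up for illustration, and the extra USERNAME and PROGRAM columns are added for convenience. It just prints the SQL; nothing is sent to the database.

```shell
# Print a SQL*Plus query that maps an OS process ID to the Oracle
# session identifiers (SID, SERIAL#) needed to turn on tracing.
# Usage: session_lookup_sql OS_PID
# (session_lookup_sql is a hypothetical helper name, not an Oracle tool.)
session_lookup_sql() {
    # Accept only a plain decimal PID.
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: session_lookup_sql OS_PID" >&2
            return 2
            ;;
    esac
    # v$process.spid holds the OS PID as a string, and v$session.paddr
    # points at the matching v$process.addr, so the two views can be joined.
    printf '%s\n' \
        'select s.sid, s.serial#, s.username, s.program' \
        '  from v$session s, v$process p' \
        ' where s.paddr = p.addr' \
        "   and p.spid = '$1';"
}
```

Assuming you can still get a privileged connection and that OS authentication is set up, the output could be piped straight in, for example: session_lookup_sql 2311 | sqlplus -s "/ as sysdba"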

jack
Michael Parow
New Member

Re: How to get further info on a process that is hanging a cpu

Hi Tim,

I have had exactly the same issue here.

Previous replies have requested you do this or that to the database, but, as you wrote, the database is hung while the process is spinning.

I have used truss, only to find no system call activity at all. Oracle logging has not produced any smoking gun.

The only solution I have found is to bounce the database. Unfortunately, the problem will occur again.
Stuart Whitby
Trusted Contributor

Re: How to get further info on a process that is hanging a cpu

truss/tusc and lsof should be used together to identify what system calls a process is making. Given that this is using 100% CPU, I'd guess it's stuck in a loop. Truss should confirm that, since you'll see the same operations repeated over and over on a very regular basis.
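One rough way to spot that kind of repetition is to save the trace output to a file and count how often each system call name appears. This is a sketch only: it assumes each traced call shows up on its own line as name(args..., possibly after a PID or timestamp column, and count_syscalls is a made-up helper name.

```shell
# Count how often each system call appears in saved trace output.
# Reads trace text on stdin; prints "count name", most frequent first.
# (count_syscalls is a hypothetical helper; the line format is assumed.)
count_syscalls() {
    awk '
        # Take the first identifier that is directly followed by "(".
        match($0, /[A-Za-z_][A-Za-z0-9_]*\(/) {
            name = substr($0, RSTART, RLENGTH - 1)
            count[name]++
        }
        END {
            for (n in count) print count[n], n
        }
    ' | sort -k1,1nr -k2,2
}
```

With the trace saved in a file (trace.out is just a placeholder name), run: count_syscalls < trace.out | head. One or two calls dominating the list would fit the stuck-in-a-loop theory. An empty result would mean the process is spinning without making system calls at all.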

What you get with the combination is a case of tusc showing:

write(18,.......

Now using "lsof -p " you can see what that file descriptor refers to. In this case, it's likely to be a socket with a connection to another Oracle process. You'll get the socket information there, so use "lsof | grep " to identify the process on the other side.

Depending on what's going on, you may want to see what that other process is doing using the same commands.
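To go from a descriptor number seen in the trace (18 in the example above) straight to the matching lsof line, lsof's -d option can be combined with -p. A minimal sketch, assuming lsof is installed and on your PATH; fd_info and the LSOF override variable are names invented here.

```shell
# Show what one file descriptor of a process refers to.
# Usage: fd_info PID FD
# Set LSOF to the full path of the lsof binary if it is not on PATH
# (fd_info and LSOF are hypothetical names used for this sketch).
fd_info() {
    if [ $# -ne 2 ]; then
        echo "usage: fd_info PID FD" >&2
        return 2
    fi
    # -p selects the process and -d selects the descriptor; -a ANDs the
    # two selections together (without -a, lsof ORs them).
    "${LSOF:-lsof}" -a -p "$1" -d "$2"
}
```

For example, fd_info 2311 18 would list only descriptor 18 of process 2311. The NAME column of that line shows the file or socket endpoints involved.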

Can't remember where you get tusc offhand (one of the unsupported HP sites), but lsof is available from vic.cc.purdue.edu/pub/outgoing/tools/lsof (from memory).

The main problem you're likely to have with this is that the process is likely to generate a *LOT* of system calls at 100% CPU. As for actually fixing this issue, I'd get back to whoever wrote the process in the first place and ask them what it's doing. The list of open files (lsof) may well be a big help to them in identifying why it's going wrong, since it may be getting screwy information at some point causing it to do all sorts of extra work. I'd also recommend checking out any reads and writes on other sockets since the process may be doing all sorts of work on behalf of another process, and it would be good to know what that's doing.

Get a copy of Ethereal on there as well (www.ethereal.com). This will allow you to look at the contents of the packets which are passed between the various processes. Now that truss has shown you what ports the process is talking on, lsof has told you who it's talking to, Ethereal can show you what it's talking about. Look at the packet contents and try to reconstruct what's going on. You'll probably spot something daft. The only problem is if the only daft stuff is right at the start of the problem and you come into this once the problem's occurred - you may not catch this without some serious effort...
A sysadmin should never cross his fingers in the hope commands will work. Makes for a lot of mistakes while typing.