Operating System - HP-UX

Dave Chamberlin
Trusted Contributor

System CPU skyrockets until oracle shutdown needed.

Last weekend we upgraded a 9.2 database to 10g (10.2.0.4) on HP-UX 11.23. Had a problem today with the load climbing from its normal value (0.5) to 18 in about 20 minutes. Could not see any system limits (using Glance) being reached on the box. No disks out of space. Plenty of RAM available. No errors in the Oracle logs. More and more concurrent requests running slower and slower. Any ideas?
Patrick Wallek
Honored Contributor

Re: System CPU skyrockets until oracle shutdown needed.

It appears that you were looking at the CPU load, not CPU utilization. A load of 18 means that there were 18 jobs in the run queue.

What was the actual CPU usage at the time?

What did 'swapinfo -tam' show?

How was the storage behaving? Any problems with lots of WIO for processes?

Were there any large reports running? Had anything changed, other than the new DB version? Are all indexes valid? Perhaps some job was trying to do a full table scan of a very large DB table?
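For the index question, a quick check along these lines (just a sketch, run from SQL*Plus as a user with access to the DBA views) should show anything the upgrade left unusable:

SELECT owner, index_name, status
  FROM dba_indexes
 WHERE status NOT IN ('VALID', 'N/A');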

Lots of questions and few answers at this point.
Dave Chamberlin
Trusted Contributor

Re: System CPU skyrockets until oracle shutdown needed.

The normal system load with about 1200-1400 processes (per top) is about 0.5, and average utilization is about 20%. I have been on this box for a couple of years now; behavior is very consistent. Oracle jobs appeared to be stalled, rather than some rogue processes running away. I do see those from time to time, but they are easily identified and dealt with. I saw nothing consuming a lot of CPU by itself (excepting vxfsd); nothing was completing, and disk I/O was normal (using GlancePlus). As processes backed up in the queue, the load climbed to 18 (36x the normal load). Things were running slowly enough that I had to bounce the database. Regular performance followed. I am fairly certain this is related to the Oracle 10g upgrade.
TwoProc
Honored Contributor

Re: System CPU skyrockets until oracle shutdown needed.

Did you run statspack snapshots over the period of decaying performance? It would be interesting to see what was in there.
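If statspack is installed (spcreate.sql sets it up if not), the rough sequence from SQL*Plus as the perfstat user is something like this; paths may differ on your install:

SQL> exec statspack.snap;
-- wait while performance decays, snap again, then report on the interval:
SQL> exec statspack.snap;
SQL> @?/rdbms/admin/spreport.sql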
We are the people our parents warned us about --Jimmy Buffett
TwoProc
Honored Contributor

Re: System CPU skyrockets until oracle shutdown needed.

Was context switching climbing rapidly? Just curious.

When you went to 10g, was anything key resized or newly created? New pools? More space in the buffer cache? A change to the auto PGA aggregate target? Was the number of I/O slaves or DB writers kept consistent? What about redo log interleaving? Redo log buffer size? sort_area_size for each connection (PGA param)?
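A quick way to compare those against the old 9.2 settings, from SQL*Plus as a DBA user (a sketch; check whatever else changed in your pfile/spfile too):

SQL> show parameter shared_pool_size
SQL> show parameter db_cache_size
SQL> show parameter pga_aggregate_target
SQL> show parameter workarea_size_policy
SQL> show parameter sort_area_size
SQL> show parameter log_buffer
SQL> show parameter dbwr_io_slaves
SQL> show parameter db_writer_processes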
We are the people our parents warned us about --Jimmy Buffett
Dave Chamberlin
Trusted Contributor

Re: System CPU skyrockets until oracle shutdown needed.

There were some required parameter changes in going to 10g. I did add 300M to the shared_pool, bringing it up to 1900M. Did not change anything related to redo, writers, etc. In our testing we determined (and set) a number of parameter changes related to the CBO. I did not have a statspack snap before I downed the instance, so I have nothing there. Context switches - what would I look for? If it happens again, would a statspack snapshot taken during the problem be useful, or is there something better?
It seems like a lock issue of some sort. I have had lots of application locks before, but those cause concurrent processes to pile up and the load increases, just not so dramatically. I am attaching a graph of system load, etc. for the five hours of the morning leading up to the problem.
TwoProc
Honored Contributor

Re: System CPU skyrockets until oracle shutdown needed.

What size are your shared memory segments in the kernel tuning params? If these are too small, you can lose concurrency. I don't know if it's still true, but the rule used to be no more than 7 shared memory segments for the SGA, so I always make sure I've got 6 or fewer.
The stated reason was that the Oracle DB can only keep 7 segments active at the same time; past that you get queued up while the system picks which segments it can address and which it can't.

I like to make sure that my shmmax is at least 1/5 of the SGA size to keep me out of trouble, which leaves a bit of fudge away from the 1/7 limit if things grow over time (and they always do, somehow).
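To see where you stand, compare the SGA size against the kernel setting and count the segments Oracle actually grabbed - roughly like this on 11.23 (kmtune instead of kctune on older releases; commands may vary with your install):

SQL> show sga
SQL> !kctune shmmax
SQL> !ipcs -m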

Re: statspack. You don't need a snapshot from before the upgrade; you need snapshots while things are good, as they decay, and when everything is bad. After you bring the server back up, you can run your reports and see what's killing you. You can also see how much CPU time you used, how much disk I/O time, etc. The report will also point out the top running SQL statements and how much CPU, I/O, parse, execution, etc. time each one took. By comparing over time, you can see if something is trending. If the top SQL remains the top SQL (and it should), but the waits on CPU have gone up, or the waits on I/O are going up, it gives you something to look for.

Also watch your buffer cache hit ratios and your shared pool misses.
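Outside of the statspack report, you can spot-check both from SQL*Plus with something like this (cumulative since instance startup; cache hit ratio is roughly 1 - physical reads / (db block gets + consistent gets)):

SELECT name, value
  FROM v$sysstat
 WHERE name IN ('db block gets', 'consistent gets', 'physical reads');

SELECT namespace, gethitratio, pinhitratio
  FROM v$librarycache;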

If the SQL statements change totally per hour, then maybe these newer SQL statements have been "detuned" with a bad execution plan from the 10g upgrade. I saw quite a bit of this very thing, especially in the Oracle workflows, which were so critical they tended to slow a lot of other things down. It required hinting the code pieces to use indexes the way I wanted them to, and in some cases, where tables changed much more rapidly than stats could be gathered on them (complete fills, then almost complete deletion of rows), we just switched them to rule-based via hinting. In other cases we created a new index that helped things along and the optimizer chose it. The old plan, while not perfect, ran fine without the new index; the new plan ran poorly, but all we had to do was create an index the optimizer liked and it picked it up. An illustration of the hinting is below.
The number of occurrences was small (maybe 10), but some of them were very critical.
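Purely for illustration (the table, column, and index names below are made up), the kind of hinting I mean looks like:

-- force a specific index where the 10g CBO picked a bad plan
SELECT /*+ INDEX(t ord_status_n1) */ order_id, status
  FROM orders t
 WHERE status = 'OPEN';

-- or drop a volatile table back to rule-based
SELECT /*+ RULE */ item_id
  FROM staging_items
 WHERE batch_id = 42;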

What about lock waits? If you're seeing those, then it could be that even just the few SQL statements that need tuning are keeping some key blocks so busy that everything else has to wait, and the queue builds up.
I've seen cases where someone creates a new index to help "tune" a problem, and they end up massively slowing everything down. Reason: they created an index that needs to be updated, but left the initrans at the default of 2, while 15 updates at a time were trying to hit portions of that table. Raising the initrans of the new index, and reviewing the initrans of the rest of the related tables and their indexes, freed up the enqueued lock waits.
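For reference, finding the blockers and bumping initrans looks roughly like this (the index name is a placeholder; an INITRANS change only affects newly formatted blocks, so a rebuild may be needed to cover existing ones):

-- who is blocking whom right now
SELECT holder.sid blocker, waiter.sid waiter
  FROM v$lock holder, v$lock waiter
 WHERE holder.block = 1
   AND waiter.request > 0
   AND holder.id1 = waiter.id1
   AND holder.id2 = waiter.id2;

ALTER INDEX new_tuning_idx INITRANS 10;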
I've also seen things like this that point not to ONE thing, but to the need to tune lots of things.
Since you're queueing up, I'd review statspack for SQL that is running longer and is in the top list of waits on CPU, for starters. After that, go for the top SQL statements by number of executions, because one statement running very quickly but millions of times per hour can eat up a lot of CPU. I've seen this when folks used to object-oriented programming push code down to the lowest level possible, many times re-running code over and over that has already been fully resolved, or could be resolved with the same query at a much higher level in the code. For instance, getting, say, the inventory_organization_id of a collection of data for each node of the data, even though that value is relevant and fully defined at the whole-collection level. Now suppose the thing suddenly runs a little slower and it was never noticed before: re-executing that called piece of code creates a whole new cursor instance, does a whole new fetch, and returns the result on the stack each and every time, for a million nodes. Suddenly your little piece of code that used to cost "1" or "2" now costs "only 8", but it and a small pile of its poorly performing friends are collectively killing the server.
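From the live instance (or the same idea across statspack snaps), something like this pulls the top statements by executions - a sketch; in 10g the v$sqlarea time columns are in microseconds:

SELECT *
  FROM (SELECT sql_id, executions, buffer_gets, disk_reads,
               cpu_time / 1000000 AS cpu_secs
          FROM v$sqlarea
         ORDER BY executions DESC)
 WHERE ROWNUM <= 10;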

Also, look for long-running (plain ol' clock time) queries, and see if they now a) have bad execution plans and/or b) are sorting on disk instead of in memory.
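The in-memory vs. on-disk sort counts are easy to spot-check (cumulative since startup):

SELECT name, value
  FROM v$sysstat
 WHERE name IN ('sorts (memory)', 'sorts (disk)');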

As for context switching, you can see it in Glance and PerfView; "sar -w" will show it, as will vmstat.

You can also determine context switching on a long running process with a piece from the following web page: http://christianbilien.wordpress.com/2007/04/

An Excerpt: (uses the os pid number of the Oracle shadow process as ospid, assuming a client/server setup and not a local or bequeathed connection).
----------------------------------
Oradebug helps diagnose the LGWR forced and voluntary context switches:

SQL> oradebug setospid 2022
Oracle pid: 6, Unix process pid: 2022, image: oracle@mymachine12 (LGWR)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug procstat
Statement processed.
SQL> oradebug tracefile_name
/oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc
SQL> !more /oracle/product/10.2.0/admin/MYDB/bdump/mydb_lgwr_2022.trc
...
Voluntary context switches = 272
Involuntary context switches = 167
----------------------------

Hope some of this helps, let us know what y'all are seeing.
We are the people our parents warned us about --Jimmy Buffett
Dave Chamberlin
Trusted Contributor

Re: System CPU skyrockets until oracle shutdown needed.

We believe the problem was caused by a massive group of objects being put into an invalid (uncompiled) state. There were a few thousand objects with invalid status; after recompiling them, everything is back to normal. Now that we know what happened, we can prevent future issues. Thanks for all the good suggestions!
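For anyone who hits the same thing after a 10g upgrade, one way to check and bulk-recompile is roughly this (run as a DBA; the utlrp.sql path may differ on your install):

SELECT owner, object_type, COUNT(*)
  FROM dba_objects
 WHERE status = 'INVALID'
 GROUP BY owner, object_type;

SQL> @?/rdbms/admin/utlrp.sql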
TwoProc
Honored Contributor

Re: System CPU skyrockets until oracle shutdown needed.

Makes sense - however, I've not seen that cause a system to "slowly" degrade; more like instant degradation, non-stop until fixed. Also, lots of errors in applications everywhere should have been evident almost instantly.
We are the people our parents warned us about --Jimmy Buffett