Re: System anomalies

Jim Mallett · ‎01-18-2005

Sorry, couldn't think of a subject because I don't know what avenue to pursue with this.
For about 3 weeks I've been having a problem with an L2000. The main problem is it's not leaving me any clues as to what the issue is, or I'm just not picking them up.
System Info:
2 processor L2000 running 11i, w/ 8G ram. System runs 3 Oracle instances (2x 9.2.0, and 1x 8.1.7) and Data Protector (a couple other lightweight apps).
The server has been functioning w/o issue for 3 years. There has been slightly more traffic on them, but we're only talking 6-10 users.
Over the past three weeks, there have been 4 occasions when the databases would not shut down. I made sure there were no external connections to them. Only a shutdown abort would allow them to close. There have also been 3 occasions where for no apparent reason the WebDB listener would stop listening on port 2000. The process was still running, but netstat showed nothing on 2000. I've checked all the log files and there is NOTHING from an Oracle standpoint logged. I found one similar post from a couple years ago but no indication of how he resolved it.
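
The "process is running but nothing is on port 2000" symptom can be checked from a script. The following is a minimal sketch, assuming `netstat -an` prints TCP lines as protocol, two queue counters, local address, foreign address, and state, with the port after the last dot (as on HP-UX) or colon. The function name and the sample text are made up for illustration.

```python
def is_listening(netstat_output, port):
    """Return True if a TCP socket is in LISTEN state on the given local port."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local-address foreign-address state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        # Normalise "host:port" to "host.port", then take what follows the last dot.
        local_port = local.replace(":", ".").rsplit(".", 1)[-1]
        if state == "LISTEN" and local_port == str(port):
            return True
    return False


# Made-up sample in the assumed format (not taken from the attachment).
sample_netstat = """\
tcp        0      0  *.1521                 *.*                     LISTEN
tcp        0      0  *.2000                 *.*                     LISTEN
tcp        0      0  10.0.0.5.2000          10.0.0.9.51234          ESTABLISHED
"""

print(is_listening(sample_netstat, 2000))
```

Run from cron against real `netstat -an` output, a check like this could record the time the listener stops accepting connections, which would help line it up with entries in the listener log.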

I've attached (kmtune, swapinfo, ipcs, shminfo). The only thing I can see is that the following kernel settings may not meet Oracle's specifications: MAXUPRC, MAXSEG, SHMMAX. One SGA is getting close to 1G, and my SHMMAX is at 1G so I'm going to up that to 2G this evening.
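
For the SHMMAX question, a rough sanity check is to compare each SGA against the segment limit. The sketch below is only arithmetic: it assumes an SGA larger than shmmax has to be spread over more than one shared memory segment, and the sizes used are made-up examples, not values from the attachment.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def min_segments(sga_bytes, shmmax_bytes):
    """Smallest number of segments of at most shmmax_bytes that can hold the SGA."""
    if sga_bytes <= 0 or shmmax_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-sga_bytes // shmmax_bytes)  # ceiling division on integers


# Hypothetical sizes: an SGA just under 1G, and one that has grown past it.
print(min_segments(950 * MIB, 1 * GIB))
print(min_segments(1200 * MIB, 1 * GIB))
print(min_segments(1200 * MIB, 2 * GIB))
```

Comparing the result with the number of segments `ipcs` actually shows for each instance is one way to see whether an SGA has already outgrown the limit.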

Maybe somebody will see something glaring in the attachment. Maybe I should practice saying "Would you like fries with that?". Either way, ANY thoughts or ideas would be appreciated. It's worth all the beer you can drink at the next HP World (or next time you're in Boston).

Thanks...
Jim

Hindsight is 20/20

A. Clay Stephenson · ‎01-18-2005

I find it hard to believe that there is a problem related to Oracle patching across separate instances and releases; therefore, I would concentrate on the OS. When you say there are no external connections, that does not eliminate the local session connections.

You should do something like "select sid, username from v$session" to see what sessions are actually in play. I assume that shutdown immediate has no effect.

Normally, if Oracle is having problems with tunables, processes do not start and/or you see events in the alert logs.

The first question to ask is how recently has HP-UX been patched? Have you seen anything in syslog? Does the box seem otherwise normal?

If it ain't broke, I can fix that.

Patrick Wallek · ‎01-18-2005

How many processes are active on the system? How many oracle processes? Have you checked to see if you may be hitting a maxuprc limit when the listener stops (if it is running as oracle, or whatever your DB user is)?
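
One way to answer the maxuprc question is to count processes per owner from `ps -ef` output and compare the totals with the kernel limit. This is a minimal sketch, assuming the owner is the first column of each line and that root is exempt from maxuprc. The function names, the threshold, and the sample listing are made up for illustration.

```python
from collections import Counter


def procs_per_user(ps_output):
    """Count processes per owner from `ps -ef` text (first column is the UID)."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "UID":  # skip blank lines and the header
            continue
        counts[fields[0]] += 1
    return counts


def users_near_limit(counts, maxuprc, threshold=0.8):
    """Return non-root users whose process count is at or above threshold * maxuprc."""
    return {user: n for user, n in counts.items()
            if user != "root" and n >= threshold * maxuprc}


# Made-up sample listing with a deliberately tiny limit so the check fires.
sample_ps = """\
     UID   PID  PPID  C    STIME TTY       TIME COMMAND
    root     1     0  0  Jan 10  ?         0:05 init
  oracle  2101     1  0  Jan 10  ?         0:12 ora_pmon_FIN
  oracle  2103     1  0  Jan 10  ?         0:30 ora_dbw0_FIN
  oracle  2105     1  0  Jan 10  ?         0:08 ora_smon_FIN
"""

counts = procs_per_user(sample_ps)
print(counts["oracle"], users_near_limit(counts, maxuprc=4, threshold=0.75))
```

Logging these counts every few minutes would show whether the oracle user creeps toward the limit in the hours before the listener stops.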

Anything interesting in dmesg or syslog?

This is a bit of a puzzle......

By the way, I'd be careful with the beer offer, especially if Pete Randall comes up with a solution for you. ;)

Patrick Wallek · ‎01-18-2005

Another thought.....When you have problems shutting Oracle down, have you tried doing an 'fuser -cu' on your oracle filesystems to see what processes are still accessing them? You might give that a shot. If it's more than your basic ora_???? processes then you could have problems.

How long do you wait on the 'shutdown immediate' before doing a shutdown abort?

Steven E. Protter · ‎01-18-2005

You should try and make your kernel, especially maxuprc meet or at least slightly exceed oracle's guidelines.

I've seen issues where maxuprc was an issue with other oracle products.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Chris Vail · ‎01-18-2005

Try lsof on the filesystems where the oracle files are mounted and see what processes are keeping them open. Kill them yourself the hard way. You might have a runaway process that won't die unless you hit it with something hard like a shutdown abort or a kill -9.

Quite frankly, I don't think I've ever done anything except a "shutdown abort" when I wanted the system to come down any time in the current century. I always blame the resultant corruption on the DBA. This doesn't work, of course, if you ARE the DBA.

Chris

TwoProc · ‎01-18-2005

Jim, it's probably just one piece of crud that's holding you up. In another shell besides the one you're doing the shutdown in - look for attachments that are not local first. Just run "ps -ef | grep LOCAL=NO" and see what's left out there after you've already run the "shutdown immediate" command on the first screen. Start killing these off one at a time, until your system frees up. You should be able to look in $ORACLE_HOME/dbs/audit/xxx to determine where the connection to the process id that you killed off came from.
If the shutdown is still stuck after killing off the remote connections - start getting rid of the local ones one at a time - just don't get rid of the system processes! Anyway, what you're looking for first is LOCAL=NO processes. Keep knocking them down until you see what ails you.
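
The LOCAL=NO search can be scripted so the leftover PIDs are listed without the grep process itself showing up in the results. This is a minimal sketch, assuming the PID is the second column of `ps -ef` output. The function name and the sample lines are made up, and the sketch only reports PIDs; deciding which ones to kill stays a manual step.

```python
def remote_server_pids(ps_output):
    """Return PIDs of processes whose `ps -ef` line contains LOCAL=NO."""
    pids = []
    for line in ps_output.splitlines():
        if "LOCAL=NO" not in line or "grep" in line:
            continue  # ignore local sessions and the grep command itself
        fields = line.split()
        if len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


# Made-up sample lines (not taken from Jim's system).
sample_ps = """\
  oracle  3310     1  0 09:14:02 ?         0:01 oracleFIN (LOCAL=NO)
  oracle  3355  3354  0 09:20:11 ?         0:00 oracleFIN (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
     jim  4001  3999  0 22:30:00 pts/1     0:00 grep LOCAL=NO
"""

print(remote_server_pids(sample_ps))
```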

OK, onto other things that I've seen that can trip you up -> when you've done your shutdown, is "smon" still stuck out there? It could be that smon is trying to clean up all of the temp segments. When I saw it, it was a bug in 9i that got fixed in 9.2.0.5. Are you on a version of Oracle earlier than 9.2.0.5?

We are the people our parents warned us about --Jimmy Buffett

Jim Mallett · ‎01-18-2005

I love this place....

Patching: I keep the system patched with all recommended patches (that apply) every two months. I applied 9 patches on 12/26 but have since backed them out worried that they might have contributed/caused the issue.
Shutdown: shutdown immediate is the shutdown of choice here, that is what was getting stuck (not always, 4 times in 2-3 weeks). The shutdown immediate was just hanging, I had to login and do a shutdown abort and it completed w/o issue. Prior to the shutdown abort I did a ps for local=no and there were no "external" connections.
Logs: I have gone through every Oracle logfile I am aware of and nothing (until tonight). I noticed the webdb 2000_listener.log had a (errno 233) the night before the listener failed each time. Even after the error, the listener worked fine and users could connect, but eventually port 2000 would stop listening (8-10 hrs later). I will look into that this evening, I vaguely remember seeing a reference to errno 233 in a post I searched through.
As far as the syslogs go, nothing unusual. As far as the box goes, no other complaints other than the normal end-user performance complaints. When they occur I usually find that 3-4 people are generating large financial reports and the processors get a bit busy. Never any extended period of time though.
Patrick:
Processes: Each time I have checked it's about 400 processes, I didn't break out the Oracle count.
Beer: Even Pete would earn his bartab on this one Patrick. I'm getting burnt out staying up till midnight watching the BCVs and Oracle processes. Then getting up first thing to watch the AM processes. Although my scripting is improving! I've got checks for everything now.
Shutdown: The first couple of times the shutdown immediate was hung up till the next morning. Now I monitor it at night and give it no more than 1/2 hour. I haven't done a fuser yet, I think I focused on the ps -ef's.
Stephen:
Going to update the kernel this evening. Thanks to a previous response by you to someone else I noticed at least 3 settings were lower than recommended.
Chris/John:
I'm not the DBA, so I'm supposed to be "hands-off" with the DB except for the normal scripts. My limited (dangerous) knowledge of Oracle is what's gotten me by so far. I'm not getting any feedback from the DBA either though. He is telling me exactly what I'm seeing, no errors are being reported. But I keep digging. I'm running one instance at 8.1.7 (Financials) and two instances at 9.2.0.x. All are 64bit.

Although some of the supporting processes are running at 32bit. That's why initially I thought this may be a 32bit memory limitation issue or a shared memory issue.

First things first, I'm going to get the kernel updated now and cross my fingers. Then I'll look into that WebDB errno 233 issue. (I think that's a buffer issue)

I appreciate the time everybody has taken here, as I scroll up, I see I need to be less wordy.

Jim

Hindsight is 20/20

Gordon Morrison · ‎01-18-2005

Just a thought...
You don't say what this system is used for (or who by).
Are there any developers on this system? Are any of them doing anything with port 2000? As an ex-developer myself, I have learned that there is nothing quite like a developer to "muck up the gubbins" as it were.

What does this button do?

Stephen Keane · ‎01-18-2005

Jim, you might want to look at

http://forums1.itrc.hp.com/service/forums/questionanswer.do?admit=716493758+1106126606397+28353475&threadId=104237

regarding your WebDB error 233 (ENOBUFS)

Steve Lewis · ‎01-18-2005

More possibilities just in case:

1. As well as fuser, lsof on everything - check the sockets as well as file systems.

2. Is RMAN running? Does it have a device open, like the tape drive or something unexpected? You can use scsictl to check the device statuses.

3. Is it hanging on something it expects from another machine that is now down, such as replication of data in or out?

Rita C Workman · ‎01-19-2005

Hi Jim,

Looking down your kernel parms I might take a second look at:
maxfiles and maxfiles_lim
....I might increase the hard setting
maxuprc
....Like Stephen said, you need to increase this
ninode
....This is way too high, you're wasting buffer creating a table this big. Do a sar -v and note what you're really using, then tune accordingly, 2048,4096.
nproc
....might want to consider a little higher on this
npty,nstrpty,nstrtel
....not important, just a preference to set these beyond the default amount
semume
....look at this one please, you set your semmnu to 4092 and this to 10....may want to increase this.
shmmax
.....agree with you, please increase.
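
For the ninode and nproc suggestions, the peak values from `sar -v` can be pulled out with a short script. This is a minimal sketch, assuming the header row names the tables as `proc-sz`, `inod-sz` and `file-sz` and each data row reports them as used/size pairs. The function name and the sample numbers are made up for illustration, not taken from Jim's system.

```python
def peak_table_usage(sar_output):
    """Return {table: (peak_used, size)} from `sar -v` text with used/size columns."""
    header = None
    peaks = {}
    for line in sar_output.splitlines():
        fields = line.split()
        if "inod-sz" in fields:  # header row naming the columns
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue
        for name, value in zip(header, fields):
            if not name.endswith("-sz") or "/" not in value:
                continue
            used, size = value.split("/", 1)
            if not (used.isdigit() and size.isdigit()):
                continue  # skips entries such as N/A
            used, size = int(used), int(size)
            if name not in peaks or used > peaks[name][0]:
                peaks[name] = (used, size)
    return peaks


# Made-up sample in the assumed layout.
sample_sar = """\
10:00:00 text-sz  ov  proc-sz  ov  inod-sz  ov  file-sz  ov
10:00:05   N/A   N/A 402/4116   0  1180/34816   0  3051/65536   0
10:00:10   N/A   N/A 410/4116   0  1212/34816   0  3040/65536   0
"""

for table, (used, size) in sorted(peak_table_usage(sample_sar).items()):
    print(f"{table}: peak {used} of {size} ({100 * used / size:.1f}%)")
```

Collecting this over a full business day, including the reporting runs, gives a peak figure to size the tables against rather than a single snapshot.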

I do think you may have a problem in Oracle though...but eliminate any possibilities that may be O/S related first.
Your DBA guy sounds so familiar...

Just my thoughts,
Rita

Rita C Workman · ‎01-19-2005

Hi again Jim,

Found this thread with Stephen and Bill Hassell that I think is a good one to have around for shmmax....(and if you can get your DBA to read it, that would help too).

Rgrds,
Rita

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=65987

TwoProc · ‎01-20-2005

Jim, I'm a bit shocked that the database is hanging and your DBA is not the one assigned the responsibility of getting the problem nailed down. I don't know how you've managed to accept the responsibility (other than you've got a big heart and a lot of class) - but a database hang *is* an issue the DBA should be taking the lead on and working through with Oracle. Not saying you shouldn't be in there rendering all the assistance you can - you should. But this is clearly a DBA role.

If anyone thinks I'm overstepping my bounds here making a call on this - I apologize - but as senior staff member I'm the only one that's both DBA and sysadmin over here - so I'm comfortable telling you this.

Really - my $5 worth of free advice (which I acknowledge you didn't ask for - so I hope I'm not generating hard feelings by overstepping) - some serious leadership/initiative by your DBA is absolutely required here.

We are the people our parents warned us about --Jimmy Buffett

Jim Mallett · ‎02-01-2005

Just wanted to follow up on this one. I changed maxuprc to 3686 and shmmax to 2G. Although I didn't expect this to resolve the issues, they have not occurred since. The Oracle Admin may have made changes also, but I have not been made aware of them.

I agree with you John, unfortunately I'm just the lowly Unix Admin. The org chart here would look something like:
Directors --> Managers --> Supervisors --> Oracle Admin --> Mainframe Admins/Programmers --> Operators --> Doorman --> The Rug You Wipe Your Feet On When You Enter The Building --> Unix Admin
And that's on a good day.

I'm just doing my time. The positive thing is I picked up a pretty good handle of ipcs and sar in January. I've also positioned myself better to handle issues going forward with accounting and increased dump space.

Thanks to all that contributed.

Jim

Hindsight is 20/20
