
Re: isr.ior TOC panic

 
Jakes Louw
Trusted Contributor

isr.ior TOC panic

Hi guys,

One of our V-class servers had a bit of a panic, and HP here can't seem to diagnose it.
Here's some detail. Anybody have a clue?

Reboot after panic: isr.ior = 0'240020.0'cc730dd0

q4> trace event 0
stack trace for event 0
crash event was a TOC
preArbitration+0x2e4
wait_for_lock+0x120
sl_retry+0x1c
pset_get_num_spu+0x18
pset_idle_loop+0x70c
idle+0x114
swidle_exit+0x0
Trying is the first step to failure - Homer Simpson
melvyn burnard
Honored Contributor

Re: isr.ior TOC panic

Reboot after panic: isr.ior = 0'240020.0'cc730dd0

Usually indicates either an HPMC, a user induced TOC, or a Serviceguard TOC.

As the q4 output says it is a TOC, I would suggest that someone TOC'ed the system, or that this is one of the very rare cases where a hardware issue caused the TOC.

I would keep chasing your local HP Response Centre to look into this.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

Hi Melvyn

We can rule out the human error.

So that leaves our great friend MC/SG.

What is the accepted code level for MC/SG on 11i these days?
11.15? and 2.02 for the API?
Trying is the first step to failure - Homer Simpson
Steve Steel
Honored Contributor

Re: isr.ior TOC panic

Hi


Not a lot to go on.

Do you use Veritas?

What is your patch level?

Is there a tombstone?

A good patch level is especially important.

Steve Steel
If you want truly to understand something, try to change it. (Kurt Lewin)
Mohanasundaram_1
Honored Contributor

Re: isr.ior TOC panic

Hi Jakes,

If this is a Serviceguard TOC, then you need not worry about the TOC itself. You need to check syslog.log (or OLDsyslog.log) on both servers to determine whether there was a cluster reformation and, if so, why.

My guess is that you had a problem with the cluster heartbeat, so one of the nodes obtained the lock and the other one TOC'ed. This is proper Serviceguard behaviour.
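
For anyone who wants to pull just the cluster daemon's entries out of the syslogs before reading them, here is a rough sketch of a filter. It assumes the classic syslog line layout ("Mon DD HH:MM:SS host tag: message") and that the Serviceguard daemon logs under the tag cmcld; the helper name daemon_lines is made up for illustration. Copy the syslog.log and OLDsyslog.log files from each node somewhere and pass their paths on the command line.

```python
import sys


def daemon_lines(lines, daemon="cmcld"):
    """Return the syslog lines whose tag field starts with the daemon name.

    Assumes the classic layout: "Mon DD HH:MM:SS host tag[pid]: message".
    Lines that do not have at least five fields are skipped.
    """
    hits = []
    for line in lines:
        fields = line.split(None, 5)
        # fields[4] is the tag, e.g. "cmcld:" or "cmcld[1234]:"
        if len(fields) >= 5 and fields[4].startswith(daemon):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Every path given on the command line is scanned in turn.
    for path in sys.argv[1:]:
        try:
            with open(path, errors="replace") as fh:
                for hit in daemon_lines(fh):
                    print(f"{path}: {hit}")
        except OSError as exc:
            print(f"{path}: cannot read ({exc})", file=sys.stderr)
```

This only narrows the logs down; whether a reformation happened, and why, still has to be read from the messages themselves on every node that was up at the time.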

If that is not the case, then I would suspect your patch levels. The q4 trace seems to indicate some kind of deadlock (I am not sure). Get the system to a good patch level, and involve HP to analyse the dump.

If it is an HPMC, then you need to involve HP. I presume there is no HPMC, as you have indicated "HP here can't seem to diagnose."
I am sure an HPMC is the first thing they would have checked for, and they found nothing. So they might be looking for problems in other areas.

Cheers,
Mohan.
Attitude, Not aptitude, determines your altitude
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

No veritas, June 2002 GoldQPK. No tombstone.

Still running Cluster Monitor A.11.13, Cluster API A.01.03.
Trying is the first step to failure - Homer Simpson
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

Hi Mohanasundaram

I'm pretty convinced this is MC/SG. I've seen this previously: when working on a cluster with the accepted commands (cmhaltpkg, cmhaltnode), one of the other nodes decides that it needs some type of quorum, but instead of just removing itself from the cluster it panics.

Hence my question on the latest trusted version of CM.
Trying is the first step to failure - Homer Simpson
Steve Steel
Honored Contributor

Re: isr.ior TOC panic

Hi


Up your patch level.

There have been a lot of changes in 2 years.

Also upgrade your software from 11.13 to 11.14 if you can.


Steve Steel
If you want truly to understand something, try to change it. (Kurt Lewin)
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

Hi Steve

I hear what you say about the patches, but due to the sensitivity of the servers in question (telco billing production servers), we have to integrate a patch update into a test stream, and are constantly lagging 12 months behind. Our next patch update is scheduled for Oct 2004, and we will only be able to certify on Dec2003 GoldQPK. See my problem? I have to pinpoint a patch so that I can motivate a fix-on-fail install.

But thanks for the responses: I sort of suspected MC/SG all along.
Trying is the first step to failure - Homer Simpson
Mohanasundaram_1
Honored Contributor

Re: isr.ior TOC panic

Hi Jakes,

I see your point. But if this is a telecom environment, then the vendor would have already given you the recommended and tested versions.

Is it Nokia/Lucent/Logica who has given you this solution? They will provide you the details you wanted to know.

Cheers,
Mohan.
Attitude, Not aptitude, determines your altitude
Dietmar Konermann
Honored Contributor

Re: isr.ior TOC panic

Jakes,

As Mohanasundaram already told you, if this TOC was caused by an expired safety timer then you need to look at the syslogs of all cluster nodes first (those that were active at dump time, of course). You may also check whether there is a core dump in /var/adm/cmcluster; cmcld may have crashed if the patch level is old.

Best regards...
Dietmar.
"Logic is the beginning of wisdom; not the end." -- Spock (Star Trek VI: The Undiscovered Country)
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

Hi Mohanasundaram and Dietmar

Firstly: patch level Dec2003 is currently being tested in partnership with vendor/supplier. They will only certify this way since we are running a modified source of their app. We're talking about $13 million for each bill cycle, and there are 8 bill cycles per month. Do the math.....not a system to play with!
If I cause a revenue loss, I think beheading will be my choice of punishment!

On the side of CM: no core dump, but as I said previously, a definite activity on OTHER nodes on the same cluster.
Trying is the first step to failure - Homer Simpson
Mohanasundaram_1
Honored Contributor

Re: isr.ior TOC panic

Hi Jakes,

I understand that the Telco systems are critical. We are not asking you to experiment with this system either.

We do not want your head on the chopping block :). But looking at some of the symptoms, it looks like somebody has already put your head on the block.

You said "no core dump". Was it not configured? Or was there inadequate space to dump? If the server is so critical, why are such things not monitored?
Then where did you run your q4?

Are you sure this TOC was not the result of a genuine network problem? That's what all the respondents here want to ascertain. Can you share with us what you found in syslog.log?

I am sorry if I could not be of much help to you.

Cheers,
Mohan.
Attitude, Not aptitude, determines your altitude
Dietmar Konermann
Honored Contributor

Re: isr.ior TOC panic

Mohanasundaram, Jakes was talking about the cmcld core I was looking for. :-)

Jakes, I didn't ask you to play with this cluster. I just asked for syslog contents... up to now we only know of "a definite activity on OTHER nodes". Sorry, that is not enough to tell anything. Currently we are all reading tea leaves. Please post syslog extracts that show the history of the reformation (which must have happened when one of the nodes died).

Best regards...
Dietmar.
"Logic is the beginning of wisdom; not the end." -- Spock (Star Trek VI: The Undiscovered Country)
Mohanasundaram_1
Honored Contributor

Re: isr.ior TOC panic

:-)

Oops! I missed that, Dietmar.

Jakes, I hope you found the root cause by this time. If so, just share it with us.

Cheers,
Mohan.

P.S. Just call me "MOHAN".

Attitude, Not aptitude, determines your altitude
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

Dietmar

I'll post the syslog stuff in a couple of days: up to my eye-balls in work @ the moment.
Trying is the first step to failure - Homer Simpson
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

Guys, sorry about the 17 month delay in closure: I resigned, and I'm currently working for Sun....:-O
Anyway, I gave some points....
Trying is the first step to failure - Homer Simpson
Mohanasundaram_1
Honored Contributor

Re: isr.ior TOC panic

Hi Jakes,

Don't tell me you resigned due to this problem :-)

With regards,
Mohan.
Attitude, Not aptitude, determines your altitude
Jakes Louw
Trusted Contributor

Re: isr.ior TOC panic

No, lovely job contracting.....;->
Trying is the first step to failure - Homer Simpson