Operating System - HP-UX

Lost in possible hardware problem

 
SOLVED
Jason Berendsen
Regular Advisor


N-class server running HP-UX 11.

Yesterday an ioscan -fn hung on this server and I was unable to kill it, even with the kill -9 hammer. Every ioscan run on the system after that hung as well. I power cycled the server and found that it panics after the alloc_pdc_pages section of the boot. Unfortunately, it reports that it was unable to take a dump. I am able to get the server to init state 3 after starting in maintenance mode. Thinking this was a hardware problem, I went into XSTM and found everything in the green. I did an information check on the 3 processors and found that one CPU had HPMC codes and unknown errors. No other CPU had these symptoms. I chose to disable this CPU from the BCH and reset. When the server came back up, it panicked once again. I again brought it up to init state 3 from maintenance mode and found that one of the other CPUs is now showing the same errors and HPMC codes. What can account for this shift? Is this a problem with the system board?

Thanks,

Jason
Rita C Workman
Honored Contributor

Re: Lost in possible hardware problem

More than likely you are experiencing some kind of hardware memory error here.

I'd recommend calling in HP Hardware.

Rgrds,
Rita
Rita C Workman
Honored Contributor

Re: Lost in possible hardware problem

I did a little looking around to see if others had had similar problems and found this thread interesting. It gives you a couple of other things to check first.

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=405806

Rgrds,
Rita


Jason Berendsen
Regular Advisor

Re: Lost in possible hardware problem

Tried once again to boot from a previous kernel, to no avail.
Had the hardware guys systematically disassemble this server to test the memory and CPUs individually.
Unfortunately, the system panics before it gets to the point where it can create a dump. All tests on these hardware pieces and on all PCI devices came up good. This leaves two big possibilities: a software problem or the motherboard. Does anyone know whether software is even involved at the point where I am getting the panic, right after the system reallocates the PDC (alloc_pdc_pages) during boot? Could software corruption cause this, or is this definitely a hardware stage of the boot?
Todd McDaniel_1
Honored Contributor

Re: Lost in possible hardware problem

What about your swap disk? You say your symptom was an ioscan hang...

I could believe the system is panicking when it can't allocate swap at boot time.

Also, a lot of systems are set up with dump/swap on the same device...

Not sure about this, but maybe your swap could have corrupted the memory/CPU...



-----------------------------------------------------------
I know this may be hindsight, but I have been burned by rebooting a box that was having trouble the way yours was....

I have found that I am better off leaving the box up if it will remain semi-stable. And then troubleshooting it from there.

Sometimes rebooting can add additional problems/symptoms which can hide the true source of the problem.
Unix, the other white meat.
Ashwani Kashyap
Honored Contributor

Re: Lost in possible hardware problem

If it is failing at PDC reallocation, it is relocating the PDC from ROM into RAM for faster access.

At this point two things can go wrong: either there is a memory hardware problem, or the PDC checksum comparison between ROM and RAM is failing. What version of PDC are you at? Either the PDC is corrupted, which might require a system board change, or it might need a PDC upgrade.

Have HP take a look into these areas as well.
Ashwani Kashyap
Honored Contributor

Re: Lost in possible hardware problem

Do you see anything in the GSP logs?
Jason Berendsen
Regular Advisor

Re: Lost in possible hardware problem

Watching very closely, I can see it is making it just past the alloc_pdc_pages. It looks like it is walking the buses: I see devices and paths scrolling by right when I get the system panic. I have attached a screenshot of what is seen at the console.
Patrick Wallek
Honored Contributor

Re: Lost in possible hardware problem

My thinking on this is that you may have a bad disk, controller (scsi or fibre) or something like that.

The key is that you said your bdf command was hanging. That is usually indicative of a hard disk, controller, or similar failure where the system can no longer access all of your VGs and LVs.

Now, since the system is panicking when you try to boot, I would try some things:

1) Unhook ALL external devices (both fibre and SCSI) and try booting the system. If it succeeds, hook the devices back up one at a time until it fails again.

2) If #1 doesn't work, you may have to start pulling I/O cards out to see if that has any effect. If the machine boots after a card is removed, then you may very well have found your culprit.

Jason Berendsen
Regular Advisor

Re: Lost in possible hardware problem

Patrick,

The ioscan command is what hung and prompted us to do a reboot, not a bdf. We have removed all external devices and all PCI cards, and the system still fails right after the alloc_pdc_pages. Once again we have removed processors and memory one by one to eliminate them as culprits. Any other ideas?
Jason Berendsen
Regular Advisor

Re: Lost in possible hardware problem

The problem has been resolved. After several tries I was able to get a screen print of the exact message at the time of failure. The error was "panic: lv_fixrootlv: Stale extent array overflow". Working with HP support, this focused us on the root volume group's mirror. The fix was to reduce the root volume group's mirrors and reboot. The system then booted fine and we were able to extend the mirrors back. Unfortunately, I have no idea what caused this problem. It is also odd that we were getting a stale extent error, because when I checked all extents within vg00, all were current.
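
The thread does not record the exact commands used for the fix. As a rough sketch of what reducing and later re-extending the root mirrors could look like, the script below only prints the lvreduce/lvextend commands (a dry run) so they can be reviewed first. The mirror disk path and the list of logical volumes are assumptions for illustration; the real ones would come from vgdisplay -v vg00 on the affected system.

```shell
#!/bin/sh
# Dry-run sketch: prints LVM commands, does not execute them.
# MIRROR_PV and LVOLS are hypothetical values, not taken from the thread.
MIRROR_PV=/dev/dsk/c2t6d0
LVOLS="lvol1 lvol2 lvol3"

mirror_cmds() {
    # $1 = "reduce" (drop the mirror copy) or "extend" (add it back)
    for lv in $LVOLS; do
        case "$1" in
            reduce) echo "lvreduce -m 0 /dev/vg00/$lv $MIRROR_PV" ;;
            extend) echo "lvextend -m 1 /dev/vg00/$lv $MIRROR_PV" ;;
        esac
    done
}

mirror_cmds reduce
echo "# reboot, confirm the system comes up cleanly, then:"
mirror_cmds extend
```

Printing rather than executing is deliberate: on a root volume group it is worth checking each line against the actual layout before running anything.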

Thanks for the help,

Jason
Todd McDaniel_1
Honored Contributor

Re: Lost in possible hardware problem

One more thing... Do you have hpux -lq set on your root disks?

I would set this parameter; it might have gotten you past this error. I say "might" because I have never seen this error before.

Usually, though, it should work.
Unix, the other white meat.
Steven E. Protter
Exalted Contributor

Re: Lost in possible hardware problem

Interrupt the boot at the 10 second prompt from the console.

sea


Are all the disks that you expect present?

If not, the earlier suggestions are good.

A physical inspection of disks, cables, power, and termination is in order.

There are some diagnostics that the hardware folks can run from the ISL prompt.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Richard Pereira_1
Regular Advisor
Solution

Re: Lost in possible hardware problem

If you want to be sure there's nothing wrong with the disks, try the following command. It should take a couple of seconds to run and will reveal any bad blocks. Run it on both your primary and mirror drives.

# echo 2400?20X | adb /dev/dsk/cxtxdx

It should return this output:

2400: 44454645 43543031 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

Any other non-zero numbers indicate bad blocks, and that disk should be changed.
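
If you are checking several disks, the comparison against that expected output can be scripted. This is only a sketch of the rule as stated above: the first two words must be 44454645 43543031 (ASCII for "DEFECT01") and every word after them must be zero. The function name check_bbdir is made up for this example, and the sample input is the expected output from this post, not a capture from a real disk.

```shell
#!/bin/sh
# check_bbdir (hypothetical helper): reads adb output on stdin and
# prints "clean" if it matches the expected pattern, else "suspect".
check_bbdir() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /:$/) continue      # skip the "2400:" address label
                w[++n] = $i
            }
        }
        END {
            ok = (n >= 2 && w[1] == "44454645" && w[2] == "43543031")
            for (i = 3; i <= n; i++)
                if (w[i] !~ /^0+$/) ok = 0   # any non-zero word fails the check
            print (ok ? "clean" : "suspect")
        }'
}

# Demo with the expected output quoted in this post.
check_bbdir <<'EOF'
2400: 44454645 43543031 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
EOF

# On a real system the input would come from the command above, e.g.:
#   echo 2400?20X | adb /dev/dsk/cxtxdx | check_bbdir
```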

Another option would be to try a dd read from both drives, but that could take some time.
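
A minimal sketch of the dd read, assuming the goal is simply to read each disk end to end and see whether dd reports an error. The function name read_check and the device paths in the usage comment are placeholders; substitute your own primary and mirror drives.

```shell
#!/bin/sh
# read_check (hypothetical helper): reads a device or file from start
# to end, discards the data, and reports whether dd finished cleanly.
read_check() {
    if dd if="$1" of=/dev/null bs=1024k 2>/dev/null; then
        echo "$1: read completed"
    else
        echo "$1: read error"
        return 1
    fi
}

# Check every path given on the command line, for example:
#   sh read_check.sh /dev/rdsk/c1t6d0 /dev/rdsk/c2t6d0
# (example device names only)
for d in "$@"; do
    read_check "$d"
done
```

A full read takes time roughly proportional to the disk size, so this is better suited to a maintenance window than the quick adb check.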