Operating System - Tru64 Unix

Carlos A. Munoz Lopez
Frequent Advisor

CPU is offline

Hi guys! I've been having a problem with an AlphaServer GS80. The server has two processors. One of them is offline, and I can't bring it back online. When I run

#psradm -n 1

it displays the following error:

"psradm: processor 0: Invalid argument"

sysman also shows that CPU1 has an error:

"CPU1 is unavailable and offline. Please use the event viewer to obtain information."

I tried to use sysman station to set it online but it failed again:

"The following errors ocurred while attempting to modify the selected CPUs:
Error for CPU 1:
This CPU cannot be placed online. It is most likely disabled at the system console. See documentation on the console's cpu_enabled variable."

I checked the cpu_enabled parameter and its value is FFFFF, which should mean that all processors in the server come online once the system is up, but processor 1 remains offline.
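
The reasoning about FFFFF can be checked with a little arithmetic. This is only a sketch, assuming cpu_enabled is a hexadecimal bitmask in which bit N corresponds to CPU N; the function name cpu_bit_set is made up for illustration.

```shell
# Report whether a given CPU's bit is set in a hexadecimal cpu_enabled value.
# Assumption: bit N of the mask corresponds to CPU N.
cpu_bit_set() {
    mask=$1
    cpu=$2
    echo $(( (0x$mask >> cpu) & 1 ))
}

cpu_bit_set FFFFF 1   # prints 1: the bit for CPU1 is set
cpu_bit_set FFFFD 1   # prints 0: the bit for CPU1 is clear
```

With FFFFF the bit for CPU1 is set, so under this assumption the console setting alone does not explain why the CPU stays offline.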

A few days ago the server kept halting frequently, so I checked the binerrorlog and found the following error:

"Problem Found: CPU1 short fill buffer over/underflow error of SoftQbb0"

"CPU1_Buf_Err is set in the QSD_ERR_SUM register. This bit indicates a short
fill buffer overflow and underflow error in the CPU1 interface in the QSD's
(quad switch data path). This error causes a fault condition.

Severity : 1"

CPU1 has been offline since that day.

I would like some help on this issue. Thanks a lot.
4 REPLIES
Mark Poeschl_2
Honored Contributor

Re: CPU is offline

Looks like you've got some H/W issues. Call HP support and have them look at it.
Ninad_1
Honored Contributor

Re: CPU is offline

Hi,

Do you see any error messages on the console or in /var/adm/messages?

Regards,
Ninad

Re: CPU is offline

Carlos:
While I agree with Mark Poeschl that you have a hardware issue, let's try a couple of things before HP tries to sell you a new EV68 that you'll have to wait a few weeks for ;-) This is going to be a wordy reply, Carlos, but I'm doing it so that someone googling the same problem might find it.

First off, do you happen to have Analyze (v.4 or better of WEBES I think) installed? Is that how you peeked in the BER? It's not really necessary but it might be nice if none of the stuff below works for you.

Next, for the sake of the OLAR discussion, I'm assuming you're running 5.1B-x because the indicted CPU is staying offline between reboots. It's academic anyway, since you can't replace a CPU in a powered-up GS80 AFAIK. While you can certainly indict and power off that CPU (in fact it's been done for you) via the CI/DF functions, I don't think you can physically remove it while the GS80 is running, because there's no good access to it without tripping an electrical interlock. The 160s and up have more OLAR-friendly physical designs.

Third, even though the CI system has marked the CPU as bad, offlined it, and written to the logs, have you looked at the CPU through hwmgr to see what it reports? Again, probably academic, but just being thorough. Try something like:

#hwmgr -view hier

you'll get something like:

..blah blah---------------------------
1: platform Compaq AlphaServer GS80 2/731
9: bus wfqbb0
10: connection wfqbb0slot0
11: bus wfiop0
12: connection wfiop0slot0
13: bus pci0
14: connection pci0slot1

o
o
...blah blah blah until you get to your CPUs
51: cpu qbb-0 CPU0
52: cpu qbb-0 CPU1
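
If you'd rather not eyeball the listing, the HWID lookup can be scripted. This is a minimal sketch that assumes the CPU lines look like the sample above ("52: cpu qbb-0 CPU1"); the helper name cpu_hwid is made up, and the real output on your system may be laid out differently.

```shell
# Print the HWID for a named CPU from `hwmgr -view hier`-style text on stdin.
# Assumes lines shaped like "52: cpu qbb-0 CPU1", as in the sample listing.
cpu_hwid() {
    awk -v name="$1" '$2 == "cpu" && $NF == name { sub(/:$/, "", $1); print $1 }'
}

# On a live system (not run here):
#   hwmgr -view hier | cpu_hwid CPU1
sample='51: cpu qbb-0 CPU0
52: cpu qbb-0 CPU1'
printf '%s\n' "$sample" | cpu_hwid CPU1   # prints 52
```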

Now that you know their HWIDs, you can do:

#hwmgr -status comp -id 52 #or whatever CPU1's HWID is

You'll get something that shows the CPU as "offline" and either "available" or not. The offline Access State will persist across reboots until you explicitly online it, but the more important flag is the Health State. If it's "Available", the CPU is probably good and you can just online it with hwmgr or sysman, but I'm betting it shows otherwise. You can try to manually clear the indictment by doing

#hwmgr -power off -name CPU1 #explicitly powers it off
#hwmgr -unindict -name CPU1 #obvious
#hwmgr -power on -name CPU1 #powers it back on
#hwmgr -online -name CPU1

This is where you need WEBES 4, I think. I can't remember if you can alter indictment states without it. IF and only if the processor survives the online health check, it should resume operation and the O/S will start scheduling work automatically. If not, you're back to square one.
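
The four-step sequence above can be wrapped so that it stops at the first step that fails. This is just a sketch: the hwmgr options are the ones quoted in this post, while the HWMGR variable and the reonline_cpu name are made up for illustration. HWMGR defaults to "echo hwmgr", so nothing is actually executed unless you set HWMGR=hwmgr yourself.

```shell
# HWMGR is a hypothetical override; the default only prints the commands.
HWMGR=${HWMGR:-echo hwmgr}

# Run the four steps from the post in order, stopping at the first failure.
reonline_cpu() {
    name=$1
    $HWMGR -power off -name "$name" &&
    $HWMGR -unindict -name "$name" &&
    $HWMGR -power on -name "$name" &&
    $HWMGR -online -name "$name"
}

# Dry run with the default HWMGR: prints the four hwmgr command lines.
reonline_cpu CPU1
```

A non-zero exit status from reonline_cpu means one of the steps failed and the later ones were skipped.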

So, let's assume it failed. Now you're going to have to halt and power off the system to gain physical access to the QBB where the "bad" CPU is. Might as well remove the cords for the power supplies to the QBB while you're at it. It's safer that way.

To wit: Last Fall, I used the OLAR method to replace a bad CPU in a GS160, resplendent in my white labcoat and rubber-soled Topsiders with matching static discharge wristband, of course ;-). When I leaned over the opened QBB, a metal fountain pen fell out of my pocket and into the machine, shorted across lots of shiny-blinky things and toasted the board. Fortunately we had depot parts to fix it with and the machine was only down a few hours, but it could have been avoided if power to the system had been removed. Anyway, I digress.

Depending on the age of your GS and the environmental conditions it lives in, you might be able to get away with a simple remove-clean-replace operation, but if that doesn't work, you probably really need a new CPU.

Once you do the remove-clean-replace business, reboot and see what hwmgr says. It's still going to be offline but it might be "available" as far as Health State is concerned. If so, you're good. Just power it on and online it; you don't have to scan for it or anything.

#hwmgr -power on -id 52 #whatever your CPU1 hwid is or use -name CPU1

#hwmgr -online -name CPU1 #or -id 52 (whatever your CPU1 hwid is)

The Automatic Deallocation Facility that indicted and offlined your CPU is controlled by /etc/olar.config (or /etc/olar.config.common in a TruCluster). You can edit that file to specifically prohibit the ADF from working on components, but I think it's all or none: you can turn it off for all CPUs but not just for CPU1. You probably don't want to mess with this since your system did exactly what it was supposed to do, but just know that you CAN.

I just had an afterthought: (DANGER: I'VE NEVER TRIED THIS SO YOU'RE ON YOUR OWN!!!) Here's a way to rule out other components and maybe confirm a fault with your suspect CPU. If you don't want to mess with this, skip the numbered steps below.

1). With the suspect CPU1 out of the system, button the QBB back up and boot it back up again. Do:
#hwmgr -view hier #to verify the system doesn't see the suspect CPU.

Now
#shutdown -h now

2). Power it off, remove power, open the QBB, remove CPU0, and install CPU1 in its place.

3). Boot up again and see what hwmgr thinks of the CPU. Obviously, if it's busted you're never going to boot. Probably let it run a while to see if a fault develops. If no problems:

4). Power it off, remove power, open the QBB, install the former CPU0 in CPU1's slot and reboot. This is designed to rule out a fault in the CPU slot on the board. Again, let it run a while and check hwmgr output for CPU status. You'll probably have to power on and online the hwid that represents CPU1 (the functional former CPU0) now because it should still be deallocated from before. If you get EVM messages indicting CPU0, then you know you've got a bum processor since it happens no matter where you install it. If you get an indictment of CPU1, then I'd suspect the slot, since that processor was working fine when installed elsewhere.
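
The logic of the swap test boils down to a small lookup. This is a sketch of the reasoning in step 4 only; the function name swap_verdict and its output strings are made up, and it assumes the suspect processor now sits in slot CPU0 with the known-good one in slot CPU1.

```shell
# Interpret the swap test. $1 is the slot that got indicted after the swap:
# CPU0, CPU1, or none. After the swap, the suspect processor sits in slot CPU0
# and the known-good processor sits in slot CPU1.
swap_verdict() {
    case $1 in
        CPU0) echo "processor" ;;            # fault followed the suspect chip
        CPU1) echo "slot" ;;                 # good chip indicted in the old slot
        none) echo "no fault reproduced" ;;
        *)    echo "unknown input" ;;
    esac
}

swap_verdict CPU0   # prints processor
swap_verdict CPU1   # prints slot
```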

OK. That's about all I can think of but if you're still having problems and you aren't ready to open a hardware case with HP yet, let me know.

Jack
_____________________
Jack M. Estes II, PhD
Managing Partner
Aegis Technology Partners, LLC
(rest autosnipped for spam control)
Carlos A. Munoz Lopez
Frequent Advisor

Re: CPU is offline

Ok, I'm going to try that out. I'll let you know. Thanks.