messages - CPU error

Karthik S S
Honored Contributor

Hi,

I get the following error repeatedly on our AlphaServer 4100 (Tru64 OSF1 v5.1). What could be the problem?
------------------------------------
Mar 15 12:37:08 kyle last message repeated 2 times
Mar 15 12:37:08 kyle vmunix: WARNING: too many System corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: WARNING: too many System corrected errors detected on cpu 16. Reporting suspended.
Mar 15 12:37:08 kyle last message repeated 2 times
Mar 15 12:37:08 kyle vmunix: WARNING: too many Processor corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: WARNING: too many Processor corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: datalink: links=128, macs=6
Mar 15 12:37:09 kyle vmunix: /var: file system full
Mar 15 12:37:09 kyle vmunix: WARNING: too many Processor corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:25 kyle vmunix: Environmental Monitoring Subsystem Configured.
Mar 15 12:37:52 kyle vmunix: SuperLAT. Copyright 1994 Meridian Technology Corp. All rights reserved.
Mar 15 12:38:00 kyle vmunix: netbeui_configure(CFG_OP_CONFIGURE)
------------------------------------

psrinfo reports no error.

Please help.

Thanks,
Karthik S S
For a list of all the ways technology has failed to improve the quality of life, please press three. - Alice Kahn
22 REPLIES
Karthik S S
Honored Contributor

Re: messages - CPU error

uerf -r 100 shows the following (output truncated):

# uerf -r 100 | more
uerf version 4.2-011 (122)


********************************* ENTRY 1. *********************************
----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 2.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed May 28 18:18:10 2003
OCCURRED ON SYSTEM kyle
SYSTEM ID x00070016
SYSTYPE x00000002
PROCESSOR COUNT 4.
PROCESSOR WHO LOGGED x00000000

----- UNIT INFORMATION -----

UNIT CLASS CPU

********************************* ENTRY 2. *********************************



-----------

-Karthik S S
Michael Schulte zur Sur
Honored Contributor

Re: messages - CPU error

Hi,

Besides the /var full problem, which I hope you have corrected by now, I would assume it is a CPU problem. Use DECevent to look into the binary error log; its output is much more detailed. In any case, this is a case for opening a call with HP.

greetings,

Michael
Karthik S S
Honored Contributor

Re: messages - CPU error

Hi Michael,

I am new to Tru64 and I am not in front of the system. In fact, I am helping another user with this problem. After reading your reply, I just realized that /var is full! But I wonder, how did you assume that I had corrected this problem? :-))

I will try to free up some space. By the way, what is the relation between the /var filesystem and the CPU error messages?

Thanks,
Karthik S S
Karthik S S
Honored Contributor

Re: messages - CPU error

Oh my ...

That info is right there in the messages file :-( ... I didn't go through it properly.

Thanks,
Karthik S S
Michael Schulte zur Sur
Honored Contributor

Re: messages - CPU error

Hi Karthik,

You were so kind as to post it! ;-))
Mar 15 12:37:09 kyle vmunix: /var: file system full

And I didn't want to insult you by assuming you would not see it! ;-)

Now to your question:
There is no relation between file system full and cpu errors.

I have seen these errors more than once.

Call HP.

greetings,

Michael
Hein van den Heuvel
Honored Contributor

Re: messages - CPU error

> Now to your question:
> There is no relation between file system full and cpu errors.

Other than /var/adm/messages, syslog.dated and other logs filling up while recording those error messages :-).

Why does it report "cpu 16" and "cpu 24"?
Are those the hardware IDs of your CPUs?
Maybe check with 'hwmgr -v h'?

If this happened to a box of mine, I would give it one chance with a 'hardware reset': power down, re-seat the CPU modules, power up. If it comes back, then it was a serious problem, like a CPU cache failure.

fwiw,
Hein.
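
To see which CPU numbers the kernel is complaining about, and how often, the warnings in the messages file can be tallied. The following is only a sketch in Python, assuming the message wording shown in the first post; the function names and the SAMPLE excerpt are illustrative. It joins continuation lines in case a long entry has been wrapped, and it does not try to expand "last message repeated N times" lines.

```python
import re
from collections import Counter

# A syslog entry starts with a timestamp such as "Mar 15 12:37:08 ".
TIMESTAMP_RE = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")

# Warning text as it appears in the posted log. \s* tolerates a line
# wrap that swallowed the space between "detected" and "on".
WARNING_RE = re.compile(
    r"too many (System|Processor) corrected errors detected\s*on cpu (\d+)"
)


def unwrap(lines):
    """Glue lines without a timestamp onto the preceding entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line
    return entries


def tally_suspended_warnings(lines):
    """Count 'Reporting suspended' warnings per (error kind, cpu number)."""
    counts = Counter()
    for entry in unwrap(lines):
        match = WARNING_RE.search(entry)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Excerpt of the posted messages file, wrapped as in the original paste.
SAMPLE = """\
Mar 15 12:37:08 kyle vmunix: WARNING: too many System corrected errors detected
on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: WARNING: too many System corrected errors detected
on cpu 16. Reporting suspended.
Mar 15 12:37:08 kyle last message repeated 2 times
Mar 15 12:37:08 kyle vmunix: WARNING: too many Processor corrected errors detect
ed on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: WARNING: too many Processor corrected errors detect
ed on cpu 24. Reporting suspended.
Mar 15 12:37:09 kyle vmunix: /var: file system full
Mar 15 12:37:09 kyle vmunix: WARNING: too many Processor corrected errors detect
ed on cpu 24. Reporting suspended.
""".splitlines()

for (kind, cpu), count in sorted(tally_suspended_warnings(SAMPLE).items()):
    print(f"{kind} corrected errors, cpu {cpu}: {count} warning(s)")
```

On a live system the same function could be fed the lines of /var/adm/messages instead of SAMPLE. The tally only says which numbers the kernel prints; it does not explain what those numbers map to.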

Michael Schulte zur Sur
Honored Contributor

Re: messages - CPU error

Hi,

I think Hein might be right about the hardware ID.

If it comes back, then it was a serious problem. If not, then it is a serious problem.
Oh, now I see, you meant the problem and not the machine. ;-))

If you can shut down the machine for 30 minutes, you can run a test from the console prompt.

greetings,

Michael

Dawn Urey
Occasional Advisor

Re: messages - CPU error

I recently had the same problem with my ES40 and had to replace memory DIMMs on my system. The memory was actually reporting problems, which showed up in my error logs as machine checks. You may want to get HP to diagnose your binary.errlog.
Karthik S S
Honored Contributor

Re: messages - CPU error

hwmgr output (truncated):

----------
HWID: hardware hierarchy
-----------------------------------------------------------------
1: platform AlphaServer 4100 5/600 8MB
2: cpu CPU0
3: cpu CPU1
4: cpu CPU2
5: cpu CPU3
9: bus mcbus0
10: connection mcbus0slot5
11: bus pci1
12: connection pci1slot1
22: scsi_adapter psiop0
23: scsi_bus scsi0
52: disk bus-0-targ-5-lun-0 cdrom0
14: connection pci1slot2
24: scsi_adapter isp0
25: scsi_bus scsi1
53: disk bus-1-targ-1-lun-0 dsk0
54: disk bus-1-targ-3-lun-0 dsk1
55: disk bus-1-targ-5-lun-0 dsk2
16: connection pci1slot3
----------

No idea why it reports CPU 16 and 24.

-Karthik S S
Karthik S S
Honored Contributor

Re: messages - CPU error

Thank you all ...

I will ask the user to call up HP.

Thanks,
Karthik S S
Mobeen_1
Esteemed Contributor

Re: messages - CPU error

Karthik,
I have seen this happen many times. As many of our colleagues have suggested, you could try the following:

1. Power down your machine and reseat the CPUs, or even swap them, and see how things go. This will also confirm whether the error is actually on the CPU: as the CPU position changes when you swap, the error, if any, should then be reported at a different CPU location.

2. Use Decevent to look for any errors
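
The reasoning behind the swap test in step 1 can be written down as a small decision function. This is only a sketch in Python; the function name, the slot labels and the returned strings are made up for illustration and are not part of any Tru64 tool.

```python
def interpret_swap(slot_before, slot_after, swapped_slots):
    """Interpret a swap test between two CPU slots.

    slot_before   -- slot that reported errors before the swap
    slot_after    -- slot that reports errors after the swap,
                     or None if no errors are reported any more
    swapped_slots -- the two slots whose modules were exchanged
    """
    first, second = swapped_slots
    if slot_before not in (first, second):
        raise ValueError("the suspect slot was not part of the swap")
    other = second if slot_before == first else first

    if slot_after is None:
        # Errors stopped: reseating may have fixed a contact problem.
        return "cleared"
    if slot_after == other:
        # The error moved together with the module.
        return "module"
    if slot_after == slot_before:
        # The error stayed put although the module changed.
        return "slot"
    return "inconclusive"


# Hypothetical example: errors on slot 0, modules in slots 0 and 1 swapped,
# and the errors now show up on slot 1.
print(interpret_swap(0, 1, (0, 1)))
```

A "module" result points at the CPU module itself, while "slot" suggests looking at the slot, the backplane or something else that did not move.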

Many times these errors are caused by environmental factors, mostly CPU fan failures. But where a CPU fan has failed, the system will surely log a message about the fan failure.

If I were you, I would log a call with HP and have them change the CPUs in question. It really depends on how critical these servers are at your site. In my case, I cannot afford downtime, so if there is any doubt, HP and I decide to replace them without taking any chances.

I have also seen many posts in this forum with the same issue. It would be interesting to see what resolutions others have found without having to replace the CPUs.

regards
Mobeen
Karthik S S
Honored Contributor

Re: messages - CPU error

Mobeen,

Thanks, I will try that before calling HP.

By the way, how do I use DECevent? Is it a command?

Thanks,
Karthik S S
Mobeen_1
Esteemed Contributor

Re: messages - CPU error

Hello Karthik,
Please review the link below and you will be able to use those commands.

http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V40G_HTML/AQTLSBTE/DOCU_008.HTM

Let me know if you still have issues trying to figure it out.

regards
Mobeen
Michael Schulte zur Sur
Honored Contributor

Re: messages - CPU error

Karthik,

the easiest way to use decevent is:
dia -R | more
to let it go backward in time.

Michael
Karthik S S
Honored Contributor

Re: messages - CPU error

Thanks Michael,

Some of the output from decevent,

------

Machine Check Reason x0086 Alpha Chip Detected ECC Err, From B-Cache
Ext Interface Status Reg xFFFFFFF085FFFFFF
DATA SOURCE IS BCACHE
CORRECTABLE ECC ERROR
D-ref fill

Machine Check Reason x0086 Alpha Chip Detected ECC Err, From B-Cache

Ext Interface Status Reg xFFFFFFF085FFFFFF
DATA SOURCE IS BCACHE
CORRECTABLE ECC ERROR
D-ref fill
EV5 Chip Rev 5
Ext Interface Address Reg xFFFFFF0029831EAF
Fill Syndrome Reg x000000000000B500
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000

WHOAMI x00000000 CPU0 Detected This Error

---------------------------

Looks like a CPU cache / memory error.

Thanks,
Karthik S S
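
When the DECevent report is long, it can help to count the machine checks per reporting CPU. The following is only a sketch in Python, assuming the field labels seen in the excerpt above ("Machine Check Reason", "CORRECTABLE ECC ERROR", "WHOAMI"); the function name is illustrative and other report layouts may need different patterns.

```python
import re
from collections import Counter

REASON_RE = re.compile(r"^\s*Machine Check Reason\s+(x[0-9A-Fa-f]+)\s+(.*\S)")
WHOAMI_RE = re.compile(
    r"^\s*WHOAMI\s+x[0-9A-Fa-f]+\s+(CPU\d+)\s+Detected This Error"
)


def summarize_machine_checks(report_text):
    """Count entries by (reporting CPU, reason code, correctable flag).

    Each "Machine Check Reason" line starts a new entry. The CPU is None
    when the entry has no WHOAMI line, for example in truncated output.
    """
    entries = []
    for line in report_text.splitlines():
        reason = REASON_RE.match(line)
        if reason:
            entries.append(
                {"code": reason.group(1), "cpu": None, "correctable": False}
            )
            continue
        if not entries:
            continue
        # Exact match, so a longer label containing this text is not counted.
        if line.strip() == "CORRECTABLE ECC ERROR":
            entries[-1]["correctable"] = True
        who = WHOAMI_RE.match(line)
        if who:
            entries[-1]["cpu"] = who.group(1)
    return Counter((e["cpu"], e["code"], e["correctable"]) for e in entries)


# Excerpt of the posted DECevent output.
SAMPLE_REPORT = """\
Machine Check Reason x0086 Alpha Chip Detected ECC Err, From B-Cache
Ext Interface Status Reg xFFFFFFF085FFFFFF
DATA SOURCE IS BCACHE
CORRECTABLE ECC ERROR
D-ref fill

Machine Check Reason x0086 Alpha Chip Detected ECC Err, From B-Cache

Ext Interface Status Reg xFFFFFFF085FFFFFF
DATA SOURCE IS BCACHE
CORRECTABLE ECC ERROR
D-ref fill
EV5 Chip Rev 5

WHOAMI x00000000 CPU0 Detected This Error
"""

for key, count in summarize_machine_checks(SAMPLE_REPORT).items():
    print(key, count)
```

If most entries come from the same CPU, that module is the natural first suspect for reseating or swapping.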
Mobeen_1
Esteemed Contributor

Re: messages - CPU error

Karthik,
From the DECevent output it looks like this is a correctable cache error. I would suggest that you power down the machine and take this opportunity to remove the CPUs and put them back into their sockets :-)

I think the cache will be cleared totally when the machine is powered down.

Regards
Mobeen
Karthik S S
Honored Contributor

Re: messages - CPU error

Thank you Mobeen .. I have asked my colleague to do that. Machine is not located in my office :-)


-Karthik S S
Mobeen_1
Esteemed Contributor

Re: messages - CPU error

Karthik,
That's great. I am sure that should help.

I would appreciate it if you could post back whether things were OK after doing that. It looks like many people are having similar issues on the Alphas :-) and your post will most certainly help them all.

regards
Mobeen

Karthik S S
Honored Contributor

Re: messages - CPU error

I will do that Mobeen .. but there may be a delay as I am not involved in this issue directly ..

-Karthik S S
Michael Schulte zur Sur
Honored Contributor

Re: messages - CPU error

Karthik,

If you have a maintenance contract for the machine, I would not hesitate to open a call with HP. Let them worry about the details. Depending on the importance of the machine, I wouldn't try too much to correct such problems myself. I guess they will exchange the CPU.

Is it hard to get downtime for the machine?

Michael
Karthik S S
Honored Contributor

Re: messages - CPU error

Michael,

I have left the decision to my colleague. He told me that he will speak to the user and find out. It looks like they do not have a maintenance contract, so he might settle for reseating the CPUs and seeing whether the problem occurs again. He might do it by this evening.

-Karthik S S
Karthik S S
Honored Contributor

Re: messages - CPU error

This was the reply I received from my colleague,

--------------------------------

Hi Karthik

The problem is solved at last !!!!!

As you suggested, I removed all the CPUs and reconnected them one by one.
After trying several combinations, I noticed that one CPU had gone bad.
I removed that CPU and now it is booting without any problem.

Thanks for your time and help.

regards
jagga

-------------------------------------------
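
The one-by-one procedure described in the reply can be sketched as follows. This is only an illustration in Python: boots_cleanly is a hypothetical callback standing in for the manual step of installing the given modules, powering up and checking for errors, and the sketch assumes that each fault can be attributed to a single module.

```python
def isolate_bad_modules(modules, boots_cleanly):
    """Add modules one at a time and keep only those that boot cleanly.

    modules       -- list of module labels, in the order they are tried
    boots_cleanly -- hypothetical callback taking the list of installed
                     modules and returning True if the system comes up
                     without errors
    Returns (good, bad) lists of module labels.
    """
    good, bad = [], []
    for module in modules:
        if boots_cleanly(good + [module]):
            good.append(module)
        else:
            # The newly added module broke a working set: set it aside.
            bad.append(module)
    return good, bad


# Simulated run in which the module labelled "cpu2" is the faulty one.
def simulated_boot(installed):
    return "cpu2" not in installed


good, bad = isolate_bad_modules(["cpu0", "cpu1", "cpu2", "cpu3"], simulated_boot)
print("good:", good)
print("bad:", bad)
```

Each call to the callback corresponds to one power cycle, so four modules need four boots in this scheme.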


Thanks go to ITRC :-)

-Karthik S S