03-14-2004 10:21 PM
messages - CPU error
I get the following error repeatedly on our AlphaServer 4100 (Tru64 OSF1 v5.1). What could be the problem?
------------------------------------
Mar 15 12:37:08 kyle last message repeated 2 times
Mar 15 12:37:08 kyle vmunix: WARNING: too many System corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: WARNING: too many System corrected errors detected on cpu 16. Reporting suspended.
Mar 15 12:37:08 kyle last message repeated 2 times
Mar 15 12:37:08 kyle vmunix: WARNING: too many Processor corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: WARNING: too many Processor corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:08 kyle vmunix: datalink: links=128, macs=6
Mar 15 12:37:09 kyle vmunix: /var: file system full
Mar 15 12:37:09 kyle vmunix: WARNING: too many Processor corrected errors detected on cpu 24. Reporting suspended.
Mar 15 12:37:25 kyle vmunix: Environmental Monitoring Subsystem Configured.
Mar 15 12:37:52 kyle vmunix: SuperLAT. Copyright 1994 Meridian Technology Corp. All rights reserved.
Mar 15 12:38:00 kyle vmunix: netbeui_configure(CFG_OP_CONFIGURE)
------------------------------------
psrinfo reports no error.
Please help.
Thanks,
Karthik S S
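For anyone triaging a similar log, the warnings quoted above can be tallied per CPU with a short script. This is only a sketch: it assumes the messages file has been copied to a machine with Python 3, the function name `tally_corrected_errors` is made up for illustration, and it assumes that a "last message repeated N times" line means N extra occurrences of the entry just before it. It copes with warnings that were wrapped across two lines, as well as unwrapped ones.

```python
import re
from collections import Counter

# A syslog entry starts with a timestamp such as "Mar 15 12:37:08 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
# Both patterns are matched against text with all whitespace removed, so a
# warning that was wrapped mid-word ("detect" / "ed") is still recognised.
WARNING = re.compile(r"toomany(System|Processor)correctederrorsdetectedoncpu(\d+)")
REPEATED = re.compile(r"lastmessagerepeated(\d+)times")


def tally_corrected_errors(lines):
    """Count corrected-error warnings per (kind, cpu) in syslog-style lines."""
    # Step 1: glue continuation lines (no leading timestamp) onto their entry.
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line

    # Step 2: count warnings, crediting "last message repeated N times"
    # to the warning that immediately precedes it (an assumption).
    counts = Counter()
    previous = None
    for entry in entries:
        compact = re.sub(r"\s+", "", entry)
        repeat = REPEATED.search(compact)
        if repeat:
            if previous is not None:
                counts[previous] += int(repeat.group(1))
            continue
        hit = WARNING.search(compact)
        previous = (hit.group(1), int(hit.group(2))) if hit else None
        if previous is not None:
            counts[previous] += 1
    return counts
```

Calling it as `tally_corrected_errors(open("messages"))` returns a `Counter` keyed by pairs such as `("Processor", 24)`, which makes it easy to see whether one CPU number dominates.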
03-14-2004 10:28 PM
Re: messages - CPU error
(output truncated)
# uerf -r 100 | more
uerf version 4.2-011 (122)
********************************* ENTRY 1. *********************************
----- EVENT INFORMATION -----
EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 2.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed May 28 18:18:10 2003
OCCURRED ON SYSTEM kyle
SYSTEM ID x00070016
SYSTYPE x00000002
PROCESSOR COUNT 4.
PROCESSOR WHO LOGGED x00000000
----- UNIT INFORMATION -----
UNIT CLASS CPU
********************************* ENTRY 2. *********************************
-----------
-Karthik S S
03-14-2004 11:16 PM
Re: messages - CPU error
Besides the /var full problem, which I hope you have corrected by now, I would assume it is a CPU problem. Use DECevent to look into the binary error log; it is much more detailed. Anyway, this is a case for opening a call with HP.
greetings,
Michael
03-14-2004 11:21 PM
Re: messages - CPU error
I am new to Tru64 and I am not in front of the system. In fact, I am helping another user with this problem. After reading your reply, I just realized that /var is full ..!! But I wonder, how did you assume that I had corrected this problem?? :-))
I will try to free up some space. By the way, what is the relation between the /var filesystem and the CPU error messages?
Thanks,
Karthik S S
03-14-2004 11:24 PM
Re: messages - CPU error
That info is right there in the messages file :-( ... I didn't go through it properly ..
Thanks,
Karthik S S
03-15-2004 12:35 AM
Re: messages - CPU error
You were so kind as to post it! ;-))
Mar 15 12:37:09 kyle vmunix: /var: file system full
And I didn't want to insult you by assuming you would not see it! ;-)
Now to your question:
There is no relation between file system full and cpu errors.
I have seen these errors more than once.
Call HP.
greetings,
Michael
03-15-2004 01:56 AM
Re: messages - CPU error
> There is no relation between file system full and cpu errors.
Other than /var/adm/messages, syslog.dated and other stuff filling up from recording those error messages :-).
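To see which log files are actually eating the space, something like the following sketch can help. It assumes a Python 3 interpreter is available, which may not be the case on the Tru64 box itself (the usual `df` and `du` tools give the same information there); the function names are made up for illustration.

```python
import os
import shutil


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under root, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:limit]


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

For example, `largest_files("/var/adm", limit=5)` lists the five biggest files below /var/adm, and `percent_used("/var")` shows how full the filesystem is.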
Why does it report "cpu 16" and "cpu 24"?
Are those the hardware IDs for your CPUs?
Maybe check with 'hwmgr -v h'?
If this happened to a box of mine, I would give it one chance with a 'hardware reset': power down, re-seat the CPU modules, power up. If it comes back, then it was a serious problem, like a CPU cache failure.
fwiw,
Hein.
03-15-2004 03:01 AM
Re: messages - CPU error
I think, Hein might be right with the hardware id.
If it comes back, then it was a serious problem. If not, then it is a serious problem.
Oh, now I see, you meant the problem and not the machine. ;-))
If you can shut down the machine for 30 minutes, you can run a test from the console prompt.
greetings,
Michael
03-15-2004 06:03 AM
Re: messages - CPU error
03-15-2004 03:44 PM
Re: messages - CPU error
----------
HWID: hardware hierarchy
-----------------------------------------------------------------
1: platform AlphaServer 4100 5/600 8MB
2: cpu CPU0
3: cpu CPU1
4: cpu CPU2
5: cpu CPU3
9: bus mcbus0
10: connection mcbus0slot5
11: bus pci1
12: connection pci1slot1
22: scsi_adapter psiop0
23: scsi_bus scsi0
52: disk bus-0-targ-5-lun-0 cdrom0
14: connection pci1slot2
24: scsi_adapter isp0
25: scsi_bus scsi1
53: disk bus-1-targ-1-lun-0 dsk0
54: disk bus-1-targ-3-lun-0 dsk1
55: disk bus-1-targ-5-lun-0 dsk2
16: connection pci1slot3
----------
No idea why it reports CPU 16 and 24.
-Karthik S S
03-15-2004 03:45 PM
Re: messages - CPU error
I will ask the user to call up HP.
Thanks,
Karthik S S
03-15-2004 04:18 PM
Re: messages - CPU error
I have seen this happen many times. As our colleagues have suggested, you could try:
1. Power down your machine and re-seat the CPUs, or even swap them, and see how things go. This will also confirm whether the error is actually on the CPU: since the CPU position changes when you swap, the error, if any, should then be reported at a different CPU location.
2. Use DECevent to look for any errors.
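The reasoning behind the swap test in step 1 can be written down as a small decision function. This is an illustrative sketch only: the slot numbers are whatever position the error reports name, the function and its wording are hypothetical, and the "stayed with the slot" conclusion is a logical inference rather than something stated in this thread.

```python
def interpret_swap(faulty_slot_before, swapped_with, faulty_slot_after):
    """Interpret where the error shows up after swapping two CPU modules.

    faulty_slot_before: slot that reported errors before the swap
    swapped_with:       slot whose module was exchanged with it
    faulty_slot_after:  slot reporting errors after the swap, or None
    """
    if faulty_slot_after is None:
        # Re-seating alone sometimes clears the fault; it may also be intermittent.
        return "no error seen: re-seating may have helped, keep monitoring"
    if faulty_slot_after == swapped_with:
        # The fault travelled with the module to its new position.
        return "error followed the module: suspect the CPU module"
    if faulty_slot_after == faulty_slot_before:
        # A different module now sits here, yet the fault is still reported here.
        return "error stayed with the slot: suspect the slot or system board"
    return "error moved elsewhere: inconclusive"
```

For example, if slot 0 was reporting errors, its module was swapped with the one in slot 2, and the errors are now reported for slot 2, `interpret_swap(0, 2, 2)` points at the module itself.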
Many times these errors are caused by environmental factors, mostly CPU fan failures. But where a CPU fan has failed, you will for sure get a message about the fan failure.
If I were you, I would log a call with HP and have them change the CPUs in question. It really depends on how critical these servers are at your site. In my case I cannot afford downtime, so if there is any doubt, HP and I decide to replace them without taking any chances.
I have also seen many posts in this forum with the same issue. It would be interesting to see what resolutions others have found without having to replace the CPUs.
regards
Mobeen
03-15-2004 04:56 PM
Re: messages - CPU error
Thanks, I will try that before calling HP...
By the way, how do I use DECevent? Is it a command?
Thanks,
Karthik S S
03-15-2004 05:34 PM
Re: messages - CPU error
Please review the link below and you will be able to use those commands.
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V40G_HTML/AQTLSBTE/DOCU_008.HTM
Let me know if you still have issues trying to figure it out.
regards
Mobeen
03-15-2004 06:35 PM
Re: messages - CPU error
The easiest way to use DECevent is:
dia -R | more
The -R makes it go backward in time (newest entries first).
Michael
03-15-2004 06:43 PM
Re: messages - CPU error
Some of the output from decevent,
------
Machine Check Reason x0086 Alpha Chip Detected ECC Err, From B-Cache
stdin
Ext Interface Status Reg xFFFFFFF085FFFFFF
DATA SOURCE IS BCACHE
CORRECTABLE ECC ERROR
D-ref fill
Machine Check Reason x0086 Alpha Chip Detected ECC Err, From B-Cache
Ext Interface Status Reg xFFFFFFF085FFFFFF
DATA SOURCE IS BCACHE
CORRECTABLE ECC ERROR
D-ref fill
EV5 Chip Rev 5
Ext Interface Address Reg xFFFFFF0029831EAF
Fill Syndrome Reg x000000000000B500
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000000 CPU0 Detected This Error
---------------------------
Looks like a CPU cache / memory error.
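When the DECevent report is long, the few fields that matter here (machine check reason, correctable ECC flags, and which CPU detected the error) can be pulled out with a small script. This is a sketch that assumes the report has been saved as plain text and that the fields look exactly like the lines pasted above; the function name `summarize_decevent` is made up for illustration.

```python
import re
from collections import Counter

# Field layouts below are taken from the pasted excerpt, not from a format spec.
REASON = re.compile(r"^Machine Check Reason\s+x([0-9A-Fa-f]+)\s+(.*)$")
WHOAMI = re.compile(r"^WHOAMI\s+x[0-9A-Fa-f]+\s+(CPU\d+) Detected This Error")


def summarize_decevent(text):
    """Count machine-check reasons, correctable ECC flags and detecting CPUs."""
    reasons = Counter()
    cpus = Counter()
    correctable = 0
    for raw in text.splitlines():
        line = raw.strip()
        reason = REASON.match(line)
        if reason:
            reasons[(reason.group(1), reason.group(2).strip())] += 1
            continue
        who = WHOAMI.match(line)
        if who:
            cpus[who.group(1)] += 1
            continue
        if line == "CORRECTABLE ECC ERROR":
            correctable += 1
    return {"reasons": reasons, "cpus": cpus, "correctable": correctable}
```

Run over the excerpt above, the summary shows reason code 0086 (ECC error from the B-cache) and CPU0 as the detecting processor, which is what points at a cache problem on one module.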
Thanks,
Karthik S S
03-15-2004 06:50 PM
Re: messages - CPU error
From the DECevent output it looks like this is a correctable cache error. I would suggest that you power down the machine and take this opportunity to remove the CPUs and put them back into their sockets :-)
I think the cache will be cleared totally when the machine is powered down.
Regards
Mobeen
03-15-2004 06:52 PM
Re: messages - CPU error
-Karthik S S
03-15-2004 06:58 PM
Re: messages - CPU error
That's great. I am sure that should help.
I would appreciate it if you could post back whether things were OK after doing that. It looks like many people are having similar issues on the Alphas :-) and your post will most certainly help them all.
regards
Mobeen
03-15-2004 07:03 PM
Re: messages - CPU error
-KarthiK S S
03-15-2004 08:09 PM
Re: messages - CPU error
If you have a maintenance contract for the machine, I would not hesitate to open a call with HP. Let them worry about the details. Depending on the importance of the machine, I wouldn't try too much to correct such problems myself. I guess they will exchange the CPU.
Is it hard to get downtime for the machine?
Michael
03-15-2004 08:23 PM
Re: messages - CPU error
I have left the decision to my colleague. He told me that he will speak to the user and find out. It looks like they do not have a maintenance contract, so he might settle for re-seating the CPUs and seeing if the problem occurs again ... he might do it by this evening.
-Karthik S S
03-18-2004 03:16 PM
Re: messages - CPU error
--------------------------------
Hi Karthik
The problem is solved at last !!!!!
As u told i removed all cpu s and connected one by one.
After some permutation and combination i noticed one cpu has gone bad.
I removed that cpu and now it is booting without any problem.
Thanks for ur time and help.
regards
jagga
-------------------------------------------
Thanks goes to ITRC :-)
-Karthik S S