
Re: Single Bit Error on RX3600

 
cam9269
Regular Advisor

Single Bit Error on RX3600

Hi Guys,

One of my Integrity servers rebooted without producing a dump file in /var/adm/crash (because savecrash wasn't configured properly). The only clue I have is the entries in /var/opt/resmon/rst.log, which contain the description below. I searched for similar cases here and found that I may be running into memory problems on this server, but I need to know which module has the SBE error described here so that I can replace the right one. Anyone got an idea how to do this? TIA!

=================================================
>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Mon Mar 3 20:49:53 2008

hperpln sent Event Monitor notification information:

/system/events/memory_ia64/memory is >= 1.
Its current value is SERIOUS(4).



Event data from monitor:

Event Time..........: Mon Mar 3 20:49:53 2008
Severity............: SERIOUS
Monitor.............: memory_ia64
Event #.............: 1400
System..............: hperpln

Summary:
Memory Event Type : A memory page has been deallocated and entered into
the Page Deallocation Table (PDT).


Description of Error:

The Page Deallocation Table (PDT) is 0% full.

PDT Entries Used: 0
PDT Entries Free: 100
PDT Total Size: 100

All the memory pages have been deallocated due to excessive correctable
single bit errors being detected. This condition indicates a problem and
will result in loss of information.

Probable Cause / Recommended Action:

The Page Deallocation Table (PDT) is full and will overflow. If pages
continue to be deallocated, data loss may result. Contact your HP support
representative to check the memory boards.

Additional Event Data:
System IP Address...: 172.25.199.14
Event Id............: 0x47cbc9c100000002
Monitor Version.....: B.01.00
Event Class.........: Memory
Client Configuration File...........:
/var/stm/config/tools/monitor/rst_memory_ia64.clcfg
Client Configuration File Version...: H.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: ia64 hp server rx3600
OS Version......................: B.11.31
System Firmware Version.........: 02.03
System Serial Number............: DEH47161D1
System Software ID..............: 0781790263
EMS Version.....................: A.04.20
STM Version.....................: D.02.00
System Current Product Number...: AB596A
System Original Product Number..: unavailable
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/memory_ia64.htm#1400

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v




>---------- End Event Monitoring Service Event Notification ----------<
5 REPLIES
Murat SULUHAN
Honored Contributor

Re: Single Bit Error on RX3600

Hi

Severity is not fatal.

Do you have an entry like /var/tombstones/ts99 with the same timestamp as your reboot?

Best Regards
Murat
Murat Suluhan
cam9269
Regular Advisor

Re: Single Bit Error on RX3600

Hi Murat,

Thanks for replying. I'm sorry, but the system does not seem to have produced any ts99 output in /var/tombstones. BTW, I'm on v11.31 (if that is any help). Did I miss any other configuration?

Since this error is non-fatal, it could recur, is that right? If so, we want to stop it at this early stage by finding out which memory module is failing and replacing it immediately.

Regards
Stefan Stechemesser
Honored Contributor

Re: Single Bit Error on RX3600

Hi,

This is not an error; it is a known bug in the ISEE software. This message should be generated when the PDT is 100% full, but due to an error in the config file it is generated when the PDT is 0% full, i.e. empty.

ignore it.
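
To double-check that a given notification is the spurious one, a minimal Python sketch along these lines could read the PDT counters out of the rst.log text. The function names are made up for illustration, and the field labels are simply the ones that appear in the notification quoted above; treat it as a sanity check, not as an HP tool.

```python
import re

def pdt_usage(event_text):
    """Return (used, total) PDT entries parsed from an EMS notification, or None."""
    used = re.search(r"PDT Entries Used:\s*(\d+)", event_text)
    total = re.search(r"PDT Total Size:\s*(\d+)", event_text)
    if not used or not total:
        return None
    return int(used.group(1)), int(total.group(1))

def looks_spurious(event_text):
    """True when the notification reports zero used PDT entries."""
    usage = pdt_usage(event_text)
    return usage is not None and usage[0] == 0

# Excerpt copied from the notification in the first post.
sample = """\
The Page Deallocation Table (PDT) is 0% full.

PDT Entries Used: 0
PDT Entries Free: 100
PDT Total Size: 100
"""
print(pdt_usage(sample), looks_spurious(sample))  # (0, 100) True
```

A notification with a non-zero "PDT Entries Used" count would not be flagged and deserves a closer look.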

The cause of the system reboot must be something else. The memory subsystem of the rx3600 is extremely reliable: up to 16 bit errors ("double byte errors") in one cache line can be corrected ("double chip kill").

I would recommend examining the System Event Log and Forward Progress Log from the Management Processor (MP) Main Menu with the "sl" command. Hopefully you will find additional events there, such as power failures, environmental problems or whatever else could cause a system to go down without a trace.

Examining the console log ("cl" command in the MP Main Menu) also shows interesting data in many cases (panic strings etc.).

By the way, even if savecrash is not properly configured, the dump may still be in the swap space. You can run savecrash manually ("man savecrash") to save the dump to a location with enough space or directly onto a tape.
cam9269
Regular Advisor

Re: Single Bit Error on RX3600

Hi Stefan,

Thanks for the advice, very much appreciated. I tried checking the SL/CL logs, but when I got to the menu I found that someone else had already cleared them. He told me that he saved them, but gave no clue as to the location. That's why I'm really at a loss at the moment. If the logs were really saved, would you know of any default location I should be looking at?

Regards
Stefan Stechemesser
Honored Contributor

Re: Single Bit Error on RX3600

If online diagnostics are installed on that system, then the fpl_em monitor (which is part of the Event Monitoring Services (EMS)) also logs the MP chassis codes to files (they are switched when a size of 256 kB is reached):

/var/stm/logs/fpl.log.XX

These files contain the chassis codes in binary format (A chassis code is simply a sequence of two 8-byte longwords). For analysis choose the fpl.log.* file with the timestamp that corresponds to the issue you are examining.
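
Picking the right file by timestamp can be scripted. The following is only a sketch, assuming that a file's modification time marks the last chassis code written to it, so that the first file modified at or after the event should cover it. The function name `pick_fpl_log` is hypothetical, and Python has to be available on the box (or the files copied elsewhere with their timestamps preserved).

```python
import glob
import os
from datetime import datetime

def pick_fpl_log(log_dir, event_time):
    """Return the fpl.log.* file with the earliest mtime at or after event_time.

    Assumption: the mtime of a file is the time of its last logged chassis
    code. Returns None when no file was modified at or after the event.
    """
    target = event_time.timestamp()
    candidates = []
    for path in glob.glob(os.path.join(log_dir, "fpl.log.*")):
        mtime = os.path.getmtime(path)
        if mtime >= target:
            candidates.append((mtime, path))
    # min() compares mtime first, so it yields the oldest qualifying file.
    return min(candidates)[1] if candidates else None
```

For the event in this thread the call would be something like `pick_fpl_log("/var/stm/logs/os", datetime(2008, 3, 3, 20, 49, 53))`, with the directory adjusted to wherever your fpl.log.* files actually live.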

To view the logfile in the same format you would see on the MP, you can use the slview utility. You get an overview of the syntax by simply running

/usr/sbin/diag/contrib/slview

Here is an example:

# cd /var/stm/logs/os
# /usr/sbin/diag/contrib/slview -f fpl.log.00

Welcome to the FPL (Forward Progress Log) Viewer 1.2


The following FPL navigation commands are available:
D: Dump log starting at current block for capture and analysis
F: Display first (oldest) block
L: Display last (newest) block
J: Jump to specified entry and display previous block
+: Display next (forward in time) block
-: Display previous (backward in time) block
<CR>: Repeat previous +/- command
?: Display help
q: Exit viewer

The following event format options are available:
K: Keyword
R: Raw hex
T: Text
V: Verbose

The following event filter options are available:
A: Alert level
C: Cell
U: Unfiltered

SL (<CR>,+,-,?,F,L,J,D,K,R,T,V,A,C,U,q) >
5036 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5035 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf12f HP-UX_HEX_RUN_CODE
5034 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5033 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf32f HP-UX_HEX_RUN_CODE
5032 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5031 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf12f HP-UX_HEX_RUN_CODE
5030 HPUX 0,0,0 1 0x3f00033a00e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5029 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf12f HP-UX_HEX_RUN_CODE
5028 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5027 HPUX 0,0,0 1 0x3f00033a00e00000 0x00000000000cf22f HP-UX_HEX_RUN_CODE
5026 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5025 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf12f HP-UX_HEX_RUN_CODE
5024 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5023 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf22f HP-UX_HEX_RUN_CODE
5022 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE
5021 BMC 2 0x2147cd48b70228c0 0xff0f066f001f0300 BOOT_FINISHED
5020 HPUX 0,0,0 1 0x3f00033300e00000 0x000000000006cef8 HP-UX_START_INIT
5019 HPUX 0,0,0 1 0x3f00033a00e00000 0x00000000000cf11f HP-UX_HEX_RUN_CODE
5018 HPUX 0,0,0 1 0x3f00033200e00000 0x000000000006cef6 HP-UX_START_PAGE_OUT_DAEMON
5017 HPUX 0,0,0 1 0x3f00033100e00000 0x000000000006cef4 HP-UX_MOUNT_ROOT_FS
5016 HPUX 0,0,0 1 0x3f00032f00e00000 0x000000000006cef2 HP-UX_START_2ND_LVL_IO_CONFIG
5015 HPUX 0,0,0 1 0x3f00032e00e00000 0x000000000006cef0 HP-UX_MAIN_ENTERED

SL (<CR>,+,-,?,F,L,J,D,K,R,T,V,A,C,U,q) > q

Note: normally the system type is detected automatically. If this does not work, try:

/usr/sbin/diag/contrib/slview -f fpl.log.00 -p rx3600
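
If you capture the viewer output to a text file, the routine HP-UX_HEX_RUN_CODE heartbeat entries quickly bury anything unusual. Here is a rough Python sketch for filtering them out. The column meanings (entry number, source, optional cell/cabinet/CPU location, alert level, the two 8-byte words, keyword) are my reading of the listing above and should be taken as an assumption, as should the names `FplEntry`, `parse_line` and `interesting`. It also assumes each entry sits on one unwrapped line.

```python
import re
from typing import NamedTuple, Optional

class FplEntry(NamedTuple):
    number: int
    source: str
    location: Optional[str]  # e.g. "0,0,1"; absent on BMC lines
    alert: int
    word1: int               # first 8-byte word of the chassis code
    word2: int               # second 8-byte word of the chassis code
    keyword: str

LINE_RE = re.compile(
    r"^(\d+)\s+(\S+)\s+(?:(\d+,\d+,\d+)\s+)?(\d+)\s+"
    r"(0x[0-9a-fA-F]{16})\s+(0x[0-9a-fA-F]{16})\s+(\S+)$"
)

def parse_line(line):
    """Parse one listing line into an FplEntry, or return None if it is not an entry."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    num, src, loc, alert, w1, w2, kw = m.groups()
    return FplEntry(int(num), src, loc, int(alert), int(w1, 16), int(w2, 16), kw)

def interesting(lines, routine=("HP-UX_HEX_RUN_CODE",)):
    """Yield parsed entries whose keyword is not in the routine set."""
    for line in lines:
        entry = parse_line(line)
        if entry and entry.keyword not in routine:
            yield entry

# Three lines copied from the example listing.
SAMPLE = [
    "5022 HPUX 0,0,1 1 0x3f00033a01e00000 0x00000000000cf02f HP-UX_HEX_RUN_CODE",
    "5021 BMC 2 0x2147cd48b70228c0 0xff0f066f001f0300 BOOT_FINISHED",
    "5020 HPUX 0,0,0 1 0x3f00033300e00000 0x000000000006cef8 HP-UX_START_INIT",
]
for entry in interesting(SAMPLE):
    print(entry.number, entry.source, entry.alert, entry.keyword)
```

Prompt lines and the viewer banner do not match the pattern and are skipped, so the whole captured session can be fed in as is.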