HP-UX Server Crashing

 
GEP
Occasional Advisor

Does anyone have any idea what this points to? We had a bad CPU, which was replaced and looks good now, but this server still crashes several times a day. We see this in the logs:

HP-UX_OS_CRITICAL_SHUTDOWN

FATAL_INTERRUPT_ISR

FATAL_INTERRUPT_ISR

FATAL_INTERRUPT_IFA

FATAL_INTERRUPT_IFA

FATAL_INTERRUPT_VECTOR

FATAL_INTERRUPT_VECTOR

FATAL_INTERRUPT_ISR

FATAL_INTERRUPT_IFA

FATAL_INTERRUPT_VECTOR

                                                                                             

 

Bill Hassell
Honored Contributor

Re: HP-UX Server Crashing

It would really help if the computer you are using was identified. Is this perhaps a vPAR and not a simple system?



Bill Hassell, sysadmin
GEP
Occasional Advisor

Re: HP-UX Server Crashing

This is an ia64 HP Superdome server, SD32B.

Bill Hassell
Honored Contributor

Re: HP-UX Server Crashing

For a large machine such as the superdome, it is common to have multiple partitions, each with their own OS. When the system is running, can you post the output from this command:

# vparstatus



Bill Hassell, sysadmin
GEP
Occasional Advisor

Re: HP-UX Server Crashing

Will get that. I see this repeated often in the MP logs:


91660 SFW 0,6,7 0 0300113067e00000 2019080500062518 CMC_VALID_LOG

91659 SFW 0,6,7 2 4e801ca367e00000 088000000000304c UARCH_CHECK_INFO

91659 08/05/2019 07:49:22
91658 SFW 0,6,7 2 4e801c9f67e00000 0880098000220521 CACHE_CHECK_INFO

91658 08/05/2019 07:49:22
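
To see at a glance which CPU the firmware keeps reporting, the compact MP log lines can be tallied by location and keyword. The following is a minimal sketch, not an HP tool: the field layout (entry number, source, cabinet,slot,cpu, alert level, two data words, keyword) is an assumption inferred from the lines quoted above, and `tally_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout of a compact MP log line, inferred from the quoted lines:
# <entry> <source> <cabinet,slot,cpu> <alert level> <data word> <data word> <keyword>
LINE_RE = re.compile(
    r"^(?P<entry>\d+)\s+(?P<source>\w+)\s+"
    r"(?P<cabinet>\d+),(?P<slot>\d+),(?P<cpu>\d+)\s+"
    r"(?P<alert>\d+)\s+(?P<data1>[0-9a-fA-F]{16})\s+(?P<data2>[0-9a-fA-F]{16})\s+"
    r"(?P<keyword>[A-Z0-9_]+)$"
)


def tally_events(log_text):
    """Count events per ((cabinet, slot, cpu), keyword) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skips blank lines and the timestamp-only lines
        location = (int(match["cabinet"]), int(match["slot"]), int(match["cpu"]))
        counts[(location, match["keyword"])] += 1
    return counts


# Sample copied from the post above.
SAMPLE = """\
91660 SFW 0,6,7 0 0300113067e00000 2019080500062518 CMC_VALID_LOG
91659 SFW 0,6,7 2 4e801ca367e00000 088000000000304c UARCH_CHECK_INFO
91659 08/05/2019 07:49:22
91658 SFW 0,6,7 2 4e801c9f67e00000 0880098000220521 CACHE_CHECK_INFO
91658 08/05/2019 07:49:22
"""

if __name__ == "__main__":
    for (location, keyword), n in sorted(tally_events(SAMPLE).items()):
        print(location, keyword, n)
```

If every count lands on the same (cabinet, slot, cpu) tuple, as it does for the sample, that location is the natural first suspect to raise with hardware support.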

GEP
Occasional Advisor

Re: HP-UX Server Crashing

Here is the vparstatus output; only one partition is in use.

 

[Complex]

   Complex Name :

   Complex Capacity

     Compute Cabinet (8 cell capable) : 1

   Active MP Location : cabinet 0

   Original Product Name : superdome server SD32B

   Original Serial Number :

   Current Product Order Number : A9834A

   OEM Manufacturer :

   Complex Profile Revision : 1.0

   The total number of partitions present : 1

   GSM sharing : Disabled Complex-wide

 

 

[Cabinet]

                  Cabinet   I/O       Bulk Power  Backplane

                  Blowers   Fans      Supplies    Power Boards

                  OK/       OK/       OK/         OK/

Cab               Failed/   Failed/   Failed/     Failed/

Num Cabinet Type  N Status  N Status  N Status    N Status       MP

=== ============  ========= ========= ==========  ============   ======

0   8 cell slot   4/0/N+    5/0/N+    6/0/N+      2/0/N+         Active

 

 

Notes: N+ = There are one or more spare items (fans/power supplies).

       N  = The number of items meets but does not exceed the need.

       N- = There are insufficient items to meet the need.

       ?  = The adequacy of the cooling system/power supplies is unknown.

       HO = Housekeeping only; The power is in a standby state.

       NA = Not Applicable.

 

 

[Cell]

                        CPU     Memory                                Use

                        OK/     (GB)                          Core    On

Hardware   Actual       Deconf/ OK/                           Cell    Next Par

Location   Usage        Max     Deconf    Connected To        Capable Boot Num

========== ============ ======= ========= =================== ======= ==== ===

cab0,cell0 Active Core  8/0/8   32.0/0.0  cab0,bay1,chassis3  yes     yes  0

 

cab0,cell1 Active Base  8/0/8   32.0/0.0  -                   no      yes  0

 

cab0,cell2 Active Base  8/0/8   32.0/0.0  cab0,bay0,chassis1  yes     yes  0

 

cab0,cell3 Active Base  8/0/8   32.0/0.0  -                   no      yes  0

 

cab0,cell4 Active Base  8/0/8   32.0/0.0  -                   no      yes  0

 

cab0,cell5 Active Base  8/0/8   32.0/0.0  -                   no      yes  0

 

cab0,cell6 Active Base  8/0/8   32.0/0.0  -                   no      yes  0

 

cab0,cell7 Active Base  8/0/8   32.0/0.0  -                   no      yes  0

 

 

Notes: * = Cell has no interleaved memory.

 

 

[Chassis]

                                 Core Connected  Par

Hardware Location   Usage        IO   To         Num

=================== ============ ==== ========== ===

cab0,bay0,chassis0  Absent       -    -          -

 

cab0,bay0,chassis1  Active       -    cab0,cell2 0

 

cab0,bay0,chassis2  Absent       -    -          -

 

cab0,bay0,chassis3  Absent       -    -          -

 

cab0,bay1,chassis0  Absent       -    -          -

 

cab0,bay1,chassis1  Absent       -    -          -

 

cab0,bay1,chassis2  Absent       -    -          -

 

cab0,bay1,chassis3  Active       -    cab0,cell0 0

 

 

 

[Partition]

Par              # of  # of I/O

Num Status       Cells Chassis  Core cell  Partition Name (first 30 chars)

=== ============ ===== ======== ========== ===============================

0   Active       8     2        cab0,cell0

 

 

 

[Partition - HyperThread]

Par Num      Hyperthreading Enabled  Hyperthreading Active

=======      ======================  =====================

0            yes                     yes            

GEP
Occasional Advisor

Re: HP-UX Server Crashing

This may be useful as well

 

MP:VWR (<cr>,<sp>,+,-,?,F,L,J,D,K,R,T,A,C,U,<Ctrl-b>) > 

Log Entry 40714:    08/05/2019 21:46:03

Alert level 2:  Informational

Keyword:  CACHE_CHECK_INFO

It indicates a cache error occurred and data field contains check info data.

Reporting Entity:  System Firmware located in cabinet 0, slot 6, cpu 7

Problem Detail:  0x188008c700220521

0x4e801c9f67e00053 0x188008c700220521

0x4b001c9f67e00054 0x010000005d48a39b

 

Log Entry 40713:    08/05/2019 21:46:03

Alert level 2:  Informational

Keyword:  CACHE_CHECK_INFO

It indicates a cache error occurred and data field contains check info data.

Reporting Entity:  System Firmware located in cabinet 0, slot 6, cpu 7

Problem Detail:  0x0880000000000900

0x4e801c9f67e00051 0x0880000000000900

0x4b001c9f67e00052 0x010000005d48a39b

Bill Hassell
Honored Contributor

Re: HP-UX Server Crashing

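Clean out /var/adm/crash, then after the next crash, run crashinfo -v -c -H >/tmp/crashinfo.html 2>&1 (the file name is only an example; the redirect needs a file, not the /tmp directory itself).
crashinfo is found in /usr/bin, /usr/local, or /usr/contrib.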

The html file in /tmp should provide the details you need to analyze the crash.



Bill Hassell, sysadmin
GEP
Occasional Advisor

Re: HP-UX Server Crashing

When you say ‘clean out’ /var/adm/crash are you wanting us to remove existing crash files or all files?

Currently, this is what the directory contains:

-rw------- 1 root sys 86 Aug 2 10:22 .sh_history

-rwxr-xr-x 1 root root 2 Aug 7 08:15 bounds

drwxr-xr-x 2 root root 8192 Aug 6 12:53 crash.52

drwxr-xr-x 3 root root 8192 Aug 7 16:22 crash.53

-r-xr-xr-x 1 root sys 14776 Jul 28 19:48 mca_190728.1

-r-xr-xr-x 1 root sys 14776 Aug 5 07:27 mca_19082.1

-r-xr-xr-x 1 root sys 14776 Aug 5 15:06 mca_19085.1

-rw------- 1 root sys 2276 Aug 2 10:22 typescript

 

Bill Hassell
Honored Contributor

Re: HP-UX Server Crashing

Since you have a recent set of crash files, just remove crash.52. You need space in /var/adm/crash to run crashinfo.
You can then run crashinfo.
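
Before deleting anything, it can help to see how much each dump directory occupies and how much space is left. The sketch below only reads and reports; it assumes dump directories are named `crash.N` as in the listing above, and `list_crash_dumps`, `dir_size_bytes` and `report` are hypothetical helper names rather than part of any HP-UX tool.

```python
import os
import re
import shutil

CRASH_DIR_RE = re.compile(r"^crash\.(\d+)$")


def list_crash_dumps(crash_root):
    """Return (number, path) for each crash.N directory, oldest number first."""
    dumps = []
    for name in os.listdir(crash_root):
        match = CRASH_DIR_RE.match(name)
        path = os.path.join(crash_root, name)
        if match and os.path.isdir(path):
            dumps.append((int(match.group(1)), path))
    return sorted(dumps)


def dir_size_bytes(path):
    """Sum the sizes of regular files below path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def report(crash_root="/var/adm/crash"):
    """Print free space and per-dump sizes; nothing is removed."""
    if not os.path.isdir(crash_root):
        print(f"{crash_root} not found")
        return
    print(f"free space: {shutil.disk_usage(crash_root).free} bytes")
    dumps = list_crash_dumps(crash_root)
    for number, path in dumps:
        print(f"crash.{number}: {dir_size_bytes(path)} bytes")
    if len(dumps) > 1:
        older = ", ".join(path for _, path in dumps[:-1])
        print(f"older dumps that could be removed: {older}")


if __name__ == "__main__":
    report()
```

With the listing above, this would report crash.52 as the older dump and leave crash.53 as the one to analyze.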



Bill Hassell, sysadmin