
EMS Event Notification

 
Moez Alibhai
Advisor

EMS Event Notification

Hello
I have an rp7410 running HP-UX 11i, and I get the following errors in the syslog file. Does anyone have any idea what the problem could be?
Value: "MAJORWARNING (3)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715201 -a
12 REPLIES
David Burgess
Esteemed Contributor

Re: EMS Event Notification

Hi,

Did you run

/opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715201 -a

What did you get?

Regards,

Dave.
Moez Alibhai
Advisor

Re: EMS Event Notification

Nothing happens; it just returns to the root prompt. Do you have an email ID so I can mail you the syslog file?
David Burgess
Esteemed Contributor

Re: EMS Event Notification

Are you not able to post the relevant section as an attachment?

Regards,

Dave.
RAC_1
Honored Contributor

Re: EMS Event Notification

Run dmesg.

When did this error happen? Grep for EMS in syslog.log.

Anil
There is no substitute to HARDWORK
Moez Alibhai
Advisor

Re: EMS Event Notification

Mar 8 01:41:11 sdp1 EMS [27015]: ----- EMS Monitor Restart ----- Title: disk_em Command: /usr/sbin/stm/uut/bin/tools/monitor/disk_em Vendor: Hewlett-Packard Company Version: B.01.01 To obtain a list of currently monitored resources, execute the following: /opt/resmon/bin/resdata -M 1533300606
Mar 8 07:11:29 sdp1 EMS [3200]: ------ EMS Event Notification ------ Value: "SERIOUS (4)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715436 -a
Mar 8 08:42:10 sdp1 EMS [9353]: ----- EMS Monitor Restart ----- Title: disk_em Command: /usr/sbin/stm/uut/bin/tools/monitor/disk_em Vendor: Hewlett-Packard Company Version: B.01.01 To obtain a list of currently monitored resources, execute the following: /opt/resmon/bin/resdata -M 1533300606
Mar 8 08:49:12 sdp1 su: + 1 root-snapadm
Mar 8 09:41:32 sdp1 EMS [3200]: ------ EMS Event Notification ------ Value: "MAJORWARNING (3)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715437 -a
Mar 8 15:43:08 sdp1 EMS [25438]: ----- EMS Monitor Restart ----- Title: disk_em Command: /usr/sbin/stm/uut/bin/tools/monitor/disk_em Vendor: Hewlett-Packard Company Version: B.01.01 To obtain a list of currently monitored resources, execute the following: /opt/resmon/bin/resdata -M 1533300606
Mar 8 22:44:05 sdp1 EMS [11753]: ----- EMS Monitor Restart ----- Title: disk_em Command: /usr/sbin/stm/uut/bin/tools/monitor/disk_em Vendor: Hewlett-Packard Company Version: B.01.01 To obtain a list of currently monitored resources, execute the following: /opt/resmon/bin/resdata -M 1533300606
Mar 9 01:56:48 sdp1 EMS [3200]: ------ EMS Event Notification ------ Value: "MAJORWARNING (3)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715438 -a
Mar 9 05:45:02 sdp1 EMS [28028]: ----- EMS Monitor Restart ----- Title: disk_em Command: /usr/sbin/stm/uut/bin/tools/monitor/disk_em Vendor: Hewlett-Packard Company Version: B.01.01 To obtain a list of currently monitored resources, execute the following: /opt/resmon/bin/resdata -M


...................


RAC_1
Honored Contributor

Re: EMS Event Notification

What does /opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715438 -a
return?

Anil
There is no substitute to HARDWORK
David Burgess
Esteemed Contributor

Re: EMS Event Notification

I'm guessing you have a disk problem. Do you have any SCSI lbolt errors in the syslog or dmesg output? Do any of the commands it says to run give you more info? Do you have any obviously dead disks? Red lights, etc.?

Regards,

Dave.
Shaikh Imran
Honored Contributor

Re: EMS Event Notification

Hi,
What do you get by running this?

/opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw -n 209715201 -a

This will definitely give you a valid explanation. Don't drop even a single digit from the command above.
Also, check root's mail; I believe the output is mailed to root as well.
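
To avoid retyping the command by hand, the exact string can be pulled out of the syslog text. The following is only a rough sketch: the regular expression simply mirrors the command format seen in the syslog entries in this thread, and the helper name `resdata_commands` is made up for illustration.

```python
import re

# Pattern mirrors the resdata invocation printed in the EMS syslog entries.
RESDATA = re.compile(r"/opt/resmon/bin/resdata -R \d+ -r \S+ -n \d+ -a")

def resdata_commands(syslog_text):
    """Return each distinct resdata command, in order of first appearance."""
    seen = []
    for cmd in RESDATA.findall(syslog_text):
        if cmd not in seen:
            seen.append(cmd)
    return seen

# One notification entry copied from the syslog excerpt in this thread.
entry = (
    'Mar 8 09:41:32 sdp1 EMS [3200]: ------ EMS Event Notification ------ '
    'Value: "MAJORWARNING (3)" for Resource: "/system/events/core_hw/core_hw" '
    '(Threshold: >= " 3") Execute the following command to obtain event details: '
    '/opt/resmon/bin/resdata -R 209715202 -r /system/events/core_hw/core_hw '
    '-n 209715437 -a'
)

commands = resdata_commands(entry)
for cmd in commands:
    print(cmd)
```

Each printed line can then be pasted at the root prompt as-is. Note that the `-n` value differs from one notification to the next, so use the command from the entry you are actually investigating.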

Regards,
I'll sleep when i am dead.
Andrew Merritt_2
Honored Contributor

Re: EMS Event Notification

The events should also be logged to the /var/opt/resmon/log/event.log file, so have a look in there to see the full event details. What you'll need to do next depends on what events dm_core_hw is actually logging.

You seem to have two possible problems: whatever dm_core_hw is logging, and the fact that disk_em, the disk monitor, is restarting every 7 hours.
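
The 7-hour pattern can be checked from the syslog timestamps. The sketch below is a rough illustration, not an EMS tool: the function name `restart_intervals` is hypothetical, the sample lines are the disk_em restart entries from the excerpt above shortened after the Title field, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Leading syslog timestamp, e.g. "Mar 8 01:41:11".
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")

def restart_intervals(lines, monitor="disk_em", year=2004):
    """Hours between consecutive 'EMS Monitor Restart' entries for one monitor.

    The year is an assumption: syslog lines omit it.
    """
    times = []
    for line in lines:
        if "EMS Monitor Restart" not in line or f"Title: {monitor}" not in line:
            continue
        m = STAMP.match(line)
        if m is None:
            continue
        month, day, clock = m.groups()
        times.append(
            datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        )
    times.sort()
    return [round((b - a).total_seconds() / 3600, 2) for a, b in zip(times, times[1:])]

# Restart entries from the posted syslog excerpt, shortened after the Title field.
sample = [
    "Mar 8 01:41:11 sdp1 EMS [27015]: ----- EMS Monitor Restart ----- Title: disk_em",
    "Mar 8 08:42:10 sdp1 EMS [9353]: ----- EMS Monitor Restart ----- Title: disk_em",
    "Mar 8 15:43:08 sdp1 EMS [25438]: ----- EMS Monitor Restart ----- Title: disk_em",
    "Mar 8 22:44:05 sdp1 EMS [11753]: ----- EMS Monitor Restart ----- Title: disk_em",
    "Mar 9 05:45:02 sdp1 EMS [28028]: ----- EMS Monitor Restart ----- Title: disk_em",
]

hours = restart_intervals(sample)
print(hours)
```

A steady interval like this points to something accumulating inside the monitor rather than a random hardware fault, which is why the version check below matters.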

What version of OnlineDiags do you have? Run 'cstm' and see what revision it shows; there were some problems with disk_em on some older versions. Check that you have a recent version of OnlineDiags with the latest patch for that version applied. For example, there was a problem with A.31.00 (HWE0203) which led to disk_em restarting every 6 or 7 hours due to a file descriptor leak. This was fixed in subsequent revisions, and by PHSS_27526 (now superseded by PHSS_28616) for that revision.

If you look in /etc/opt/resmon/log/api.log, there should be some trace indicating why disk_em is exiting.

http://www.docs.hp.com/hpux/onlinedocs/diag/stm/stm_upd.htm#table shows the revision ids.

Andrew
Moez Alibhai
Advisor

Re: EMS Event Notification

I get the following message:
CURRENT MONITOR DATA:

Event Time..........: Thu Mar 11 06:27:40 2004
Severity............: MAJORWARNING
Monitor.............: dm_core_hw
Event #.............: 80
System..............: sdp1

Summary:
Multiple cell controller to cell controller link errors in short
duration


Description of Error:

The cell controller (CC) chip has detected and corrected multiple errors
in data transferred to it from the cell controller to which it is
connected during a short time duration.

Probable Cause / Recommended Action:

There may be a problem with the cell controller.
Contact your HP support representative to check the cell board.

Additional Event Data:
System IP Address...: 192.1.1.1
Event Id............: 0x404fdcac00000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 10
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/rp7410
OS Version......................: B.11.11
STM Version.....................: A.31.00
EMS Version.....................: A.03.20
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#80

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v


FRU Physical Location: 0x00ffff01ffffff93
FRU Source = 9 (cell)
Source Detail = 3 (coherency controller)
Cabinet Location = 0
Cell Location = 1

XIN_SEC_MODE..............: 0x0000000000000030
Link parity error on late 72 bits of data. The link identified in this
event had detected an error, but may not be the cause of it.
Link parity error on early 72 bits of data. The link identified in this
event had detected an error, but may not be the cause of it.
Andrew Merritt_2
Honored Contributor

Re: EMS Event Notification

Regarding the dm_core_hw event, I'd recommend following the suggested action and contacting your support representative as I'm not a hardware expert.

The event data also shows:
STM Version.....................: A.31.00

which means that the disk_em problem is what I referred to above, and you need to upgrade or install the patch to stop disk_em dying.
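
If you have several systems to check, the same reading of the event data can be scripted. This is a minimal sketch that assumes the dotted "Key....: value" layout shown in the resdata output above; `parse_event_fields` and `AFFECTED_STM` are names invented for the example, and the only affected revision listed is the one named in this thread.

```python
import re

# "Key.......: value" lines, as printed in the resdata event details.
FIELD = re.compile(r"^\s*([^.:]+?)\.{2,}:\s*(.*)$")

def parse_event_fields(text):
    """Collect dotted key/value lines from event output into a dict."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            fields[m.group(1).strip()] = m.group(2).strip()
    return fields

# Only the revision named in this thread; extend the set at your own risk.
AFFECTED_STM = {"A.31.00"}

# Lines copied from the event details posted above.
event_text = """\
Severity............: MAJORWARNING
Monitor.............: dm_core_hw
Event #.............: 80
STM Version.....................: A.31.00
EMS Version.....................: A.03.20
"""

fields = parse_event_fields(event_text)
needs_patch = fields.get("STM Version") in AFFECTED_STM
print(fields["Monitor"], fields["Event #"], "patch needed:", needs_patch)
```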

Andrew
Andrew Merritt_2
Honored Contributor

Re: EMS Event Notification

Just rechecked the details in that patch (PHSS_28616), and there is an explanation for the dm_core_hw events too:

JAGae23180
In an rp7410 system with one cell partitions, the dm_core_hw EMS hardware monitor may generate events #79, 80, 81, 82 or 83. This will typically occur when the other cell is powered down or not present.

If this matches your system, you'll need to update the firmware as well as OnlineDiags to stop the events being generated.

Andrew