Jeff_Traigle
Honored Contributor

AutoRAID and EMS

Does anyone know why systems with an AutoRAID attached would send the following EMS message whenever make_tape_recovery is run? The two systems attached to this array appear to be the only ones that do this. (We have two other clusters with a similar configuration that do not generate the errors.) From the system's perspective everything keeps working fine, but EMS complains, which triggers erroneous pages, which makes all of us support people unhappy.

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Wed Jun 14 09:09:13 2006

hostname sent Event Monitor notification information:

/storage/events/disk_arrays/AutoRAID/00000037E30A
is >= 3.
Its current value is SERIOUS(4).



Event data from monitor:

Event Time..........: Wed Jun 14 09:09:13 2006
Severity............: SERIOUS
Monitor.............: armmon
Event #.............: 101
System..............: hostname

Summary:
Disk Array at hardware path : Device removed from monitoring


Description of Error:

The device has been removed from the list of devices being monitored by
this monitor.

Probable Cause / Recommended Action:

The device was removed from the system, has stopped responding to the
system or it has been replaced with a device that is not supported by this
monitor.
Run ioscan to determine the state and type of the device.
Check the /var/stm/data/os_decode_xref for the information indicating
which devices are supported by this monitor.
Check other monitors to determine if they are now monitoring the
device by running /etc/opt/resmon/lbin monconfig and the using the
Check monitoring command.

Additional Event Data:
System IP Address...: XXX.XXX.XXX.XXX
Event Id............: 0x4490188900000001
Monitor Version.....: B.01.01
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_armmon.clcfg
Client Configuration File Version...: A.01.01
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L2000-44
EMS Version.....................: A.04.20
STM Version.....................: A.52.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/armmon.htm#101

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v




>---------- End Event Monitoring Service Event Notification ----------<
--
Jeff Traigle
7 REPLIES
Phillip Thayer
Esteemed Contributor

Re: AutoRAID and EMS

Just a SWAG, but is it possible that when make_tape_recovery runs it has to quiesce the RAID or the SCSI bus in order to write the tape, causing the RAID to appear momentarily as if it were offline or unavailable? I remember having problems with this type of behavior on other OSes in the past with SCSI buses.

Phil
Once it's in production it's all bugs after that.
A. Clay Stephenson
Acclaimed Contributor

Re: AutoRAID and EMS

I suppose that I would start by doing an arraydsp -a on both your "good" and "bad" configurations and look for differences in array settings and firmware levels. I still run 4 of these old AutoRAID's and run 'em hard and back 'em up nitely but I don't see these alerts. I would also do a pvdisplay on each of your LUN's and make sure that the IO Timeouts are in the 120 second range rather than the default 30 second value.
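
The pvdisplay check could be scripted roughly like this. It is only a sketch: it assumes pvdisplay prints a line of the form "IO Timeout (Seconds)   <value>", the helper name io_timeout_of is made up for illustration, and the device path is a placeholder rather than one of the LUNs in this thread.

```shell
#!/bin/sh
# Extract the IO timeout value from pvdisplay-style output read on stdin.
# Assumption: the relevant line starts with "IO Timeout" and the value
# ("default" or a number of seconds) is the last field on that line.
io_timeout_of() {
    awk '/^IO Timeout/ { print $NF }'
}

# On an HP-UX host (placeholder device path):
#   pvdisplay /dev/dsk/c4t0d0 | io_timeout_of
# Illustration with made-up input:
printf 'PV Name   /dev/dsk/c4t0d0\nIO Timeout (Seconds)   default\n' | io_timeout_of
```

A value of "default" means no explicit timeout has been set on that physical volume.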
If it ain't broke, I can fix that.
Jeff_Traigle
Honored Contributor

Re: AutoRAID and EMS

Firmware is at HP62 on all of our AutoRAIDs. I don't see any differences between this array and the others offhand. pvdisplay shows the default IO timeout on all of our LUNs on all of our AutoRAIDs.

One thing is different in this cluster than our others though, for what it's worth. We have 2 AutoRAIDs connected to the nodes. The other clusters using AutoRAID only have one. Both nodes in the cluster with the two AutoRAIDs report this EMS error on the same array.
--
Jeff Traigle
Steven E. Protter
Exalted Contributor

Re: AutoRAID and EMS

Shalom,

I recommend running interactive cstm/mstm or xstm and exercising things there. This could be a disk problem getting ready to happen.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
A. Clay Stephenson
Acclaimed Contributor

Re: AutoRAID and EMS

2 of my 12H's are connected to a single server and 2 are hung between 2 servers in an MC/SG cluster so I doubt the number of host connections is relevant. I would go ahead and run pvchange on each of the LUN's to set the IO Timeout to about 120 seconds - the suggested value for array LUN's. Note that these boxes share common cables and common terminators (inline terminators, I assume) so it's possible that you have something no more complicated than a slightly loose cable or a flaky terminator.
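
A dry-run sketch of the pvchange step, in case it helps. It only prints the commands so they can be reviewed before anything is changed. The function name and the device paths are placeholders, not taken from this thread, and the 120 second value is the one suggested above.

```shell
#!/bin/sh
# Print (do not run) one pvchange command per physical volume path given.
# pvchange -t sets the IO timeout, in seconds, for a physical volume.
TIMEOUT=120

build_pvchange_cmds() {
    for pv in "$@"; do
        echo "pvchange -t $TIMEOUT $pv"
    done
}

# Placeholder device paths; substitute the AutoRAID LUNs from your own system.
build_pvchange_cmds /dev/dsk/c4t0d0 /dev/dsk/c4t0d1
```

Once the printed commands look right, they can be run as root on each node and the result confirmed with pvdisplay.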
If it ain't broke, I can fix that.
A. Clay Stephenson
Acclaimed Contributor

Re: AutoRAID and EMS

I assume that you have done an arraylog -e ARRAYID and looked for events. I would also run logprint looking for types disk, ctrlr, and perf messages. Man logprint for details.
If it ain't broke, I can fix that.
Andrew Merritt_2
Honored Contributor

Re: AutoRAID and EMS

Hi Jeff,
Are there any other, possibly lower severity, EMS events being reported for the autoraid?

I have seen a similar problem with ordinary disk drives, where disk_em was reporting 101 events. In that case, the disks were somehow going into a NO_HW state (as seen in 'ioscan'), but we never got to the bottom of why that was.
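
To catch devices in that state while make_tape_recovery is running, something like the following could be run in a loop. This is a sketch only: it assumes the usual "ioscan -f" column order (Class, I, H/W Path, Driver, S/W State, H/W Type, Description), the helper name no_hw_paths is made up, and the sample lines are invented for illustration.

```shell
#!/bin/sh
# Print the hardware path of every ioscan line whose S/W State is NO_HW.
# Assumption: S/W State is the 5th whitespace-separated field and the
# hardware path is the 3rd, as in "ioscan -f" output.
no_hw_paths() {
    awk '$5 == "NO_HW" { print $3 }'
}

# On an HP-UX host:
#   ioscan -fnC disk | no_hw_paths
# Made-up sample input:
no_hw_paths <<'EOF'
disk  3  0/4/0/0.0.0  sdisk  CLAIMED  DEVICE  sample-disk
disk  4  0/4/0/0.1.0  sdisk  NO_HW    DEVICE  sample-disk
EOF
```

The device file lines that ioscan -n prints under each entry have fewer than five fields, so the filter skips them.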

Just to be clear, the EMS HW monitor, armmon, is not causing this problem, it is reporting the fact that the device appears to no longer be present, and therefore can't be monitored.

What did you see when you ran ioscan, as the text recommends?

Andrew