
Re: ICE-Linux mond issues with mdadm

 
Dave McLean
Occasional Advisor

ICE-Linux mond issues with mdadm

I have installed ICE-Linux 2.11. After running Options-->Configure ICE-Linux Management Services on RHEL 5 nodes, mond starts up and the following Critical alerts occur every 15 minutes.

Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md0
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md2
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md1
Nov 4 14:56:59 usorl03p307 mdadm: DeviceDisappeared /dev/md0


Stopping mond stops the messages.
/etc/init.d/mond stop
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Dave,

These critical alerts are associated with the "Syslog Alerts" Service, correct?

I'd like to see if I can reproduce this. What version of RH5 do you have installed on your managed nodes (e.g. 32bit or 64bit; update 1 or 2)?

If you're not interested in seeing these mdadm critical alerts you should be able to stop the alerts by modifying the /opt/hptc/nagios/etc/syslogAlertRules file.

Try this and let me know if the alerts stop.

Edit syslogAlertRules (make a backup copy first) and change the mdadm rule to look as follows (i.e. add DeviceDisappeared to the list of mdadm events to ignore).

rule mdadm_errors {
name (! /(NewArray)|(SparesMissing) (DeviceDisappeared)/)
relevance ($subsystem =~ /mdadm/)
format "$timestamp $message"
}

Thanks,
Donna





Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Thanks for the quick reply, Donna. The HP case number for this issue is 4606099605. There are lots of logs and sysreports attached to the case if you can pull it up.

The RHEL version on the node is RHEL 5.4 x86_64 on BL495G5 blades in C7000 chassis.

Have been working with Mitch on other issues also but not this one.

We are interested in seeing valid mdadm alerts, but these are not valid and start only after mond is started.

I will make your suggested changes and report back.
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

By chance, should there be a "|" between (SparesMissing) and (DeviceDisappeared)?

Maybe it should be: (SparesMissing)|(DeviceDisappeared)/)
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Yes. You need to add the |.

rule mdadm_errors {
name (! /(NewArray)|(SparesMissing)|(DeviceDisappeared)/)
relevance ($subsystem =~ /mdadm/)
format "$timestamp $message"
}
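
To illustrate why the missing "|" matters, here is a minimal sketch in Python. It assumes the rule file uses Perl-style regular expressions that behave like Python's `re` module for these simple patterns; the helper name `is_ignored` is made up for this example and is not part of ICE-Linux.

```python
import re

# Pattern as first posted: a space, not "|", separates the last two groups,
# so the second alternative only matches the literal text
# "SparesMissing DeviceDisappeared".
BROKEN = re.compile(r"(NewArray)|(SparesMissing) (DeviceDisappeared)")

# Pattern with the "|" added: three independent alternatives.
FIXED = re.compile(r"(NewArray)|(SparesMissing)|(DeviceDisappeared)")


def is_ignored(pattern, event_name):
    """Return True if the event name matches the ignore list (hypothetical helper)."""
    return pattern.search(event_name) is not None


if __name__ == "__main__":
    for event in ("NewArray", "SparesMissing", "DeviceDisappeared", "Fail"):
        print(event, "broken:", is_ignored(BROKEN, event),
              "fixed:", is_ignored(FIXED, event))
```

Under that assumption, the pattern without the "|" ignores neither DeviceDisappeared nor SparesMissing on its own, which is why the extra "|" is needed.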


Donna
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

And I should have noted that after making this edit you will still get mdadm alerts, just not DeviceDisappeared alerts.

Donna
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

The change did stop the alerts, but /var/log/messages is still filling up with the bogus messages, which appear every 15 minutes once the mond service is started.

mond -> /opt/hptc/supermon/etc/init.d/mond-setup

With mond stopped, no more messages are generated in /var/log, so something that ICE-Linux (supermon) is doing is causing the messages in the first place.

We need to find the root cause of the messages.
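
One way to confirm the 15-minute pattern is to pull the mdadm lines out of the log and measure the gap between bursts. The following is a rough sketch, assuming the syslog line layout shown earlier in this thread; the function names, the 60-second burst window, and the placeholder year (syslog timestamps carry no year) are all choices made for this example.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md0
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"mdadm:\s+(?P<event>\S+)\s+(?P<device>\S+)"
)


def parse_mdadm_events(lines, year=2000):
    """Return (timestamp, host, event, device) tuples for mdadm syslog lines.

    The year is a placeholder because syslog lines do not include one.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an mdadm event line
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        events.append((when, m["host"], m["event"], m["device"]))
    return events


def burst_gaps(events, burst_window=60):
    """Seconds between the starts of consecutive bursts of events."""
    starts = []
    for when, *_ in sorted(events):
        # An event more than burst_window seconds after the current
        # burst start opens a new burst.
        if not starts or (when - starts[-1]).total_seconds() > burst_window:
            starts.append(when)
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    sample = [
        "Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md0",
        "Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md2",
        "Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md1",
        "Nov 4 14:56:59 usorl03p307 mdadm: DeviceDisappeared /dev/md0",
    ]
    for event in parse_mdadm_events(sample):
        print(event)
```

Feeding it the contents of /var/log/messages instead of the sample lines and calling `burst_gaps` on the result should show gaps close to 900 seconds if the messages really follow the collection period.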

I can provide you a virtual room connection if it would help.
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Here's what's happening inside Nagios/supermon.

On the CMS, open /opt/hptc/nagios/etc/nagios_vars.ini in vi. In this file you will see mdadminfo and MDADMCOLLECTIONPERIOD.

MDADMCOLLECTIONPERIOD is set to 15 minutes, which means that on the target nodes supermon will call /opt/hptc/mdadm/sbin/getMdadmEvents every 15 minutes. You can change this collection period to anything you like.

If you log in to one of your target nodes, you can look at /opt/hptc/mdadm/sbin/getMdadmEvents, which calls mdadm-handler. mdadm-handler sends all messages returned by /sbin/mdadm to syslog.

We recently fixed an issue for our next ICE-Linux release (V6.0) where this script was failing because it was being run as the nagios user and not root, so I'm wondering if you're hitting that issue.

Can you run a test for me? On the target node, (as root) run /opt/hptc/mdadm/sbin/getMdadmEvents and tail /var/log/messages and let me know what you see.

Then log in as nagios (su - nagios), run getMdadmEvents again, and let me know what you see in /var/log/messages.

In regards to the DeviceDisappeared event, do you think that /sbin/mdadm is incorrectly reporting this error? Or has the device really disappeared?

One workaround I can think of is to modify mdadm-handler so that it checks for the DeviceDisappeared event and does not call syslog for it.
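
A minimal sketch of that kind of filter follows. It is not the actual mdadm-handler shipped with ICE-Linux, whose contents are not shown in this thread; it only assumes the handler receives the arguments that mdadm passes to a monitor program (event name, md device, and optionally a component device). The names `SUPPRESSED_EVENTS`, `should_log`, and `handle_event` are invented for the example.

```python
import sys

# Events that should not be forwarded to syslog (assumption for this sketch).
SUPPRESSED_EVENTS = {"DeviceDisappeared"}


def should_log(event):
    """Return True unless the event is on the suppression list."""
    return event not in SUPPRESSED_EVENTS


def handle_event(argv, log):
    """Forward one mdadm event to `log` unless it is suppressed.

    argv is [event, md_device] or [event, md_device, component_device].
    Returns True if the event was logged, False otherwise.
    """
    if len(argv) < 2:
        return False  # malformed call, nothing to report
    if not should_log(argv[0]):
        return False
    log(" ".join(argv))
    return True


if __name__ == "__main__":
    import syslog

    syslog.openlog("mdadm")
    handle_event(sys.argv[1:], lambda msg: syslog.syslog(syslog.LOG_CRIT, msg))
```

Note that a filter like this also hides a genuine DeviceDisappeared event, so it treats the symptom rather than the cause.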

Donna
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

I ran getMdadmEvents as both root and nagios. When run as root, no messages are generated in /var/log/messages.

When run as nagios, each invocation of getMdadmEvents generates:

Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md0
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md2
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md0

I believe the messages are bogus and the devices are NOT disappearing.
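
Since the messages appear only when the script runs as nagios, one hypothesis worth testing (not confirmed anywhere in this thread) is that the nagios user simply cannot open the md devices. The sketch below checks read access for the current user; the device list comes from the log lines above, and the helper name `unreadable_devices` is made up for this example.

```python
import os


def unreadable_devices(devices, can_read=lambda path: os.access(path, os.R_OK)):
    """Return the devices the current user cannot open for reading.

    os.access checks against the real uid/gid, which matches the
    effective uid after a plain "su - nagios".
    """
    return [dev for dev in devices if not can_read(dev)]


if __name__ == "__main__":
    md_devices = ["/dev/md0", "/dev/md1", "/dev/md2"]
    print("effective uid:", os.geteuid())
    for dev in unreadable_devices(md_devices):
        print("cannot open for reading:", dev)
```

Running it once as root and once as nagios would show whether device permissions differ between the two accounts.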

dave

William Athanasiou
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Could you provide a description of your hardware and installation? Are you using software RAID? How many disks are installed? Is it possible you have a disk in the machine that used to be part of a SW RAID set? If you have an /etc/mdadm.conf file, can you include the contents?

I realize that's a lot of questions, but I'm just trying to figure out why mdadm would be reporting the error.
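
For context, mdadm reports DeviceDisappeared when an array that was previously configured no longer appears to be configured. A quick way to gather the information asked for above is to compare the ARRAY lines in /etc/mdadm.conf with the arrays listed in /proc/mdstat. The sketch below assumes the full ARRAY keyword and /dev/mdN style names; the function names are invented for this example.

```python
import re


def configured_arrays(mdadm_conf_text):
    """Return the device names on ARRAY lines of an mdadm.conf."""
    names = set()
    for line in mdadm_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) >= 2 and fields[0].upper() == "ARRAY":
            names.add(fields[1])
    return names


def active_arrays(mdstat_text):
    """Return /dev/mdN names for arrays listed in /proc/mdstat."""
    return {
        "/dev/" + m.group(1)
        for m in re.finditer(r"^(md\S*)\s*:", mdstat_text, re.MULTILINE)
    }


def missing_arrays(mdadm_conf_text, mdstat_text):
    """Arrays named in mdadm.conf that /proc/mdstat does not list."""
    return sorted(configured_arrays(mdadm_conf_text) - active_arrays(mdstat_text))


if __name__ == "__main__":
    try:
        with open("/etc/mdadm.conf") as conf, open("/proc/mdstat") as mdstat:
            print(missing_arrays(conf.read(), mdstat.read()))
    except OSError as exc:
        print("could not read configuration:", exc)
```

An empty list would suggest the configured arrays are all present, which would support the view that the reported events are spurious.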