Operating System - Linux

ICE-Linux mond issues with mdadm

 
Dave McLean
Occasional Advisor

ICE-Linux mond issues with mdadm

I have installed ICE-Linux 2.11. After running Options --> Configure ICE-Linux Management Services on the RHEL 5 nodes, mond starts up and the following Critical alerts occur every 15 minutes.

Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md0
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md2
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md1
Nov 4 14:56:59 usorl03p307 mdadm: DeviceDisappeared /dev/md0


Stopping mond stops the messages.
/etc/init.d/mond stop
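
Roughly how I confirmed the correlation (standard commands only; the grep filter is just for readability):

tail -f /var/log/messages | grep mdadm &
/etc/init.d/mond stop       # the DeviceDisappeared messages stop
/etc/init.d/mond start      # they come back on the next 15-minute pass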
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Dave,

These critical alerts are associated with the "Syslog Alerts" Service, correct?

I'd like to see if I can reproduce this. What version of RH5 do you have installed on your managed nodes (e.g. 32bit or 64bit; update 1 or 2)?

If you're not interested in seeing these mdadm critical alerts you should be able to stop the alerts by modifying the /opt/hptc/nagios/etc/syslogAlertRules file.

Try this and let me know if the alerts stop.

Edit syslogAlertRules (make a backup copy first) and change the mdadm rule to look as follows (i.e. add DeviceDisappeared to the list of mdadm events to ignore).

rule mdadm_errors {
name (! /(NewArray)|(SparesMissing) (DeviceDisappeared)/)
relevance ($subsystem =~ /mdadm/)
format "$timestamp $message"
}
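
To take the backup copy first, a plain cp is enough (the .bak name is just a suggestion):

cp -p /opt/hptc/nagios/etc/syslogAlertRules /opt/hptc/nagios/etc/syslogAlertRules.bak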

Thanks,
Donna





Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Thanks for the quick reply, Donna. The HP case number for this issue is 4606099605. There are lots of logs and sysreports attached to the case if you can pull it up.

The RHEL version on the node is RHEL 5.4 x86_64, on BL495G5 blades in a C7000 chassis.

I have been working with Mitch on other issues as well, but not this one.

We are interested in seeing valid mdadm alerts, but these are not valid and they start after mond is started.

I will make your suggested changes and report back.
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

By chance, should there be a "|" between (SparesMissing) and (DeviceDisappeared)?

Maybe it should be: (SparesMissing)|(DeviceDisappeared)/)
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Yes. You need to add the |.

rule mdadm_errors {
name (! /(NewArray)|(SparesMissing)|(DeviceDisappeared)/)
relevance ($subsystem =~ /mdadm/)
format "$timestamp $message"
}
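
If it helps to see why the "|" matters, here is a quick grep -E check (just an illustration, not part of the product). With the alternation, a lone DeviceDisappeared line matches the pattern, so the rule's negation drops it; without the "|" the pattern would only match "SparesMissing" followed by "DeviceDisappeared" on the same line:

echo "mdadm: DeviceDisappeared /dev/md0" | grep -E '(NewArray)|(SparesMissing)|(DeviceDisappeared)'   # matches
echo "mdadm: DeviceDisappeared /dev/md0" | grep -E '(NewArray)|(SparesMissing) (DeviceDisappeared)'   # no match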


Donna
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

And I should have noted that by making this edit you will still continue to get other mdadm alerts, just not DeviceDisappeared alerts.

Donna
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

The change did stop the alerts, but /var/log/messages is still filling up every 15 minutes with the bogus messages, which start when the mond service is started.

mond -> /opt/hptc/supermon/etc/init.d/mond-setup

With mond stopped there are no more messages generated in /var/log, so there is something that ICE-Linux (supermon) is doing that is causing the messages to occur in the first place.

We need to find the root cause of the messages.

I can provide you a virtual room connection if it would help.
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Here's what's happening inside Nagios/supermon.

On the CMS, vi /opt/hptc/nagios/etc/nagios_vars.ini. In this file you will see mdadminfo and MDADMCOLLECTIONPERIOD.

MDADMCOLLECTIONPERIOD is set to 15 minutes, which means that on the target nodes supermon will call /opt/hptc/mdadm/sbin/getMdadmEvents every 15 minutes. You can change this collection period to anything you like.
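
For example, the entry might look something like this in nagios_vars.ini (the exact syntax and value shown here are only an illustration, so check your file):

MDADMCOLLECTIONPERIOD=15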

If you log in to one of your target nodes, you can look at /opt/hptc/mdadm/sbin/getMdadmEvents, which calls mdadm-handler. mdadm-handler sends all messages returned by /sbin/mdadm to syslog.

We recently fixed an issue in our next IC-Linux release (V6.0) where this script was failing because it was being run as nagios and not root, so I'm wondering if you're hitting that issue.

Can you run a test for me? On the target node, run /opt/hptc/mdadm/sbin/getMdadmEvents (as root), tail /var/log/messages, and let me know what you see.

Then log in as nagios (su - nagios), run getMdadmEvents, and let me know what you see in /var/log/messages.

In regards to the DeviceDisappeared event, do you think that /sbin/mdadm is incorrectly reporting this error? Or has the device really disappeared?

One workaround I can think of is to modify mdadm-handler to check for the DeviceDisappeared event and not call syslog.
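
Assuming mdadm-handler is a plain shell script (I'm only sketching this; mdadm --monitor passes the event name as the first argument and the array device as the second), that check could look roughly like:

# hypothetical guard near the top of mdadm-handler
if [ "$1" = "DeviceDisappeared" ]; then
    exit 0   # drop this event instead of passing it on to syslog
fi
logger -t mdadm "$1 $2"   # stand-in for whatever the real handler does to log the event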

Donna
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

I ran getMdadmEvents as both root and nagios. When run as root, no messages are generated in /var/log/messages.

When run as nagios, each run of getMdadmEvents generates:

Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md0
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md2
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md0

I believe the messages are bogus and the devices are NOT disappearing.

dave

William Athanasiou
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Could you provide a description of your hardware and installation? Are you using software RAID? How many disks are installed? Is it possible you have a disk in the machine that used to be part of a SW RAID set? If you have an /etc/mdadm.conf file, can you include the contents?

I realize that's a lot of questions, but I'm just trying to figure out why mdadm would be reporting the error.
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

The hardware is a BL495G5 blade with two internal 64GB SSD disks. The OS is RHEL 5.4, mirrored across the two internal drives.

mdadm.conf


# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=aa4f5616:1f85a679:04e92872:8cb15fe7
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=6787038e:e6c35d9c:fa5a0916:9729dd5f

ARRAY /dev/md2 level=raid1 num-devices=2 uuid=c90d94d7:2f54ad8e:74248664:92872716

dave
William Athanasiou
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Well, that all looks right. Can you attach the output of "cat /proc/mdstat"?
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Output from /proc/mdstat

usorl03p309 ~ -1277> cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
208704 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
12586816 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
49721088 blocks [2/2] [UU]

unused devices: <none>
usorl03p309 ~ -1278>

dave
Mitchell Kulberg
Valued Contributor

Re: ICE-Linux mond issues with mdadm

Hey there Dave,

I'm curious. Are you able to reproduce this error on any servers other than this one? Any chance you've got USB devices on this server?

It's a long shot, but I've had questionable USB devices do that for real.

Thanks,
Mitch
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Dave,

After further investigation, it looks like this bogus DeviceDisappeared event is occurring because we are running mdadm as the nagios user. That happens because we changed mond (which calls getMdadmEvents) to run as nagios instead of root for security purposes. However, when we made this change we forgot to modify the mdadm call to use sudo, so there is a defect in V2.11: we should be using "sudo /sbin/mdadm" inside getMdadmEvents.

This defect is fixed in the next IC-Linux release (V6.0) which should be available January 2010.

Do you know if Siemens is planning to move to V6.0 when it becomes available?

In the interim, you can manually work around this issue by making the following changes on every managed system. This is the exact same fix that will be available in our V6.0 release.

1) Add the following line to /etc/sudoers on every managed system.
nagios ALL = NOPASSWD: /sbin/mdadm

And

2) Add "sudo" to the following line in /opt/hptc/mdadm/sbin/getMdadmEvents

`/usr/bin/sudo /sbin/mdadm --monitor --scan --program=/opt/hptc/mdadm/sbin/mdadm-handler --oneshot`;
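
A couple of optional checks after making the edits (standard commands, nothing ICE-specific; use visudo when editing /etc/sudoers):

sudo -l -U nagios   # confirm the NOPASSWD entry for /sbin/mdadm is in place
su - nagios -c /opt/hptc/mdadm/sbin/getMdadmEvents
tail /var/log/messages   # the DeviceDisappeared lines should no longer appear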

Let me know if this helps.

Thanks,
Donna
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Thanks, Donna. That's sort of what it was looking like, since running as root seemed to work OK. It's tough doing root-level tasks while maintaining security at the same time.

I'll give your suggestions a try and report back to you.


dave
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Looks like the sudo trick worked.

Ready for another one? Something is trying to open /dev/mcelog at 15-minute intervals and getting permission denied.

Nov 10 20:28:27 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:58:27 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied


dave

Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Glad to hear that did the trick.

The mcelog event is the exact same issue, so you need to apply the same workaround.

1) Add /usr/sbin/mcelog to /etc/sudoers (a sketch of the line is below, after the example) and
2) Add /usr/bin/sudo to the following line in /opt/hptc/mcelog/sbin/getMcelogEvents.

e.g.
`/usr/bin/sudo /usr/sbin/mcelog --syslog`;
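
Mirroring the mdadm entry above, the sudoers line for step 1 would presumably be:

nagios ALL = NOPASSWD: /usr/sbin/mcelog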

These were the only two sudo issues fixed for V6.0, so you should be all set now.

Donna
Dave McLean
Occasional Advisor

Re: ICE-Linux mond issues with mdadm

Donna,

Applied the changes for mcelog also.

The last issue I'm working on with Mitch so far is that the wrong system name is being picked up when multiple IPs are plumbed on the same NIC. Mitch should have all the details, but maybe I'll open a new forum thread on this one as well.

Thanks for your support.

dave
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Dave,

Mitch described the NIC/hostname issue to me. I'm going to try and reproduce it and will let you know what I find.

Donna
Donna Firkser
Regular Advisor

Re: ICE-Linux mond issues with mdadm

Dave,

I defined multiple IP addresses (aliases on eth0) on managed system pluto as shown below, and after I discovered it with SIM, I correctly see the one IP address for eth0 and the host name pluto in SIM.

Is this configuration similar to your multi-IP configuration? Please open a new forum entry for this discussion.

[root@poseidon image]# mxnode -ld pluto
System name: pluto
Host name: pluto.usa.hp.com
IP addresses: 16.118.197.34
OS name: LINUX


[root@pluto ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
inet addr:16.118.197.34 Bcast:16.118.207.255 Mask:255.255.240.0
inet6 addr: fe80::216:35ff:fec6:c8f6/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:6602599 errors:0 dropped:0 overruns:0 frame:0
TX packets:120564 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:651843140 (621.6 MiB) TX bytes:17638634 (16.8 MiB)
Interrupt:169 Memory:f6000000-f6012800

eth0:0 Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
inet addr:16.118.197.163 Bcast:16.255.255.255 Mask:255.0.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:169 Memory:f6000000-f6012800

eth0:1 Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
inet addr:16.118.198.249 Bcast:16.255.255.255 Mask:255.0.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:169 Memory:f6000000-f6012800

eth0:2 Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
inet addr:16.118.199.254 Bcast:16.255.255.255 Mask:255.0.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:169 Memory:f6000000-f6012800

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:613664 errors:0 dropped:0 overruns:0 frame:0
TX packets:613664 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:238774063 (227.7 MiB) TX bytes:238774063 (227.7 MiB)

Thanks,
Donna