Operating System - HP-UX

HPVM guests as Serviceguard nodes- monitoring critical resources

 
RSA
Advisor

HPVM guests as Serviceguard nodes- monitoring critical resources

Given the following environment: HP-UX 11i v3 September 2011 (hosts and guests), Integrity VM 4.30, Serviceguard 11.20, SGeSAP 05.10

 

Two HPVM guests set up as Serviceguard nodes:
- Whole disks used as backing storage (AVIO devices)
- For each HPVM guest two fibre channel SAN LUNs as backing storage, one for OS and one for application including data
- HA failover capabilities needed like in traditional Serviceguard cluster configurations
- Application failover rather than HPVM guest failover

 

What is the best approach to monitor critical resources like physical volumes used in the volume groups for the HPVM guest operating systems as well as the application binaries and data configured in the Serviceguard package?

 

My idea would be to use SFM/WBEM, for example a script executing the CIMUtil -e root/cimv2 HP_DiskDrive command to get the volume group disk status. This could be added to the package configuration using the sg/service and sg/generic_resource modules. But what happens if the filesystem storing the monitoring script isn't available anymore? I/O timeouts...
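Roughly, the monitoring service I have in mind would look something like the sketch below (script path, resource name and the string matching on the CIMUtil output are just placeholders, and it assumes the generic resources feature and the cmsetresource command of Serviceguard A.11.20):

#!/usr/bin/sh
# Hypothetical generic-resource monitor; names and output parsing are illustrative only.
RESOURCE=vg_disk_status        # generic_resource_name in the package configuration
INTERVAL=60

while true
do
    # Query disk status via SFM/WBEM.
    OUTPUT=$(CIMUtil -e root/cimv2 HP_DiskDrive 2>/dev/null)

    # If the query fails or reports a degraded/failed drive, mark the resource
    # down so Serviceguard can react; otherwise mark it up.
    if [ $? -ne 0 ] || echo "$OUTPUT" | grep -Eqi "degraded|failed"
    then
        /usr/sbin/cmsetresource -r $RESOURCE -s down
    else
        /usr/sbin/cmsetresource -r $RESOURCE -s up
    fi

    sleep $INTERVAL
done

Of course the script itself still lives on a filesystem, which is exactly the weakness I mentioned above.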

 

What happens if, for example, the storage array providing the LUNs for the HPVM guest operating system and application data goes down? Everything in memory will continue to run: the server is still pingable, and cmcld, which determines cluster membership, is still running in memory and able to send heartbeat packets. But everything else... from my experience the operating system will be in a "frozen" state. How is Serviceguard running on HPVM guests able to detect these issues if the backing storage isn't available anymore? In my test environment no failover occurred; cmviewcl -v showed the package still up and running even though the critical resources, the volume group disks, weren't available anymore.

 

I'm looking for any thoughts/ideas on how-to improve the Serviceguard package failover configuration in the described scenario.

 

Thanks,

RSA

16 REPLIES
asghar_62
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

Simply monitoring each physical disk in a Serviceguard cluster does not provide adequate monitoring for volumes managed by Veritas Volume Manager from Symantec (VxVM), or logical volumes managed by HP-UX Logical Volume Manager (LVM), because a physical volume failure is not always a critical failure that triggers failover (for example, the failure of a mirrored volume is not considered critical). For this reason, it can be very difficult to determine which physical disks must be monitored to ensure that a logical volume is functioning properly. The HP Serviceguard Volume Monitor provides a means for effective and persistent monitoring of storage volumes.

The LVM monitoring capability (cmvolmond) is relatively new for Serviceguard A.11.20, and requires the September 2010 patch.

cmvolmond replaces cmvxserviced, combining the VxVM monitoring capabilities of cmvxserviced with the new capabilities needed to support LVM monitoring. Although cmvxserviced will still work in A.11.20, HP recommends you use cmvolmond instead.

melvyn burnard
Honored Contributor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

So if I get this right, you have two HP Integrity Virtual Machines, both are set up as Serviceguard nodes.

Am I correct to assume they are on different Hosts?

I also assume that the Guest OS resides on the SAN infrastructure?

 

The way I see it, you would need to monitor the storage on the Integrity Virtual Machine Host, and then pass this down to the Serviceguard nodes.

The challenge is, as you state, that the SG node has now lost its backing store, so how does this help?

Off the top of my head I would say that you are trying to cover an MPOF rather than an SPOF, which becomes more challenging.

Therefore I would suggest that the Virtual Machine Host should monitor the SAN storage, and if it ALL goes down on that particular Host, the monitoring should somehow trigger a TOC on the Virtual Machine, FORCING a failover.
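Very roughly, something like this running on the VM Host (the guest name, device files and the hpvmstop options are only illustrative assumptions, and a real script would need much more care before forcibly stopping a guest):

#!/usr/bin/sh
# Rough host-side sketch: if every backing-store LUN of the guest is unreachable,
# force the guest down so its Serviceguard package can start on the other node.
GUEST=vmguest1                               # example guest name
LUNS="/dev/rdisk/disk10 /dev/rdisk/disk11"   # example backing-store devices

FAILED=0
for LUN in $LUNS
do
    # A small raw read as a path check; note that on a dead FC path this read
    # may itself hang until the I/O times out, so it would need its own timeout.
    if ! dd if=$LUN of=/dev/null bs=1024 count=1 >/dev/null 2>&1
    then
        FAILED=$((FAILED + 1))
    fi
done

# Only act if ALL of the guest's backing-store LUNs are gone.
if [ $FAILED -eq $(echo $LUNS | wc -w) ]
then
    # Assuming hpvmstop -F (no prompt) and -h (hard stop) on this HPVM version.
    /opt/hpvm/bin/hpvmstop -P $GUEST -F -h
fi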

This may NOT be a "neat" solution, but it is what I can think of right now.

The other option, of course, is to put the VM Guest OS disks on internal storage or as files on each Host, which would remove the issue of the Guest OS being unable to respond to any triggers or events that could be set up inside the Guest OS.

 

Just my current two pence worth

 

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
RSA
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

Hello Asghar Riahi_1,

 

I will definitely look into the new LVM monitoring capability (cmvolmond). My Serviceguard environment is running on A.11.20; I will also make sure that I have at least the September 2010 patch installed.

 

Thanks,

RSA

 

Quick update after some initial testing with cmvolmond. I added the following lines to my SG package configuration:

 

service_name                            lvol_monitor
service_cmd                             /usr/sbin/cmvolmond -O /var/adm/cmcluster/log/lvol1.log -D 1 -t 61 /dev/vg_cluster/lvol1
service_restart                         None
service_fail_fast_enabled               yes
service_halt_timeout                    90

 

Checked and applied the configuration to the package. Monitoring seems to work:


# tail -f /var/adm/cmcluster/log/lvol1.log
Jul 24 21:19:10 Sleeping 61
Jul 24 21:20:11 Sleeping 61

 

Nevertheless, with both fibre connections disabled, monitoring also "freezes" until I re-enable the connections:


Jul 24 21:27:54 Timed-out 29159, terminating
Jul 24 21:27:54 Terminated cmcheckvx
Jul 24 21:27:55 Sleeping 61

 

The package didn't switch from the primary node to the secondary node in this case. Maybe some further fine-tuning is required.

RSA
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

Hello Melvyn Burnard,

 

Yes, you are correct: two HP Integrity VMs, both are set up as Serviceguard nodes and they run on different hosts. The Guest OS resides on the SAN infrastructure and the disks are passed down from the hosts to the guests. I'm trying to cover a full storage array outage and/or redundant SAN switch outage, which is more than a SPOF and it's getting pretty difficult to monitor this properly.

 

I also had the idea to run a second Serviceguard installation on the Integrity hosts itself to monitor the disks as well as fibre channel connections. This could be a way to force a failover. Would be a little difficult to manage, but this could be a solution... even though it adds a lot of complexity. 

Internal storage for the guest OS would be an alternative too, but this makes it more difficult to move the guest from one host to another in the case of Integrity VM host maintenance.

 

I'm trying to figure out why the guest OS and Serviceguard are "freezing" rather than causing a reboot. Maybe tweaking timeout parameters for PVs and LVs could improve the situation.
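For example, something along these lines (the device file and the 60-second value are just placeholders; the right values would have to be balanced against the cluster and package timeouts):

# Set an I/O timeout in seconds on the physical volume (hypothetical device file):
pvchange -t 60 /dev/disk/disk10

# The LV timeout defaults to 0, i.e. LVM retries I/O indefinitely, which is one
# reason everything just "hangs"; a finite value makes the I/O fail instead,
# giving a monitor something to react to:
lvchange -t 60 /dev/vg_cluster/lvol1

# Verify the settings:
pvdisplay /dev/disk/disk10 | grep -i timeout
lvdisplay /dev/vg_cluster/lvol1 | grep -i timeout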

 

Thanks,
RSA

asghar_62
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

 

RSA,

 

Are you using the Serviceguard tool kit for Integrity Virtual machines?

 

It is highly recommended to install the VM guest management software, especially on VM guests functioning as Serviceguard nodes, so that Serviceguard can determine an optimal io_timeout_extension value (otherwise, Serviceguard assumes the most conservative value of 70 seconds, unnecessarily lengthening the cluster recovery time).

 

The io_timeout_extension parameter is set internally by Serviceguard and is not configurable by the user; however, its value can be viewed using the cmviewconf or cmviewcl -v -f commands, or can be found in the system log file.

 

For more information regarding the Serviceguard toolkit for Integrity Virtual Machines look at:

 

http://h20000.www2.hp.com/bizsupport/TechSupport/SupportTaskIndex.jsp?lang=en&cc=us&taskId=101&prodClassId=10008&contentType=SupportManual&docIndexId=64255&prodTypeId=18964&prodSeriesId=5196848

RSA
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

Asghar,

 

In this case I'm not using the Serviceguard toolkit for Integrity Virtual Machines. Serviceguard is protecting the application in this environment, not the guests themselves. The Serviceguard nodes are the actual Integrity VM guests. Is there a way to utilize the Serviceguard toolkit for Integrity Virtual Machines in my setup? I thought that's only an option if one sets up the Integrity VM guests as Serviceguard packages, which is not the case here.

 

I can confirm that the VM guest management software is installed on both HPVM guests and that its version matches the HPVM host software.

 

# cmviewcl -v -f line |grep io_timeout
io_timeout_extension=40000000
configured_io_timeout_extension=0

 

Thanks,

RSA

asghar_62
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

RSA,

An important distinction between VM as Serviceguard package and VM as Serviceguard node configurations is that VM as Serviceguard node configurations only support whole-disk VM backing stores. One reason for this restriction is that it is not possible to set timeouts on logical volumes or file systems presented as backing stores to the VM guest, and any errors generated from these types of backing stores are not passed through the virtualization layers from the VM host to the VM guest, so Serviceguard running in the VM cannot react to these conditions. Another reason relates to disk I/O performance and the speed at which I/O requests can be completed prior to a VM node failure, which can affect cluster reformation time. Please read the document “Designing high-availability solutions with HP Serviceguard and HP Integrity Virtual Machines”; I provided the link in my previous posting.

RSA
Advisor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

Asghar,

 

Yes, you are right: Only whole disks are supported for the VM backing store. I'm sorry if this wasn't clear in my earlier messages. I'm using one whole disk (SAN LUN) for the guest VM operating system and one whole disk (SAN LUN) for the application. I read the document you mentioned and I'm aware of these constraints.

 

Thanks,

RSA

jim_curtis
Frequent Visitor

Re: HPVM guests as Serviceguard nodes- monitoring critical resources

SG does not support nested clustering where you have a cluster of VM hosts and a second cluster of guests on those hosts running packages. The problem, as you recognized, is the complexity of such a configuration. It would lead to unpredictable failover scenarios, as you basically have "2 brains" trying to decide what actions to take but with no real way to communicate between them to coordinate actions.