
2 Part Question

 
SOLVED
Jason Berendsen
Regular Advisor


I have an N class server running HP-UX 11.0 and MC ServiceGuard. Last night, for the second time, the server locked up during the Omniback backup session. By locked up I mean that I could ping the server but could not log in, either remotely or from the console. Both times, Omniback was trying to back up /var when the lockup occurred. I received no complaints from ITO about lost communication with the ITO control agent, and nothing in any log shows a problem. The only clue I have is that the disk drive lights on the front of the server are not in sync, as they should be on a mirrored system. Could this be an intermittent problem with a drive associated with vg00? Any other ideas as to what would cause this situation?

Question 2:
For some reason I have two logs in my cluster directory: ctl.log and something called pkg.ctl.log. The ctl.log is where I see the starting and stopping of the cluster. This log didn't show any problems either time the server had its failure. Yet pkg.ctl.log has the most recent timestamp. That log holds only the following information, over and over:
Enabling package switching on clms-mgt
cmmodpkg : Warning: Package clms-mgt is already able to be switched.
cmmodpkg : Completed successfully on all packages specified.

Any idea what this secondary log is and what it is used for?

Thanks,

Jason
8 REPLIES
LucianoCarvalho
Respected Contributor
Solution

Re: 2 Part Question

Hi,

Part 1
I've seen this situation many times, and in almost all cases the problem was a disk failure, not necessarily in vg00; it could be in another volume group.
Look at /var/adm/syslog/OLDsyslog.log and /var/adm/syslog/syslog.log and try to find messages like:
scsi reset or
scsi timeout or
power failed
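
If grepping both files by hand gets tedious, the search can be sketched in Python as below. This is only a rough sketch: the three patterns are simply the phrases listed above matched case-insensitively, the function names are made up for illustration, and the exact wording of real driver messages may differ, so treat an empty result as "nothing matched these phrases" rather than proof of healthy disks.

```python
import re
import sys

# Phrases from the list above; the real message wording may vary (assumption).
SUSPECT = re.compile(r"scsi\s+reset|scsi\s+timeout|power\s+failed", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any phrase."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced instead of failing."""
    with open(path, "r", errors="replace") as handle:
        return find_suspect_lines(handle)


if __name__ == "__main__":
    # Log paths are passed on the command line; nothing is scanned by default.
    for path in sys.argv[1:]:
        for number, text in scan_file(path):
            print(f"{path}:{number}: {text}")
```

Run it with both log paths as arguments, for example /var/adm/syslog/OLDsyslog.log and /var/adm/syslog/syslog.log, or against copies of them on another machine where Python is available.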

Part 2
I think this could be something to do with the switching flag for the package. The system tried to enable switching, but it was already enabled. Maybe that's normal. Can you attach the pkg.ctl.log file?

regards
Jason Berendsen
Regular Advisor

Re: 2 Part Question

Luciano,

I had already checked both the previous and the current syslog; neither had any entries that would be associated with a power failure or a failing drive.
LucianoCarvalho
Respected Contributor

Re: 2 Part Question

Ok Jason.

So the lockup could have been caused by high load during the Omniback session. Next time it happens, you can try to find out which part of the system is heavily utilized with tools like these (if you already have a login session open):
#vmstat
#top
#sar
#glance

Regards

Ron Cornwell
Trusted Contributor

Re: 2 Part Question

Was the backup still incrementing? Did you reboot the system, or did it clear up after the backup? Do you have diagnostics/Predictive/ISEE installed? If so, look through there for HW errors.
Jason Berendsen
Regular Advisor

Re: 2 Part Question

Ron,

The backup of the /var filesystem hung at around 01:00AM and finally timed out at 05:00AM. Nothing was backed up between these times. We do have diags set up, but I see no indications from them of hardware error.
John Poff
Honored Contributor

Re: 2 Part Question

Hi Jason,

I've never seen a ctl.log file before. Does it have a current timestamp? What command are you using to start and stop your cluster? Are the stop and start times for the cluster in that ctl.log just when you do it manually, or at boot time, or both?

As for the hangup, maybe it isn't a disk error. It could be that somebody or something has written an astronomical number of files in a directory somewhere under /var, and Omniback is choking when it tries to figure out what to back up. You could cd to /var and do a 'find . | wc -l' to see how many files are there. Otherwise, I would suspect a bad disk, and it could be the worst kind: one that is bad enough to cause problems but not bad enough to fail completely.
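
The 'find . | wc -l' count gives one total; to see which directory holds the bulk of the files, a per-directory count helps. Here is a minimal Python sketch of that idea. The function names are made up for illustration, it counts only non-directory entries (so the total will be lower than what find reports), and it does not stop at filesystem boundaries, so other filesystems mounted under the starting point are included.

```python
import os
import sys
from collections import Counter


def count_files_per_directory(root):
    """Count the non-directory entries directly inside each directory."""
    counts = Counter()
    # Unreadable directories are skipped silently instead of aborting the walk.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        counts[dirpath] = len(filenames)
    return counts


def largest_directories(root, limit=10):
    """Return the `limit` directories holding the most files, largest first."""
    return count_files_per_directory(root).most_common(limit)


if __name__ == "__main__":
    # The starting directory is given on the command line, e.g. /var.
    if len(sys.argv) > 1:
        for path, count in largest_directories(sys.argv[1]):
            print(f"{count:10d}  {path}")
```

A directory with hundreds of thousands of entries at the top of that list would be a good candidate for what the backup is stuck on.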

JP
Darren Prior
Honored Contributor

Re: 2 Part Question

Hi Jason,

I'm also confused about the ctl.log. Are you still seeing the cluster start/stop stuff in syslog, or has someone perhaps redirected it to this ctl.log? Have a look at your /etc/syslog.conf.

I'd also expect the package control log to contain the package name (i.e. clms-mgt.cntl.log), and for it to be in the /etc/cmcluster/ directory.

Can you post the revision of ServiceGuard you are running, and the revision of the SG patch you have?

Going back to your 1st question - have any changes been made to the system recently that could have impacted Omniback?

regards,

Darren
Calm down. It's only ones and zeros...
Decio Miname
Frequent Advisor

Re: 2 Part Question

Q1: Lack of available memory is another possibility; it can hang servers the way you describe. A monitoring script would help you know what was going on when the server hung.
Q2: Have you already checked MCSG's configuration files? One guess: if you had one pkg log file for each package to save specific package logs, they would be easily readable since other packages' logs and system logs are not mixed up in a single file. Maybe that's your case?
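
For the monitoring script mentioned under Q1, something along these lines might do as a starting point. It is a rough Python sketch, not a tested HP-UX tool: the function names, the log path, and the sampling values in the example are all assumptions, and the commands to run are whatever you choose to pass in (vmstat is shown only because it was suggested earlier in the thread). Each sample is flushed to disk right away so the last entries survive a hang.

```python
import os
import subprocess
import time
from datetime import datetime


def snapshot(command, timeout=30):
    """Run one command and return its output, or a short error note."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except FileNotFoundError:
        return f"command not found: {command[0]}\n"
    except subprocess.TimeoutExpired:
        return f"command timed out after {timeout}s: {' '.join(command)}\n"


def append_snapshot(log_path, command):
    """Append a timestamped header plus the command output to the log."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} {' '.join(command)} ===\n")
        log.write(snapshot(command))
        log.flush()
        os.fsync(log.fileno())  # push the sample to disk before a possible hang


def monitor(log_path, commands, interval_seconds, iterations):
    """Sample every command `iterations` times, sleeping between rounds."""
    for round_number in range(iterations):
        for command in commands:
            append_snapshot(log_path, command)
        if round_number + 1 < iterations:
            time.sleep(interval_seconds)


# Example with assumed values: sample "vmstat 1 2" once a minute for 8 hours.
# monitor("/tmp/hang-monitor.log", [["vmstat", "1", "2"]], 60, 480)
```

Starting it shortly before the backup window and reading the tail of the log after the next hang should show whether memory or something else was running out just before the system stopped responding.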