Operating System - HP-UX

Rough weekend: Looking for diagnostic help

 
Steven E. Protter
Exalted Contributor


Timeline:
9:56 a.m. Saturday, the D380/2 drops off the network according to SNMP. It's the only box being monitored, but it came back 4 seconds later according to SNMP. That is the last time it was pingable until 10 a.m. this morning.

This box's console locked up and had to be reset with the old trick of powering off and holding down the d key at power-up.

There was nothing useful in the syslog.log or OLDsyslog.log on any of my 4 systems.

At 2:00 a.m. Monday all Veritas backups failed on a network timeout. The L class boxes remained on the network and were user accessible on Monday morning.

At 9:30 a.m. Saturday morning, the building next door cut its own power due to a construction accident.

Our switches show no record of an event or power cycle.

I have run mstm exercise on all relevant hardware and it checks out perfectly fine. Maybe I have a flaky fiber card on one box.

So.

Given the same set of circumstances, what would you do next?

It seems obvious there was a power problem but no Windows servers were affected, only my serial console. Because of the nature of the problem next door, nobody believes there was a voltage drop or surge in this building.

I'm stumped.

What would you do next?

Points awarded for all suggestions, even if I already tried it and forgot to post it.

Bunnies for the step or steps that unravel this mystery.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
9 REPLIES
Jean-Luc Oudart
Honored Contributor

Re: Rough weekend: Looking for diagnostic help

Hi SEP,

1) I suppose all your equipment is on UPS.
Is it all on the same UPS?

2) "Monday all Veritas backups failed on a network timeout"
What is the defined timeout?

Are you sure the 2 events are related, or is it speculation?

Regards,
Jean-Luc
fiat lux
D Block 2
Respected Contributor

Re: Rough weekend: Looking for diagnostic help

Hmm, not certain why the console would hang. If this occurs again, hit the TOC button. There might be a network hang on this D-class. If the D-class decides to reset the network, this might cause the entire system to hang for 5 minutes or so, which would be why your ping was delayed.

Also, I would check the Fibre Channel card types and see if there is a newer FC patch. I would do some analysis on the NetBackup logs to see if there are any hangs on media or timeouts.
Golf is a Good Walk Spoiled, Mark Twain.
Steven E. Protter
Exalted Contributor

Re: Rough weekend: Looking for diagnostic help

Good suggestions.

Can't do TOC, because it's a production server. It is possible someone was fiddling around and hit that button, but it would not have affected other servers.

Jean:
1) The equipment is on different UPSes; the D380's UPS may have a little more hardware on it than it is rated to handle.

2) Network timeout is defined as Veritas trying to back up for 5 minutes. That is user defined. We believe the problem is related. The console is on building power. The one next to it was rendered useless by me switching the L2000 boxes to web consoles.

The L2000 boxes were off the network as far as Veritas was concerned but were still accessible to users.

Seems like we're headed to an unsolvable head scratcher. I have the Veritas man and the Network man trolling for more data.

SEP
Marvin Strong
Honored Contributor

Re: Rough weekend: Looking for diagnostic help

This may sound crazy, but did you hit Ctrl-B (I think) on the console and check for errors in there, on each of your servers?
It's been a while since I have been on a D server, but I think you can still do that.

There could be something in those logs.

Also check the master and client logs for Veritas; sometimes there is some useful info hidden in with the junk.

In my experience with NetBackup, error 54 is, I think, one of the network timeout errors. It seems to be a catchall.
I had problems with that error even when the network had nothing to do with it.

Not sure how your setup is there or which server is your backup master. Maybe recycle the NetBackup daemons on all your servers.
Steven E. Protter
Exalted Contributor

Re: Rough weekend: Looking for diagnostic help

D Class boxes do not have GSP.

I am checking the others.

Bunny potential.

SEP
Steven E. Protter
Exalted Contributor

Re: Rough weekend: Looking for diagnostic help

GSP shows nada. No events. Still a great idea; I forgot to check that.

UPS worked at least on the L boxes.

SEP
Chris Vail
Honored Contributor

Re: Rough weekend: Looking for diagnostic help

Steve---
We had a similar unexplainable incident a few years ago on a V2600. We eventually traced it to (or at least blamed it on) the security people. They did a full port scan against the production network (without telling anyone) and completely hosed it. The V2600 dumped core and crashed. Of course they blamed us for having downrev/unpatched systems, but that is just finger-pointing. Check the syslog again for broken pipes and buffer overflows. Even one or two might indicate some sort of attack--even if unintentional.
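
The syslog check suggested above could be scripted roughly as follows. This is only a sketch: the two keywords come from the suggestion itself, the regular expression and the function name `suspicious_lines` are hypothetical, and the exact wording of such messages on any given system is an assumption to adjust.

```python
import re

# Assumed keywords taken from the suggestion above; real messages may be
# worded differently, so extend this pattern as needed.
SUSPICIOUS = re.compile(r"broken pipe|buffer overflow", re.IGNORECASE)

def suspicious_lines(lines):
    """Return (line_number, text) pairs for every line matching SUSPICIOUS."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open the log (for example syslog.log or OLDsyslog.log) and pass the file object to `suspicious_lines`; even one or two hits near the time of the outage would be worth a closer look.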


Chris
Geno Church_1
Valued Contributor

Re: Rough weekend: Looking for diagnostic help

SEP...

On all your HP boxes that were hit with this network timeout, did the nettl log show anything? The nettl log is located in /var/adm and is called nettl.LOG000. To read it, simply run netfmt -t 10 -v nettl.LOG000 and see if there is any useful info there about the disconnects. I would look for time matches between the servers to see if you can pinpoint the exact time this occurred. Just grasping at straws ;)
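
Looking for time matches between servers can be sketched as below, assuming the event timestamps have already been pulled out of each server's log by hand or by another script. The function name `correlate`, the 60-second window, and the host names in the usage note are all hypothetical, and nothing here parses netfmt output. It also assumes the servers' clocks are reasonably in sync.

```python
from datetime import timedelta

def correlate(events, window=timedelta(seconds=60)):
    """Group events from several hosts that fall close together in time.

    events: dict mapping host name -> list of datetime objects.
    Returns a list of (cluster_start, sorted_host_names) for every cluster
    in which at least two different hosts logged something within `window`
    of the cluster's first event.
    """
    # Flatten to one time-ordered list of (timestamp, host).
    flat = sorted((ts, host) for host, stamps in events.items() for ts in stamps)
    clusters = []
    i = 0
    while i < len(flat):
        start = flat[i][0]
        hosts = set()
        j = i
        # Collect everything within `window` of the first event in the cluster.
        while j < len(flat) and flat[j][0] - start <= window:
            hosts.add(flat[j][1])
            j += 1
        if len(hosts) >= 2:
            clusters.append((start, sorted(hosts)))
        i = j
    return clusters
```

Called as `correlate({"d380": [...], "l2000a": [...]})`, it returns only the moments when more than one box saw trouble, which is the kind of match that would point at the network or power rather than at a single server.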

Geno
Steven E. Protter
Exalted Contributor

Re: Rough weekend: Looking for diagnostic help

All ideas will be pursued.

The box in question had a 12-month freeze on software upgrades in the hope that it would be pulled from production. I'm stuck with it for another year and patched it to December 2003 a week ago and will be upgrading applications and setting up EMS traps to get more data.

I think it's unlikely anyone ran a port scan; we're pretty much a Sabbath-observant Jewish business. My manager was on premises just after the initial event but insists nothing was touched and what he and his crew were doing could not have caused a problem.

I have no evidence this time, but will set up for evidence next time. Points will be awarded shortly.
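
One way to set up for evidence next time, alongside EMS traps, is to keep a record of exactly when each box changes between reachable and unreachable. The sketch below covers only the bookkeeping: how each host is probed (ping, SNMP poll, or anything else) is left out, and the function name `transitions` and the sample tuple layout are hypothetical.

```python
def transitions(samples):
    """Reduce a stream of reachability samples to state changes.

    samples: iterable of (timestamp, host, reachable) in time order,
             where `reachable` is True or False.
    Returns a list of (timestamp, host, "UP" or "DOWN") for each change.
    The first sample seen for a host only sets its baseline and is not
    reported.
    """
    last_state = {}
    changes = []
    for timestamp, host, reachable in samples:
        if host in last_state and last_state[host] != reachable:
            changes.append((timestamp, host, "UP" if reachable else "DOWN"))
        last_state[host] = reachable
    return changes
```

Feeding this from a probe that runs every few seconds on more than one box would show whether several servers dropped at the same moment or only one did.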

SEP