Operating System - HP-UX

superdome partitions down

 
siva0123
Trusted Contributor


Hi,

I have an HP 9000 Superdome with 3 partitions.

Suddenly two of the partitions went down, and the console log of one of them showed the following error.


**********************Auto-Port Aggregation/9000 Networking*****************@#%
Thu Jan 25 GMT 2007 12:38:43.547231 DISASTER Subsys:HP_APA Loc:00000
<1006> HP Auto-Port Aggregation product found that ports in failover
group lan901 are no longer connected to each other. Port 2 did
not receive any poll packets.

(N)ext or , (P)revious, ^B to exit to menu

And strangely, I'm not able to take control of the partitions through the GSP either.

I reset one of the partitions, but it stops at this point:

"Start host agent" .


Any help is appreciated.

Thanks,
Siva



HGN
Honored Contributor

Re: superdome partitions down

Hi

I think you want to boot the system in single-user mode and comment out or rename the APA startup script (/sbin/init.d/hpapa). Once the server is up, access it through the console, try to start the APA script, and find out why it is failing. You might also want to look at the two APA configuration files in /etc/rc.config.d:
hp_apaconf
hp_apaportconf

Rgds

HGN
siva0123
Trusted Contributor

Re: superdome partitions down

Hi all,

The partitions are up, and in fact there was a network (switch) problem, but that doesn't solve all my woes.

When I was not able to connect to the partitions, I tried to log into them using the GSP.
1. Strangely, I was able to log into only one partition; I didn't get the console login for the other two.

2. The VFP showed all three partitions having a heartbeat.

3. parstatus showed all three partitions as active.

4. Thinking that the partitions had hung, I reset one of them (I used RS and not TOC, so no dumps are available). It got stuck at one point saying "Starting host agent", but the machine came up after almost 40 minutes, by which time the network problem was resolved. Interestingly, the host agent service was the last one started in the boot process, as I could see from rc.log.

5. My worry is: if there is a network problem, why would the GSP not let me log into two of the partitions?

6. What are the possible causes that can prevent logging into the partitions from the GSP?

7. Is there any relation between this GSP problem and the APA errors logged?

Lanconfig file:
===============
--> /etc/lanmon/lanconfig.ascii:

# ********************************************************
# *********** LAN MONITOR CONFIGURATION FILE *************
# *** For complete details about the parameters and how **
# *** to set them, consult the lanqueryconf(1m) manpage **
# *** or your manual. **
# ********************************************************

NODE_NAME taapup01

POLLING_INTERVAL 10000000
DEAD_COUNT 3

FAILOVER_GROUP lan900
STATIONARY_IP 10.179.3.102
PRIMARY lan0 5
STANDBY lan1 3

FAILOVER_GROUP lan901
STATIONARY_IP 10.179.1.102
PRIMARY lan2 5
STANDBY lan3 3
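
As a rough aid for reading the timing parameters above, here is a minimal Python sketch that estimates how long LAN Monitor would go without poll packets before flagging a port. It assumes POLLING_INTERVAL is expressed in microseconds and that DEAD_COUNT is the number of consecutive missed polls; both should be confirmed against the lanqueryconf(1m) manpage. The function names are made up for illustration.

```python
# Sketch only: estimate the poll-loss detection window from lanconfig text.
# Assumption: POLLING_INTERVAL is in microseconds and DEAD_COUNT is the
# number of consecutive missed polls (verify in lanqueryconf(1m)).

LANCONFIG = """\
POLLING_INTERVAL 10000000
DEAD_COUNT 3
"""


def parse_timing(text):
    """Return (polling_interval_us, dead_count) from lanconfig-style text."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("POLLING_INTERVAL", "DEAD_COUNT"):
            values[parts[0]] = int(parts[1])
    return values["POLLING_INTERVAL"], values["DEAD_COUNT"]


def detection_window_seconds(text):
    """Seconds of missed polls before the port would be flagged."""
    interval_us, dead_count = parse_timing(text)
    return interval_us * dead_count / 1_000_000


print(detection_window_seconds(LANCONFIG))  # 30.0
```

Under those assumptions, the values in this file work out to about 30 seconds of missed poll packets before the "did not receive any poll packets" event would be logged.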

APA Statistics:
===============


LAN INTERFACE STATUS DISPLAY
Fri, Jan 26,2007 07:58:58

PPA Number = 900
Description = lan900 Hewlett-Packard LinkAggregate Interface
Type (value) = ethernet-csmacd(6)
MTU Size = 1500
Speed = 100000000
Station Address = 0x306e4a54cc
Administration Status (value) = up(1)
Operation Status (value) = up(1)
Last Change = 7714

Press <Return> to continue
Inbound Octets = 2147877465
Inbound Unicast Packets = 3904252297
Inbound Non-Unicast Packets = 124049569
Inbound Discards = 22366
Inbound Errors = 0
Inbound Unknown Protocols = 740649
Outbound Octets = 1529491229
Outbound Unicast Packets = 4183298218
Outbound Non-Unicast Packets = 16527
Outbound Discards = 0
Outbound Errors = 0
Outbound Unknown Protocols = 0
Specific = 0

LAN INTERFACE STATUS DISPLAY
Fri, Jan 26,2007 07:58:58

PPA Number = 901
Description = lan901 Hewlett-Packard LinkAggregate Interface
Type (value) = ethernet-csmacd(6)
MTU Size = 1500
Speed = 100000000
Station Address = 0x306e2d2a8f
Administration Status (value) = up(1)
Operation Status (value) = up(1)
Last Change = 7733

Press <Return> to continue
Inbound Octets = 1202288746
Inbound Unicast Packets = 4088047809
Inbound Non-Unicast Packets = 1236824993
Inbound Discards = 720235378
Inbound Errors = 64
Inbound Unknown Protocols = 1095399
Outbound Octets = 2734665484
Outbound Unicast Packets = 4260250021
Outbound Non-Unicast Packets = 1020856
Outbound Discards = 0
Outbound Errors = 0
Outbound Unknown Protocols = 0
Specific = 0



Please note that the second PPA (lan901) is showing 64 inbound errors.
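
The inbound discard counters differ far more between the two aggregates than the error counters do. The Python sketch below compares them using the values copied from the two displays above. It is only a rough comparison: the counters look like 32-bit values and may have wrapped, so the ratios are indicative at best. The dictionary keys and function name are illustrative, not part of any HP tool.

```python
# Sketch only: compare inbound discards per delivered packet for both PPAs.
# Counter values are copied from the two status displays above.
# Caveat: these look like 32-bit counters and may have wrapped.

STATS = {
    "lan900": {"in_unicast": 3904252297, "in_non_unicast": 124049569,
               "in_discards": 22366, "in_errors": 0},
    "lan901": {"in_unicast": 4088047809, "in_non_unicast": 1236824993,
               "in_discards": 720235378, "in_errors": 64},
}


def discards_per_delivered(counters):
    """Inbound discards divided by inbound packets delivered upward."""
    delivered = counters["in_unicast"] + counters["in_non_unicast"]
    return counters["in_discards"] / delivered


for name, counters in STATS.items():
    ratio = discards_per_delivered(counters)
    print(f"{name}: {ratio:.6%} discards per delivered packet, "
          f"{counters['in_errors']} inbound errors")
```

If the counters are trustworthy, lan901 is discarding a much larger share of inbound traffic than lan900, which may be worth raising alongside the 64 errors.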

Any suggestions on what might have caused this GSP issue?

Thanks,
Siva




Steven E. Protter
Exalted Contributor

Re: superdome partitions down

Shalom,

The GSP may need to be reset or could have been impacted by the switch issue.

Most GSPs will show up in cstm and can be tested for hardware problems.

I think a total reset and hardware test of the GSP should be sufficient. If the GSP is flaky or fails the hardware test, have it replaced.

SEP
Steven E Protter
siva0123
Trusted Contributor

Re: superdome partitions down

Steve,

The GSP was not frozen or hung. In fact, I was able to log into the GSP and do everything: collect the logs, view the VFP, and so on. It even allowed me a console login for one of the partitions. The only problem was that it didn't give me the console login prompt for the other two partitions.

If it were a switch problem or whatever, it should not have allowed console login for any of the partitions, right?

And strangely, it is giving me the console login prompt now that the switch issue has been resolved.

There definitely seems to be a relation to the switch issue, but how do I relate it? Especially with the GSP, which I assume never uses the external network interfaces connected to the switch to collect information about the partitions.


Aah! Now I have a question:

1. I believe the GSP doesn't use the network interfaces to collect information about the partitions, but

2. How does the GSP allow login to the partitions? Is it through the network interfaces that are normally used to connect to the partitions, or is there some other mechanism involved?

Thanks,
Siva

David Child_1
Honored Contributor
Solution

Re: superdome partitions down

The GSP communicates with the partitions through a dedicated communications channel. It does not use the network interfaces on the servers.

I have come across a similar issue before. When the switch had a problem, something on those two partitions may have caused the OS to become unresponsive. If the OS is unresponsive, the console will appear unresponsive too, even though the OS is still running. This would explain why the VFP still indicated an OS heartbeat.

The trick is to determine the cause. First, you would want to determine whether the OS was starved for memory or some other resource.

Since you don't have a crash dump to work with, try checking any historical performance data using 'extract' (if MeasureWare is installed and was running) or 'sar' (if sar is constantly collecting data).
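
If sar data for the period does exist, one way to scan it is to save the text of a CPU report (for example from sar -u run against the collected data file) and pick out the intervals where the system was nearly out of idle time. The Python sketch below does that. The sample text is made up for illustration and is not data from these partitions, and the column layout is assumed to be time, %usr, %sys, %wio, %idle.

```python
# Sketch only: flag sar -u intervals with very little idle CPU.
# SAMPLE is hypothetical output, not data from the affected partitions.
# Assumed columns: time, %usr, %sys, %wio, %idle.

SAMPLE = """\
12:30:00    %usr    %sys    %wio   %idle
12:35:00      12       8       5      75
12:40:00      35      50      13       2
12:45:00      30      55      14       1
Average       26      38      11      26
"""


def busy_intervals(text, idle_below=10):
    """Return (time, usr, sys, wio, idle) rows whose %idle is under the limit."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only timestamped data rows; this skips the Average line.
        if len(fields) != 5 or ":" not in fields[0]:
            continue
        try:
            usr, sys_pct, wio, idle = (int(f) for f in fields[1:])
        except ValueError:
            continue  # header row with column names
        if idle < idle_below:
            hits.append((fields[0], usr, sys_pct, wio, idle))
    return hits


for row in busy_intervals(SAMPLE):
    print(row)
```

The same filtering idea should carry over to other sar reports, with the column names adjusted to match.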

David
siva0123
Trusted Contributor

Re: superdome partitions down

David,

At that time, dmesg on one of the partitions was reporting /tmp as full.

I will look into memory and other resource issues.
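
For keeping an eye on /tmp filling up again, here is a minimal Python sketch, assuming Python 3 is available on the partition (bdf /tmp gives the same information interactively). The threshold and the list of paths are arbitrary choices for illustration.

```python
# Sketch only: report filesystems at or above a usage threshold.
# Assumes Python 3.3+ is installed; paths and threshold are examples.
import shutil


def usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total if total else 0.0


def nearly_full(paths, threshold=90.0):
    """Return (path, percent_used) for each path at or above the threshold."""
    results = []
    for path in paths:
        percent = usage_percent(path)
        if percent >= threshold:
            results.append((path, round(percent, 1)))
    return results


if __name__ == "__main__":
    for path, percent in nearly_full(["/tmp"]):
        print(f"{path} is {percent}% full")
```

Run from cron, something like this could log a warning before /tmp fills completely.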

Thanks for guiding me.


Thanks,
Siva