06-10-2005 04:08 AM
Unravelling linkloop frame loss
After some recent incidents of total failure of a few production clusters, owing to obviously not-really-redundant network topologies, I demanded that our network tanglers furnish me with a dedicated heartbeat LAN.
I could successfully verify a decent link, both through test frames from linkloop and by binding an IP address and making TCP/IP connections between two of the cluster nodes.
However, despite following the same procedure and having clearly identified the correct port of the quad-port NIC (n.b. the left card seems to be plugged into the slot upside down, which makes for a somewhat queer counting sequence), I keep failing to get linkloop frames returned from the third cluster node.
The network admins are doing some arcane VLANing on their switches, which is where I suspect a misconfiguration.
But they insist that their part is all set and done, and the lit LED on the NIC's port also suggests a valid link.
We tried both fixed mode settings, as well as autoneg on both link partners.
Finally I set it back to
# lanadmin -X auto_on 1
and it agreed upon
# lanadmin -x 1
Current Speed = 100 Full-Duplex Auto-Negotiation-ON
I know for sure that I have identified the correct (card) port, because when we changed switch ports I got this message from the btlan driver
# grep btlan /var/adm/syslog/syslog.log|tail -1
Jun 10 16:24:48 neptun vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/5/0/0/4/0....
which maps to lan1
# lanscan|grep lan1
0/5/0/0/4/0 0x00306E04F560 1 UP lan1 snap1 2 ETHER Yes 119
I tried to send linkloop frames to both of the cluster nodes that can already talk TCP/IP to one another over the dedicated heartbeat LAN.
Because I parsed the MACs from these nodes' lanscan output rather than typing them (always a peril for typos otherwise), I'm also quite confident that I addressed the correct remote NIC ports.
# jup_lan10=$(remsh jupiter /usr/sbin/lanscan|awk '/lan10/{print$2}')
# sat_lan10=$(remsh saturn /usr/sbin/lanscan|awk '/lan10/{print$2}')
(they happened to both have lan10 available,
unlike lan1 on this disconnected 3rd node)
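Editor's sketch of the MAC lookup above, not part of the original post. It assumes lanscan prints the station address in field 2 and the interface name in field 5, as in the lan1 line quoted earlier in this thread. The helper name `mac_of` is hypothetical. A bare pattern such as `/lan10/` matches any line that contains that string (a `lan100` line as well, if one existed), so comparing the interface field exactly is a little stricter:

```shell
# Hypothetical helper: print the station address for one interface name.
# Reads lanscan-style lines on stdin; assumes field 5 is the interface
# name and field 2 is the MAC address.
mac_of() {
    awk -v ifname="$1" '$5 == ifname { print $2 }'
}

# Sample line copied from the lanscan output quoted in this thread.
sample='0/5/0/0/4/0 0x00306E04F560 1 UP lan1 snap1 2 ETHER Yes 119'

printf '%s\n' "$sample" | mac_of lan1
# An exact comparison does not confuse lan1 with lan10, so this prints nothing.
printf '%s\n' "$sample" | mac_of lan10
```

On the real nodes the same function could be fed from `remsh jupiter /usr/sbin/lanscan | mac_of lan10`, provided the column layout there matches the assumption.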
# linkloop -n 2 -t 5 -i 1 $jup_lan10 $sat_lan10
Link connectivity to LAN station: 0x001083185998
error: get_msg2 getmsg failed, errno = 4
error: get_msg2 getmsg failed, errno = 4
-- FAILED
frames sent : 2
frames received correctly : 0
reads that timed out : 2
Link connectivity to LAN station: 0x001083187836
error: get_msg2 getmsg failed, errno = 4
error: get_msg2 getmsg failed, errno = 4
-- FAILED
frames sent : 2
frames received correctly : 0
reads that timed out : 2
These errors look definitely different from the ones I get when I pick a random (wrong) NIC port,
e.g.
# linkloop -i 2 $jup_lan10
Link connectivity to LAN station: 0x001083185998
error: expected primitive 0x30, got DL_ERROR_ACK
dl_error_primitive = 0x2d
dl_errno = 0x04
dl_unix_errno = 57
error - did not receive data part of message
Although I didn't even have a layer 2 link yet, I bound an IP address to the NIC, and strangely netstat reports in- and outflow for this NIC
# netstat -I lan1
Name Mtu Network Address Ipkts Opkts
lan1 1500 192.168.20.0 neptun 4 17
So at least some bits went over the wire.
I even tried setting a temporary ARP entry
(however daft this may seem) for the IP address of either saturn (192.168.20.2) or jupiter (192.168.20.1).
But this was of course in vain.
Where could I get further explanation on the error return code 0x04?
Are there other tools to verify the link?
Is there other proof I could obtain to "pass the blame"?
Many thanks
Ralph
06-10-2005 04:29 AM
Re: Unravelling linkloop frame loss
I have two suggestions:
1) Open a response center call and they will help you create documentation that proves it. If there is something wrong with the configuration, they can help document it.
2) Use tcpdump or ethereal to take a network dump of what's going on when you get these errors. That will show your network admins how their configuration is responding to your valid traffic. Or it will give you something to work on on the sysadmin side.
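Editor's sketch for suggestion 2, not part of the original reply. tcpdump's `ether host` filter expects a colon-separated MAC, while lanscan prints addresses as `0x` followed by twelve hex digits. The helper name `to_colon` is hypothetical, and the sketch assumes that tcpdump is installed on the node and accepts `lan1` as the interface name. The capture command is only printed, because running it needs root and a live interface:

```shell
# Hypothetical helper: convert lanscan's 0xHHHHHHHHHHHH notation into
# colon-separated octets as used by tcpdump's "ether host" filter.
to_colon() {
    printf '%s\n' "$1" | sed -e 's/^0x//' -e 's/../&:/g' -e 's/:$//'
}

# jupiter's lan10 station address as reported by linkloop in this thread.
peer=$(to_colon 0x001083185998)

# -e prints the link-level header of each frame, -i selects the interface.
echo "tcpdump -e -i lan1 ether host $peer"
```

Running the printed command on neptun while linkloop is sending would show whether any frames from that peer reach lan1 at all.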
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
06-10-2005 04:33 AM
Re: Unravelling linkloop frame loss
How close together are the two nodes? I'd be tempted to run an x-over cable between them to 'prove' link level connectivity.
I'm no VLAN expert, but I thought there was a certain type of VLAN tagging which tagged frames based on the source IP address in an IP header... not sure what one of these types of VLAN would do to frames that had *no* IP address (such as a link level packet).
You don't explicitly state it, but I'm assuming that once an IP address was bound to the interfaces you tried to just use ping?
HTH
Duncan
I am an HPE Employee

06-10-2005 04:37 AM
Re: Unravelling linkloop frame loss
HTH
Duncan
I am an HPE Employee

06-12-2005 02:03 AM
Re: Unravelling linkloop frame loss
Many thanks for your support.
Two of the 3 cluster nodes share a rack in one room of our bunker while the third one,
the problematic one, stands in another room
some 50m apart (I hope the wire as laid is still below 100m).
To make sure that I had selected the right NIC port, and that neither the wire up to the mystery VLAN switch nor the server's NIC itself is damaged, I today plugged my laptop p2p into the end of the wire before the switch where all 3 nodes' wires meet, and assigned it an appropriate IP.
I could then log in from my laptop on the problematic node.
What puzzles me a bit is that I didn't use a separate X-over cable, so I gather that some automagic crossing must have taken place.
So it looks as if yet another "intelligent" network component must have been in between.
Could this be a possible candidate?
However, although I knew for sure that it wouldn't work, I gave cmcheckconf a try with my new HB LAN.
Here is where it hiccupped and denied further service:
Checking cluster file: /etc/cmcluster/clusterconf.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Gathering configuration information ... Done
Gathering configuration information ... Done
Gathering configuration information ............. Done
Error: Detected a partition of IP subnet 192.168.20.0.
Partition 1
jupiter lan10
saturn lan10
Partition 2
neptun lan1
cmcheckconf : Unable to reconcile configuration file /etc/cmcluster/clusterconf.ascii
with discovered configuration information.
06-12-2005 02:07 AM
Re: Unravelling linkloop frame loss
I filed a support case with HP as you suggested, and asked them to have a look at this thread.
Let's see if they have something up their sleeve...