
Infiniband Part 2 !

 
KevB_1
Advisor

Thanks to Tim I now have ib interfaces on my blade servers

If I start a subnet manager on one of them I can ping them both, so happy with that part!

My issue is that I have to connect them into another switch in an exadata rack.

So from the HP BLc 4X QDR IB Switch in the enclosure I have cables connected to the switch in the Exadata.

These do not show an active link at either end

Do I have to enable something on the switch for it to become active ?

This is the first time dealing with IB and also blade enclosures so sorry if this is a noddy question !
9 REPLIES
Tim Nelson
Honored Contributor

Re: Infiniband Part 2 !

If I am reading this right...sorry if I am shooting blind..

My IB connections did not show a ready/active link until I loaded the drivers and configured the interfaces.

e.g. whether I looked at the lights themselves or at the switch port status, there were no link lights until I configured the HCA from the OS.

You can review the HCA status either with the ibstat diags or with cat /sys/class/infiniband/mlx4_0/ports/*/state
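
If you want to check every port in one go rather than cat each file, here is a rough Python sketch of the same sysfs check. It assumes the state files hold a line of the form "4: ACTIVE" (state number, colon, state name), which is what I would expect from the mlx4 driver on Linux; the function names are just ones I made up for illustration.

```python
from pathlib import Path


def parse_port_state(text):
    """Split a sysfs state line such as '4: ACTIVE' into (4, 'ACTIVE')."""
    number, _, name = text.strip().partition(":")
    return int(number), name.strip()


def read_port_states(base="/sys/class/infiniband"):
    """Return {(hca, port): (code, name)} for every port found under base.

    Assumes the layout <base>/<hca>/ports/<port>/state, as in the
    cat command above. An empty dict means no HCA ports were found.
    """
    states = {}
    for state_file in sorted(Path(base).glob("*/ports/*/state")):
        hca = state_file.parts[-4]   # e.g. 'mlx4_0'
        port = state_file.parts[-2]  # e.g. '1'
        states[(hca, port)] = parse_port_state(state_file.read_text())
    return states
```

Calling read_port_states() on a blade and looking for any port whose state name is not ACTIVE should tell you the same thing as the link lights.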



KevB_1
Advisor

Re: Infiniband Part 2 !

Tim

Sorry I haven't got back sooner!

Found issue was that when someone built the hardware for me they decided to put the cables in the switch upside down!

Ok so next problem !!

When I start up IB on the blade servers it causes the subnet manager on the switch to die with a mem segfault, but I can run the subnet manager on one of the servers and it is all OK.

Would ideally like to sort this out

Problem started when I moved cables from the blade switch to other switches in the exadata for resilience

I have powered off/on all the switches involved

Now it is getting annoying !

Tim Nelson
Honored Contributor

Re: Infiniband Part 2 !

I do not have the answer for you but just some thoughts.

I was initially testing with a server as the subnet manager. I then realized that if the server went down the whole IB network would stop.. not sure why anyone would want that.

So I have my switches act as the subnet manager (voltaire 2046s) and do not let the server(s) start a subnet manager. Disable it with chkconfig opensm (I think).

I would not think that starting an SM on a server connected to switches that already have a master / slave SM running would affect them, but you never know.

these are just thoughts.. and may only lead you to a solution..


rick jones
Honored Contributor

Re: Infiniband Part 2 !

"When I start up the IB on blade servers it causes the subnet manager to die on the switch with a mem segfault - but can run the subnet manager on one of the servers and it is all ok ?"

Is this the switch in the Exadata rack, or the HP switch?

Regardless, perhaps a tangent, but in my opinion, if indeed the subnet manager on the switch dies with a mem segfault, you should go ahead and exercise your support contract and get a defect filed. Of course, the first question out of the support folks will probably be to ask if the switch is running the latest bits...
there is no rest for the wicked yet the virtuous have no pillows
KevB_1
Advisor

Re: Infiniband Part 2 !

Tim

I don't start the SM on the server. As soon as IB runs on the blade servers it kills the SM on all 3 of the exadata switches, so IB goes down and I have to run the SM up on a server to get the fabric up.

If I take the IB down on the 2 blade servers the SM on the switches springs to life.

So far I have tried 3 versions of IB on the blades and all of them cause the issue, so at least it is consistent.

Rick

Don't think it is an issue with exadata, unfortunately, as it works OK when the blades are not running IB.


Problem seems to have happened after I moved 2 of the blade cables from one of the exadata switches to put 1 in each of the other 2 switches in the exadata.

Is there some sort of arp/routing table that needs resetting on the switch ?
rick jones
Honored Contributor

Re: Infiniband Part 2 !

I don't know much about the switches here, but on first principles, a bit of software in one device (eg switch) should not segfault if another device is connected. That sounds like a recipe for a denial of service.

That said, I could see where say dueling subnet managers might cause one or another to decide to take themselves offline, but I see that as being very different from going down with a segfault.
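
To illustrate what I mean by dueling subnet managers sorting themselves out, here is a toy Python sketch of a master election. It assumes the usual rule as I understand it (highest SM priority wins, lowest GUID breaks a tie); the class, the names and the GUID values are all made up for illustration and are not taken from either switch.

```python
from typing import NamedTuple


class SubnetManager(NamedTuple):
    name: str      # hypothetical label, for readability only
    priority: int  # SM priority, higher is preferred
    guid: int      # port GUID, lower wins a priority tie


def elect_master(candidates):
    """Pick the SM that should be master under the assumed rule.

    Sorting key: highest priority first, then lowest GUID.
    The remaining SMs would be expected to stand by, not crash.
    """
    return min(candidates, key=lambda sm: (-sm.priority, sm.guid))
```

The losers of an election like this step back to standby, which is an orderly outcome and nothing like a segfault.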
Tim Nelson
Honored Contributor

Re: Infiniband Part 2 !

I don't have enough experience on these to come up with any other ideas.. sorry..

maybe check out the doc on the switches and exadata config ?

KevB_1
Advisor

Re: Infiniband Part 2 !

Just to close this.

Found this util on one of the exadata servers: /opt/oracle.SupportTools/ibdiagtools/verify-topology

When I ran this it showed an error that I had 2 external switches connected together, i.e. the blade switch and the spine switch in the exadata.

I disconnected this connection and hey presto, the SM on the switches is now working!!
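
For anyone hitting the same thing, the kind of check that caught it can be sketched in a few lines of Python. This is only my own illustration of the idea and not how verify-topology actually works; the switch names, the role labels and the rule that a blade switch must not link to the spine are assumptions based on what the tool reported here.

```python
def find_disallowed_links(links, roles, disallowed):
    """Return the links whose pair of switch roles is not permitted.

    links      -- iterable of (switch_a, switch_b) name pairs
    roles      -- dict mapping switch name to a role label
    disallowed -- set of frozensets, each holding the role pair to flag
    """
    bad = []
    for a, b in links:
        pair = frozenset((roles[a], roles[b]))
        if pair in disallowed:
            bad.append((a, b))
    return bad


# Hypothetical fabric: one blade switch cabled to two leaves and the spine.
roles = {
    "blade-sw": "blade",
    "exa-leaf1": "leaf",
    "exa-leaf2": "leaf",
    "exa-spine": "spine",
}
links = [
    ("blade-sw", "exa-leaf1"),
    ("blade-sw", "exa-leaf2"),
    ("blade-sw", "exa-spine"),
]
rule = {frozenset(("blade", "spine"))}
flagged = find_disallowed_links(links, roles, rule)
```

With the made-up cabling above, flagged holds only the blade-sw to exa-spine link, which is the one I ended up pulling.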

Thanks for all your input
rick jones
Honored Contributor

Re: Infiniband Part 2 !

You mean there was a loop?