ProLiant Servers (ML,DL,SL)

Sanatorium
Occasional Visitor

ML370 G6 Memory Configuration Error external graphic card

Hello,

In connection with deep learning research, my company, which collaborates with my university, has several ML370 G6 / DL370 G6 servers that were set aside for this purpose. For computation we use several NVIDIA Quadro 4000, NVIDIA Quadro 6000 and NVIDIA Tesla P100 cards. As there is no room left in the chassis, we purchased x1-to-x16 riser adapters to attach 5 external graphics cards. We wanted to use all 9 PCI Express slots, but every time the sixth card is inserted, the server won't POST and the Insight Display shows a memory error (all memory LEDs amber). iLO 2 logs "memory configuration error. System boot halted".

We tried different permutations on the server configuration without success:

- one processor with 2R 2 GB RDIMM / 3x 2R 2 GB RDIMM / 2x 2R 4 GB RDIMM, starting with the white slots in order A B C ...

- disabled virtual and embedded serial ports to free IRQs

- forced PCIe Gen 1.0 and then PCIe Gen 2.0

Does the ML370 G6 only support 6 PCIe devices? We updated to the latest System ROM (2015). Is it possible to get a more detailed error log through iLO 2 than just "memory configuration error" (there are no beep codes)?

someguy123
New Member

Re: ML370 G6 Memory Configuration Error external graphic card

I have the exact same problem on a DL370 G6 (same as the ML370 G6). When it first happened, all the DIMM LEDs lit up; it turned out the first stick on one of the processors had gone bad. Ever since, I've been running only 1 processor with 3 sticks (one in each white DIMM slot). It took a while to figure out, but as I added the cards one by one to the risers, the 6th one caused issues. I replaced the riser, tried different cards, and disabled all unused hardware in the BIOS (thinking there was some IRQ conflict). Basically, once it happens it stays dead until I clear the NVRAM.

Problem goes like this:

6th GPU added > server won't POST, no splash screen, nothing.

1st reset > fans go to full speed and stay that way forever.

I'm still stumped. Do you know what iLO shows on your machine? I haven't bothered configuring it. Also, if I run both processors, does that open up more PCIe lanes? I had the same problem with 2 processors. Apparently the flashing red light I also get indicates a power supply issue, but this is never a problem when running 5 cards (I'm running an external PSU for the other 3).

I might test 6 PCI controllers to see whether it's PCI-specific rather than GPU-specific. Why have 9 slots if using them all causes issues, I wonder?

Sanatorium
Occasional Visitor

Re: ML370 G6 Memory Configuration Error external graphic card

Well, we also have some ML350 G6 servers, which seem to show the same error. As we thought it could be a shortage of memory address space for the PCIe devices, we also added 32 GB of RAM, with no success.

Moreover, we added a PCIe switch, which 'masks' the connected video cards so that the BIOS only sees the switch as a PCI device. Unfortunately, that did not help at all. (We have some different brands of PCIe multiplexers, which we want to test soon.)

Older ML370 G5 servers are able to boot with 6 PCI Express video devices but fail with 7. Out of curiosity we added an old R9 280X consumer graphics card as the 7th video device and it booted correctly. However, any other consumer card, such as an NVIDIA GTX 970 or GTX 1060, failed as a 7th card. If you look up details about the FASTRA II, which was also built for GPGPU, its builders ran into difficulties with the BIOS, which had to be altered to support more than 6 GPUs. And yes, it is tedious that you have to clear the NVRAM after this memory error happens. There is a site that mentions the DL370 G6 with up to 5 video / PCIe devices: https://documentation.vizrt.com/viz-engine-guide/3.7/hp_dl370_installations.html

But that's it. It would be really interesting to know whether this error is only graphics-card related and any other combination of RAID / Ethernet controllers works fine with up to 9 devices. In that case I would assume it has something to do with the memory address allocation for the video cards' own GDDR5 RAM.
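
One way to probe the address-allocation hypothesis on a configuration that still boots (for example with 5 cards) is to add up the memory regions (BARs) that each PCI device requests. The following is only a minimal sketch, assuming a Linux host that exposes the standard sysfs `resource` files under `/sys/bus/pci/devices`; the function names are made up for illustration, and the result only shows what the devices ask for, not how the ProLiant BIOS actually places those regions.

```python
from pathlib import Path

# Flag bits as defined in the Linux kernel header include/linux/ioport.h
IORESOURCE_MEM = 0x00000200     # region is memory-mapped (not I/O ports)
IORESOURCE_MEM_64 = 0x00100000  # BAR is 64-bit capable


def parse_resource(text):
    """Return [(size_in_bytes, flags), ...] for the memory BARs in a sysfs
    'resource' file. Each line holds 'start end flags' in hex. Only the first
    six lines are read, assuming they are the standard BARs 0-5."""
    bars = []
    for line in text.splitlines()[:6]:
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(f, 16) for f in fields)
        if not flags & IORESOURCE_MEM or end < start:
            continue  # unused entry or I/O port region
        bars.append((end - start + 1, flags))
    return bars


def mmio_32bit_only(bars):
    """Sum the memory BARs that are not 64-bit capable. These can only be
    mapped below 4 GiB, where address space is scarce."""
    return sum(size for size, flags in bars if not flags & IORESOURCE_MEM_64)


def report(root="/sys/bus/pci/devices"):
    """Print the requested MMIO size per device and the overall totals."""
    total = total32 = 0
    for dev in sorted(Path(root).glob("*")):
        try:
            bars = parse_resource((dev / "resource").read_text())
        except OSError:
            continue
        if not bars:
            continue
        size = sum(s for s, _ in bars)
        total += size
        total32 += mmio_32bit_only(bars)
        print(f"{dev.name}: {size / 2**20:.1f} MiB in {len(bars)} memory BAR(s)")
    print(f"total: {total / 2**20:.1f} MiB, 32-bit only: {total32 / 2**20:.1f} MiB")


if __name__ == "__main__":
    report()
```

Comparing the totals for 4 and 5 cards would give a rough per-card figure, which can then be extrapolated to the 6th card. Whether the firmware places 64-bit capable BARs above 4 GiB is a separate question that this sketch cannot answer.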

Sanatorium
Occasional Visitor

Re: ML370 G6 Memory Configuration Error external graphic card

P.S.: iLO 2 shows only "memory configuration error", but without any of the beep codes that occur when there is really something wrong with the RAM.

I don't think an extra processor is needed to get additional lanes, as I think both processors first have to communicate with the Intel 5520 chipset. I could be wrong, but in the documentation I don't see a direct PCIe connection to the processors, in contrast to the latest generation of Intel processors, which have one.

Regards