DMA Read Time-out / System crash with ES80

mkrauss · ‎01-22-2012

Hi !

We are facing a strange problem with an ES80 running OpenVMS 7.3-2. The system consists of 3 CPU drawers. Each drawer holds two CPUs, 4 GB RAM and some I/O cards:

Drawer 0: FGA: KGPSA-** (Emulex LP9802)
          EWA: DEGXA-SB
          GHA: ATI Radeon 7500

Drawer 1: FGB: KGPSA-** (Emulex LP9802)
          FWA: FDDI PDQ

Drawer 2: EWB: DEGXA-SB

Please see attachment for exact system configuration.

The system runs fine as the standby machine. About 15 to 45 minutes after becoming the master system and taking load (Oracle database + application), it crashes. Analyzing the binary errlog.sys with HP SEA 5.5 gives the following error: "The DMA Read Time-out has Occurred"

The recommended action is:

"Each IO7 South Port has a timer value assigned to control DMA Read
transaction(s). The IO7 will attempt to complete the transaction
until the timer expires. In this case, the timer has expired;
the transaction is discarded and a uncorrectable error condition
is signaled.
The reported condition below indicates which south port initiated the
DMA transaction, however since this was a DMA Read, the problem is
most likely a RBOX mesh or a memory problem. Check the RBOX registers
for error conditions.
The other cause for this condition is; the IO Adapter initiated the DMA
Read, then disconnected from the PCI Bus, and never returned for the data.
Reported Condition: No Bus Master but,
South Port 1 initiated an UnCorrectable Interrupt.
Cabinet 0 - 2P - Drawer 1 "

We have swapped drawer 1 (not the CPU/MEM module) and the FWA FDDI adapter, but the problem persists.

What I am trying to understand is the root cause of the crash.

Since SEA states that South Port 1 initiated the uncorrectable error during a DMA transfer, my understanding is that a device initiated a DMA transfer but never returned to the bus to complete it.

Does the message point to a device located in Drawer 1 or in Drawer 2?

More crash information can be found in 20111220_show-crash.pdf.

Is it possible that this is a software problem?

Any help would be greatly appreciated.

BR,

Michael

Volker Halle · ‎01-27-2012

Michael,

CPU 02, which is declaring the Machine Check, is in Drawer 1 (if you start counting with Drawer 0). The WEBES/SEA diagnosis implicates South Port 1:

Detect_SP[47:40] x2 South Port 1 Initiated Interrupt
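As a rough illustration of how a field such as Detect_SP[47:40] can be read, the sketch below extracts bits 47..40 from a 64-bit register value and lists the bit positions that are set. The register value is hypothetical, and the one-bit-per-south-port interpretation is an assumption that merely matches the "x2 = South Port 1" pairing shown by SEA; it is not taken from IO7 documentation.

```python
def extract_field(reg, hi, lo):
    """Return bits hi..lo (inclusive) of an integer register value."""
    width = hi - lo + 1
    return (reg >> lo) & ((1 << width) - 1)


def set_bits(value):
    """Return the positions of all bits that are set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Hypothetical register content: only the Detect_SP field is populated.
io7_error_reg = 0x2 << 40

detect_sp = extract_field(io7_error_reg, 47, 40)
# ASSUMPTION: bit n of the field corresponds to South Port n.
print(f"Detect_SP = x{detect_sp:X}, ports flagged: {set_bits(detect_sp)}")
```

With the value x2, only bit 1 is set, which is consistent with SEA naming South Port 1.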

The errlog entry even includes the PCIX Bus Configuration entry, which contains: 0x000F1011

A quick search in SYS$SYSTEM:SYS$CONFIG.DAT identifies this device as: "DEFPA (FDDI)"
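The value 0x000F1011 can be read as a standard PCI ID longword: the vendor ID sits in the low 16 bits and the device ID in the high 16 bits. A minimal sketch of that split follows; the function name is made up for illustration, and the exact format of SYS$CONFIG.DAT is not reproduced here.

```python
def split_pci_id(config_dword):
    """Split the first PCI configuration longword into (vendor, device)."""
    vendor_id = config_dword & 0xFFFF          # bits 15..0
    device_id = (config_dword >> 16) & 0xFFFF  # bits 31..16
    return vendor_id, device_id


vendor, device = split_pci_id(0x000F1011)
print(f"vendor = 0x{vendor:04X}, device = 0x{device:04X}")
```

Vendor ID 0x1011 is the PCI vendor ID of Digital Equipment Corporation, so the entry points at a Digital device with ID 0x000F, which matches the DEFPA lookup above.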

Interesting case, isn't it?

Volker.

Volker Halle · ‎02-03-2012

Michael,

The 3X-DEFPA-MC keeps logging hardware errors in the same way, even when the system is in standby mode and even after moving the DEFPA from drawer 1 to drawer 2:

SDA> SHOW LAN/FULL/DEV=FWA shows:

...

Fatal error count                  5    Last error CSR              80000400
Fatal error code        9-HardwreErr    Last fatal error      3-FEB 07:40:02
Prev error code         9-HardwreErr    Prev fatal error      2-FEB 13:19:48

...

Next step is to swap the DEFPA with a 'new' one...

Volker.

Andy Bustamante · ‎02-06-2012

If memory serves, there's a configuration guideline that says not to put a KGPSA on the same bus as a graphics card. This was documented with the KGPSA information. This link http://h30499.www3.hp.com/t5/Hardware/KGPSA-BC-and-S3Trio-Interaction/m-p/3488487#M1189 points out that systems with less than 4 GB of memory may run without issues, but this would be an unsupported configuration.

Make sure your firmware is current and check current configuration guidelines. Reconfigure the system.

If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net

Volker Halle · ‎02-09-2012

Michael,

After putting a 'new' DEFPA into the original position in drawer 1, PCI bus 1, the same crash happened again. It was again preceded by hardware errors logged on the DEFPA; it took 15 of those 9-HardwreErr events until the system finally crashed.

Andy,

Thanks for the feedback, but the graphics adapter is not an S3 Trio, nor is the ATI Radeon 7500 on the same PCI bus as the KGPSA. Even if the same problem applied to the DEFPA, note that the DEFPA is on its own PCI bus in drawer 1 and has also been on another PCI bus in drawer 2, which contains no KGPSA at all.

Volker.

Volker Halle · ‎02-10-2012

Analysis of the most recent MACHINECHK crash dump showed that a couple of FWA Receive Errors (from SDA> LAN TRACE/DEV=FWA) preceded the DEFPA reset, which in turn preceded the crash.

The other sites with similar/same configs running without problems seem to have no errors in the FDDI ring at all. The site with the MACHINECHK crashes on one of the 2 identical systems sees lots of errors in the FDDI ring.

So there might be a generic problem in the DEFPA that is triggered by the more frequent resets caused by the high error rate in the FDDI ring (high numbers of MAC errors and Receive Errors).

Update 19-FEB-2012: further analysis has shown that the Link Error Estimate value of the FDDI ring at this site is quite low: 9 on the system with the highly reproducible MACHINECHK crashes and 10-12 on the other system at this site. Data from another site with a comparable configuration shows 15. These values indicate that the FDDI error rate at the problematic system is about 1,000,000 times higher than at the other (good) site. It has also been noted that those sites have FDDI concentrators from different manufacturers. The 'good' site seems to have an FDDI concentrator made by Digital.

The Link Error Estimate value X specifies the error rate as 10^-X errors per second.
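The factor of 1,000,000 follows directly from the exponent difference between the two Link Error Estimate values (15 - 9 = 6). A small sketch of that arithmetic, using the 10^-X definition given above; the function names are made up for illustration.

```python
def error_rate(lee):
    """Error rate implied by a Link Error Estimate value, as 10^-lee."""
    return 10.0 ** (-lee)


def times_higher(lee_bad, lee_good):
    """How many times higher the rate at lee_bad is than at lee_good."""
    return 10 ** (lee_good - lee_bad)


# Values reported in the thread: 9 and 10-12 at the problem site, 15 elsewhere.
for lee in (9, 10, 12):
    factor = times_higher(lee, 15)
    print(f"LEE {lee}: {factor:,} times the error rate seen at LEE 15")
```

Because the scale is logarithmic, even the 'better' system at the problem site (10-12) sees an error rate 1,000 to 100,000 times that of the good site.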

Volker.

Volker Halle · ‎02-20-2012

While researching this problem and trying to determine the source of the many CRC errors on the FDDI ring seen on this node, a day-zero bug in [LAN]PDQ.MAR was found that causes a wrong 'source address' to be displayed:

Last CRC srcadr AA-03-00-00-00-08

The address shown is really a byte string inside the LLC data of the FDDI frame and NOT the FDDI$G_SA source address of the frame. When reporting CRC errors, a register was incorrectly being incremented by 7 (PDQ_C_RCV_HDR_SIZE) TWICE.
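The effect of the double increment can be reproduced on a made-up frame. The sketch below is not the driver code: it assumes, purely for illustration, that the pointer starts at the frame control byte, so that one step of 7 bytes (frame control plus 6-byte destination address) lands on the source address. The addresses and payload are hypothetical; the LLC/SNAP header is the usual AA-AA-03-00-00-00 prefix followed by EtherType 08-00 (IP).

```python
HDR_STEP = 7  # value quoted in the thread for PDQ_C_RCV_HDR_SIZE


def fmt(addr):
    """Format bytes as a hyphen-separated hex address."""
    return "-".join(f"{b:02X}" for b in addr)


def address_at(frame, offset):
    """Return the 6 bytes starting at offset, read as an address."""
    return frame[offset:offset + 6]


# Hypothetical frame: FC, destination, source, LLC/SNAP header, payload.
frame_control = bytes([0x50])
dest_addr = bytes.fromhex("08002B111111")  # made-up address
src_addr = bytes.fromhex("08002B222222")   # made-up address
llc_snap = bytes.fromhex("AAAA03000000" "0800")
frame = frame_control + dest_addr + src_addr + llc_snap + b"payload"

correct = address_at(frame, HDR_STEP)    # pointer advanced once
buggy = address_at(frame, 2 * HDR_STEP)  # pointer advanced twice

print("advanced once :", fmt(correct))  # 08-00-2B-22-22-22
print("advanced twice:", fmt(buggy))    # AA-03-00-00-00-08
```

Under these assumptions the second increment skips the source address plus one LLC byte, and the six bytes read there are exactly the AA-03-00-00-00-08 pattern reported as 'Last CRC srcadr'.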

If required, this bug can be worked around with PATCH/ABSOLUTE. The problem has been reported to OpenVMS LAN engineering.

Volker.
