01-22-2012 06:22 PM - edited 01-23-2012 03:22 AM
DMA Read Time-out / System crash with ES80
Hi !
We are facing a strange problem with an ES80 running OpenVMS 7.3-2. The system consists of 3 CPU drawers. Each drawer holds two CPUs, 4 GB of RAM and some I/O cards:
0: FGA: KGPSA-** (Emulex LP9802)
   EWA: DEGXA-SB
   GHA: ATI Radeon 7500
1: FGB: KGPSA-** (Emulex LP9802)
   FWA: FDDI PDQ
2: EWB: DEGXA-SB
Please see attachment for exact system configuration.
The system runs fine as the standby machine. About 15 to 45 minutes after becoming the master system and taking load (Oracle database + application), it crashes. Analyzing the binary errlog.sys with HP SEA 5.5 gives the following error: "The DMA Read Time-out has Occurred "
The recommended action is:
"Each IO7 South Port has a timer value assigned to control DMA Read
transaction(s). The IO7 will attempt to complete the transaction
until the timer expires. In this case, the timer has expired;
the transaction is discarded and a uncorrectable error condition
is signaled.
The reported condition below indicates which south port initiated the
DMA transaction, however since this was a DMA Read, the problem is
most likely a RBOX mesh or a memory problem. Check the RBOX registers
for error conditions.
The other cause for this condition is; the IO Adapter initiated the DMA
Read, then disconnected from the PCI Bus, and never returned for the data.
Reported Condition: No Bus Master but,
South Port 1 initiated an UnCorrectable Interrupt.
Cabinet 0 - 2P - Drawer 1 "
We have swapped drawer 1 (but not the CPU/MEM module) and the FWA FDDI adapter, but the problem persists.
What I am trying to understand is the root cause of the crash.
Since SEA states that South Port 1 initiated the uncorrectable error during a DMA transfer, my understanding is that a device initiates a DMA transfer but never gets back to the bus to complete it.
Does the message point to a device located in Drawer 1 or in Drawer 2 ?
More crash information can be found in 20111220_show-crash.pdf.
Is it possible that this may be a software problem ?
Any Help would be greatly appreciated.
BR,
Michael
01-27-2012 05:40 AM
Re: DMA Read Time-out / System crash with ES80
Michael,
CPU 02, which is declaring the Machine Check, is in Drawer 1 (if you start counting with Drawer 0). The WEBES/SEA diagnosis implicates South Port 1:
Detect_SP[47:40] x2 South Port 1 Initiated Interrupt
The errlog entry even includes the PCI-X bus configuration entry, which contains: 0x000F1011
A quick search in SYS$SYSTEM:SYS$CONFIG.DAT identifies this device as: "DEFPA (FDDI)"
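The value 0x000F1011 is consistent with the standard PCI layout of the first configuration dword, where the low 16 bits hold the vendor ID and the high 16 bits hold the device ID. Vendor ID 0x1011 is the one assigned to Digital Equipment Corporation. A minimal Python sketch of that split follows; the function name is made up for illustration, and how SYS$CONFIG.DAT is actually searched is not shown here.

```python
def split_pci_id(config_dword: int) -> tuple[int, int]:
    """Split the first PCI configuration dword into (vendor_id, device_id).

    Assumes the standard PCI layout: vendor ID in bits 15:0,
    device ID in bits 31:16.
    """
    vendor_id = config_dword & 0xFFFF          # bits 15:0
    device_id = (config_dword >> 16) & 0xFFFF  # bits 31:16
    return vendor_id, device_id


# Value taken from the errlog entry quoted in the post.
vendor_id, device_id = split_pci_id(0x000F1011)
print(f"vendor=0x{vendor_id:04X} device=0x{device_id:04X}")
# prints: vendor=0x1011 device=0x000F
```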
Interesting case, isn't it ?
Volker.
02-03-2012 06:01 AM
Re: DMA Read Time-out / System crash with ES80
Michael,
the 3X-DEFPA-MC is constantly logging hardware errors in the same way, even if the system is in stand-by mode and after moving the DEFPA from drawer 1 into drawer 2:
SDA> SHOW LAN/FULL/DEV=FWA shows:
...
Fatal error count 5 Last error CSR 80000400
Fatal error code 9-HardwreErr Last fatal error 3-FEB 07:40:02
Prev error code 9-HardwreErr Prev fatal error 2-FEB 13:19:48
...
Next step is to swap the DEFPA with a 'new' one...
Volker.
02-06-2012 02:29 PM
Re: DMA Read Time-out / System crash with ES80
If memory serves, there's a configuration guideline that says not to put a KGPSA on the same bus as a graphics card. This was documented with the KGPSA information. This link http://h30499.www3.hp.com/t5/Hardware/KGPSA-BC-and-S3Trio-Interaction/m-p/3488487#M1189 points out that systems with less than 4 GB of memory may run without issues, but this would be an unsupported configuration.
Make sure your firmware is current and check current configuration guidelines. Reconfigure the system.
02-09-2012 12:33 AM - edited 02-09-2012 12:42 AM
Re: DMA Read Time-out / System crash with ES80
Michael,
after putting a 'new' DEFPA into the original position in drawer 1, PCI bus 1, the same crash happened again. It was again preceded by hardware errors logged on the DEFPA; it took 15 of those 9-HardwreErr entries until the system finally crashed.
Andy,
thanks for the feedback, but it's not an S3 Trio graphics adapter, nor is the ATI Radeon 7500 on the same PCI bus as the KGPSA. Even if the same problem applied to the DEFPA, note that the DEFPA is on its own PCI bus in drawer 1 and has also been on another PCI bus in drawer 2, in which there is no KGPSA at all.
Volker.
02-10-2012 04:15 AM - edited 02-19-2012 04:21 AM
Re: DMA Read Time-out / System crash with ES80
Analysis of the most recent MACHINECHK crashdump showed that there were a couple of FWA Receive Errors (from SDA> LAN TRACE/DEV=FWA) preceding the DEFPA reset that preceded the crash.
The other sites with similar or identical configurations, which run without problems, seem to have no errors in the FDDI ring at all. The site with the MACHINECHK crashes on one of its 2 identical systems sees lots of errors in the FDDI ring.
So there might be a generic problem in the DEFPA, which is being triggered by the more frequent resets caused by the high error rate in the FDDI ring (high numbers of MAC errors and Receive Errors).
Update 19-FEB-2012: further analysis has shown that the Link Error Estimate VALUE of the FDDI ring at this site is quite low: actually 9 on the system with the highly reproducible MACHINECHK crashes and 10-12 on the other system at this site. Data from another site with a comparable configuration shows 15. That value indicates that the FDDI error rate at the problematic system is about 1,000,000 times higher than at the other (good) site. It has also been noted that those sites have FDDI concentrators from different manufacturers. The 'good' site seems to have an FDDI concentrator made by Digital.
The Link Error Estimate value X specifies the error rate as 10^-X errors per second.
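The factor of 1,000,000 follows directly from the 10^-X definition: comparing X = 9 with X = 15 gives 10^(15-9). A small Python sketch of that arithmetic, with illustrative function names that are not taken from any OpenVMS tool:

```python
def error_rate(link_error_estimate: int) -> float:
    """Error rate implied by a Link Error Estimate X, taken as 10^-X."""
    return 10.0 ** -link_error_estimate


def rate_ratio(lee_bad: int, lee_good: int) -> int:
    """How many times higher the error rate is at the 'bad' link.

    Assumes lee_good >= lee_bad, so the result is a whole power of ten.
    """
    return 10 ** (lee_good - lee_bad)


# Values from the post: 9 on the crashing system, 15 at the good site.
print(rate_ratio(9, 15))   # prints: 1000000
# The other system at the problem site reported 10 to 12.
print(rate_ratio(10, 15))  # prints: 100000
```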
Volker.
02-20-2012 06:39 AM - edited 02-20-2012 06:42 AM
Re: DMA Read Time-out / System crash with ES80
While researching this problem and trying to determine the source of the many CRC errors on the FDDI ring seen on this node, a day-zero bug in [LAN]PDQ.MAR has been found, causing a wrong 'source address' to be displayed:
Last CRC srcadr AA-03-00-00-00-08
The address shown is really some byte string inside the LLC data of the FDDI frame and NOT the FDDI$G_SA source address of the frame. A register was incorrectly being incremented by 7 (PDQ_C_RCV_HDR_SIZE) TWICE when reporting CRC errors.
If required, this bug can be worked around with PATCH/ABSOLUTE. The problem has been reported to OpenVMS LAN engineering.
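The effect of skipping the receive header twice can be illustrated with a synthetic buffer. This is only a sketch, not the PDQ.MAR code: the buffer layout (a 7-byte receive header followed by a 6-byte destination address, a 6-byte source address and then the LLC data), the addresses and all names below are assumptions made for illustration. Only the header size of 7 comes from the post.

```python
HDR_SIZE = 7   # PDQ_C_RCV_HDR_SIZE, as stated in the post
SA_OFFSET = 6  # hypothetical: source address follows a 6-byte destination address
ADDR_LEN = 6   # length of a MAC address in bytes


def fmt(addr: bytes) -> str:
    """Format an address as hyphen-separated hex bytes."""
    return "-".join(f"{b:02X}" for b in addr)


def source_address(buf: bytes, header_skips: int = 1) -> str:
    """Read the source address after skipping the header 'header_skips' times.

    header_skips=1 models the intended behaviour, header_skips=2 the
    double increment described in the post.
    """
    start = header_skips * HDR_SIZE + SA_OFFSET
    return fmt(buf[start:start + ADDR_LEN])


# Synthetic receive buffer; every value here is made up for the example.
header = bytes(HDR_SIZE)                   # 7 placeholder header bytes
dest = bytes.fromhex("FFFFFFFFFFFF")       # made-up destination address
src = bytes.fromhex("020000000001")        # made-up source address
llc = bytes.fromhex("AAAA030000000800")    # LLC/SNAP header announcing IP
frame = header + dest + src + llc

print(source_address(frame, header_skips=1))  # prints: 02-00-00-00-00-01
print(source_address(frame, header_skips=2))  # prints: AA-03-00-00-00-08
```

With this assumed layout, reading 7 bytes too far lands one byte into the LLC/SNAP header, which happens to reproduce the string quoted in the post.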
Volker.