Re: %SYSTEM-W-DATAOVERUN, data overrun

Lisa Collins · 04-13-2006

Hello,

We are trying to copy data from an Alpha running OpenVMS 7.3-1 to an Itanium running OpenVMS V8.2-1. Because we need an entire directory structure copied, not just the data within one directory, we are using the BACKUP command. We are attempting to do the backup across DECnet and are receiving an error. The save set on the destination node gets created and starts to allocate thousands of blocks. The process is then interrupted with a data overrun. Here are the messages we receive:

NXXXXA>@BCK2NXXB.COM
$ set ver
$ show time
13-APR-2006 09:45:47
$ backup/list=dra5:[000000]bck2nxxb.lst -
dra5:[PMDF...] -
nxxxxb"bkbkbk xxxxxxxx"::$1$DKC2:[000000]nXXa.bck/sav
%BACKUP-F-WRITEERR, error writing NXXXXB"bkbkbk password"::$1$DKC2:[000000]NXXA.BCK;1
-RMS-F-SYS, QIO system service request failed
-SYSTEM-F-LINKABORT, network partner aborted logical link

NXXXXB FAL Log =================================================

NXXXXB>type sys$manager:NET$SERVER.LOG
$ Set NoOn
$ VERIFY = F$VERIFY(F$TRNLNM("SYLOGIN_VERIFY"))

--------------------------------------------------------

Connect request received at 13-APR-2006 09:46:12.88
from remote process IP$10.100.50.18::"0=BKBKBK"
for object "SYS$COMMON:[SYSEXE]FAL.EXE"

--------------------------------------------------------

%SYSTEM-W-DATAOVERUN, data overrun
job terminated at 13-APR-2006 09:52:00.97

Accounting information:
Buffered I/O count: 55932 Peak working set size: 5760
Direct I/O count: 5000 Peak virtual size: 177808
Page faults: 1220 Mounted volumes: 0
Charged CPU time: 0 00:00:02.44 Elapsed time: 0 00:11:06.35

We are copying mailboxes and email, so we will have to shut down email while we move the users' data. We realize we could do a backup on the node where the data sits, copy the save set over to the new node, and then unpack it on the new node. But we are trying to save time by doing the backup and copy in one step, if possible. Any suggestions would be appreciated.

Thanks,
Lisa Collins

Steven Schweda · 04-13-2006

Have you tried?:

SET RMS_DEFAULT /NETWORK_BLOCK_COUNT = bigger_number

and/or

BACKUP /BLOCK_SIZE = smaller_number

> Any suggestions would be appreciated.

Always a risk before you see the suggestions.

Volker Halle · 04-13-2006

Lisa,

For further analysis, consider defining FAL$LOG on the remote node (DEF/SYS FAL$LOG FF) and repeating your BACKUP operation. You'll find a full FAL DAP trace in the remote NET$SERVER.LOG and should be able to find out exactly which operation fails with DATAOVERUN.
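With FAL$LOG defined, NET$SERVER.LOG can get very long, and the useful part is usually the last few DAP messages before FAL generates a status code. The Python sketch below pulls that out. Its regular expressions are based only on the line formats visible in the trace Lisa posted in this thread (for example `DAP status code of 50C8 generated`), so treat them as assumptions that may not match other FAL versions; the function name `find_failing_operation` is hypothetical.

```python
import re

# Assumed line formats, copied from the FAL trace in this thread.
STATUS_RE = re.compile(r"DAP status code of ([0-9A-F]{4}) generated")
MSG_RE = re.compile(r"(--->|<---) (\w+) msg (\d+)")


def find_failing_operation(log_text, context=2):
    """Return (status_hex, last `context` DAP messages before it), or None.

    Each message is (direction, type, size), where size is the number shown
    after "msg" in the trace (it appears to be the message length).
    """
    seen = []
    for line in log_text.splitlines():
        status = STATUS_RE.search(line)
        if status:
            return status.group(1), seen[-context:]
        msg = MSG_RE.search(line)
        if msg:
            direction, msg_type, size = msg.groups()
            seen.append((direction, msg_type, int(size)))
    return None


trace = """\
14:13:11.71 Receive AST delivered 20 bytes
---> DAT msg 4104 - 6A 42 47 44 63 4F 4F 37 52 43 38 31 01 D4 B1 03 10 04 06 08
---> CRC msg 2 - 73AA
DAP status code of 50C8 generated
<--- STS msg 4 - 50 C8 00 09
"""

result = find_failing_operation(trace)
print(result)  # -> ('50C8', [('--->', 'DAT', 4104), ('--->', 'CRC', 2)])
```

On the real log you would read the file contents into `log_text` instead of using the embedded excerpt.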

Note that there was a patch for a similar-sounding problem in VMS732_RMS-V0200. Check the value of the NETWORK BLOCK COUNT on both systems with SHOW RMS.

Volker.

Lisa Collins · 05-03-2006

Volker,

Sorry it has taken so long to get back to you. I did what you suggested, and here is the latter part of my NET$SERVER.LOG:

14:13:11.70 Receive QIO issued
14:13:11.70 Receive AST delivered 4106 bytes
---> DAT msg 4104 - 39 6C 4F 4B 63 42 7A 75 74 70 75 35 01 D4 A1 03 10 04 06 08
---> CRC msg 2 - 6356
14:13:11.70 Receive QIO issued
14:13:11.70 Receive AST delivered 4106 bytes
---> DAT msg 4104 - 66 45 79 4C 41 2F 37 56 4F 50 5A 5A 01 D4 A9 03 10 04 06 08
---> CRC msg 2 - C4D3
14:13:11.71 Receive QIO issued
14:13:11.71 Receive AST delivered 20 bytes
---> DAT msg 4104 - 6A 42 47 44 63 4F 4F 37 52 43 38 31 01 D4 B1 03 10 04 06 08
---> CRC msg 2 - 73AA
DAP status code of 50C8 generated
<--- STS msg 4 - 50 C8 00 09
<--- CRC msg 2 - 278F
14:13:11.71 XMT QIO complete, 6 bytes

Logical link was terminated on 3-MAY-2006 14:13:11.71
Mailbox message type 0035 received

Total connect time for logical link was 0 00:01:48.32
Total CPU time used for connection was 0 00:00:01.97

File Access Statistics for   RECV-Side   XMIT-Side   Composite
--------------------------   ---------   ---------   ---------
# DAP Message QIO Calls          15004           4       15008
# DAP Messages Exchanged         16671           6       16677
# User Records/Blocks            16665           0       16665
# Bytes of User Data          61435904           0    61435904
# Bytes in DAP Layer          61590203          88    61590291
User Data Throughput (bps)           0           0           0
DAP Layer Throughput (bps)           0           0           0
Average Record/Block Size            0           0           0
% User Data in DAP Layer          0.0%        0.0%        0.0%
--------------------------   ---------   ---------   ---------

Negotiated DAP buffer size = 4156 bytes
Buffered I/O count during connection = 31235
Direct I/O count during connection = 9636
Peak working set size for process = 5648 pages

Successful Start Transaction Branch = 0
Start Transaction Branch loops = 0

Total RECV_WAIT = 969 and XMIT_WAIT not kept
Total READ_WAIT not kept and WRIT_WAIT = 1067
Defered AST PUT's = 230, Lost AST logging messages = 1
COUNTER1 = 0 and COUNTER2 = 0
COUNTER3 = 0 and COUNTER4 = 0

FAL terminated execution on 3-MAY-2006 14:13:11.72
========================================================
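The derived rows of the FAL statistics (throughput, average record size, % user data) all came out as zero in this log, but they can be recomputed roughly from the raw RECV-Side counters and the total connect time. The Python sketch below does that arithmetic. It assumes the counters mean what their labels say, and because it divides by the whole connect time (including link setup), the throughput figure is only a lower bound on the actual transfer rate.

```python
# Raw RECV-Side counters copied from the FAL statistics above.
user_bytes = 61_435_904       # "# Bytes of User Data"
dap_bytes = 61_590_203        # "# Bytes in DAP Layer"
records = 16_665              # "# User Records/Blocks"
connect_seconds = 60 + 48.32  # "Total connect time ... 0 00:01:48.32"

pct_user = 100 * user_bytes / dap_bytes     # share of DAP bytes that were user data
avg_record = user_bytes / records           # mean bytes per record/block
user_bps = user_bytes * 8 / connect_seconds # bits per second over the whole link lifetime

print(f"% user data in DAP layer: {pct_user:.2f}%")                   # -> 99.75%
print(f"average record/block size: {avg_record:.1f} bytes")           # -> 3686.5 bytes
print(f"approx. user data throughput: {user_bps / 1e6:.2f} Mbit/s")   # -> 4.54 Mbit/s
```

In other words, about 61 MB had already been transferred at roughly 4.5 Mbit/s with very little DAP overhead before the link was aborted.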

SHOW RMS on the originating server shows:

$ sh rms
        MULTI-  |              MULTIBUFFER COUNTS               | NETWORK
        BLOCK   | Indexed  Relative         Sequential          |  BLOCK
        COUNT   |                    Disk  Magtape  Unit Record |  COUNT
Process      0  |       0         0     0        0            0 |       0
System      32  |       0         0     0        0            0 |       8

On the remote node, it shows the same:

        MULTI-  |              MULTIBUFFER COUNTS               | NETWORK
        BLOCK   | Indexed  Relative         Sequential          |  BLOCK
        COUNT   |                    Disk  Magtape  Unit Record |  COUNT
Process      0  |       0         0     0        0            0 |       0
System      32  |       0         0     0        0            0 |       8

Thank you, Lisa

Volker Halle · 05-03-2006

Lisa,

14:13:11.71 Receive QIO issued
14:13:11.71 Receive AST delivered 20 bytes
---> DAT msg 4104 - 6A 42 47 44 63 4F 4F 37 52 43 38 31 01 D4 B1 03 10 04 06 08
---> CRC msg 2 - 73AA
DAP status code of 50C8 generated
<--- STS msg 4 - 50 C8 00 09
<--- CRC msg 2 - 278F
14:13:11.71 XMT QIO complete, 6 bytes

This seems to be a different kind of error!

The DAP status message returned is:

0x50C8 = MAC: 5 MIC: 310 (in octal notation)

MAC code 5 indicates: FILE_XFER - Error encountered while file was open

MIC code 310 (octal) seems to indicate: CRC error
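The split of 0x50C8 into MAC 5 and MIC 310 (octal) is just bit arithmetic on the 16-bit DAP status code. The Python sketch below shows it, assuming the DAP layout of a 4-bit MACCODE in the high bits and a 12-bit MICCODE in the low bits; the function name `decode_dap_status` is hypothetical.

```python
def decode_dap_status(stscode):
    """Split a 16-bit DAP status code into (MACCODE, MICCODE).

    Assumes MACCODE occupies the top 4 bits and MICCODE the low 12 bits.
    """
    maccode = (stscode >> 12) & 0xF  # high nibble
    miccode = stscode & 0x0FFF       # remaining 12 bits
    return maccode, miccode


mac, mic = decode_dap_status(0x50C8)
print(f"MAC {mac}, MIC {mic:o} (octal)")  # -> MAC 5, MIC 310 (octal)
```

Printing the MIC in octal matters because the DAP specification lists these codes in octal notation; 0x0C8 is 200 decimal, which is 310 octal.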

If you look at the exchange of DAP messages in your trace, this makes sense: the DAP status message is returned immediately after the CRC message is received. So something got corrupted in transit. RMS/FAL use end-to-end CRC checks for additional data protection.
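To illustrate how an end-to-end CRC catches corruption, the Python sketch below has a "sender" compute a CRC over a payload and a "receiver" recompute it over a copy with one bit flipped. It uses a common reflected CRC-16 (polynomial 0x8005, initial value 0xFFFF); whether DAP uses exactly this variant and initial value is an assumption here, so check the DAP specification before relying on it. The payload is simply the 20 bytes shown for the last DAT message in the trace, used as sample data.

```python
def crc16(data, crc=0xFFFF):
    """Bitwise reflected CRC-16: polynomial 0x8005, processed LSB-first as 0xA001."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


# Sample data: the bytes displayed for the last DAT message in the trace.
payload = bytes.fromhex("6A 42 47 44 63 4F 4F 37 52 43 38 31 01 D4 B1 03 10 04 06 08")
sent_crc = crc16(payload)                # computed by the sender, sent along with the data

corrupted = bytearray(payload)
corrupted[5] ^= 0x01                     # simulate a single bit flipped somewhere in transit
received_crc = crc16(bytes(corrupted))   # recomputed by the receiver over what arrived

print("CRC mismatch detected" if received_crc != sent_crc else "CRC OK")
```

Any single-bit error is guaranteed to change a CRC-16, which is why the receiving FAL can reject the message as soon as the CRC message arrives, without knowing where the corruption happened.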

You're using DECnet-over-IP (indicated by the IP$... remote node name string at the beginning of NET$SERVER.LOG).

What error did you get on the node running the BACKUP command?

You should be able to test the reliability of the network connection between the two nodes using:

NCL> LOOP loopback applic name domain:10.100.50.18, length 4096, count 1000

You can also add ", FORMAT xx" to specify a hex bit pattern to be used inside the looped messages. If any data corruption occurs, there will be an error message; if the loopback test succeeds, there will be no message.
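NCL LOOP is the right tool on the actual systems; the Python sketch below only illustrates the idea behind a pattern loopback test: send many fixed-length messages filled with a known byte pattern, have the peer echo them back, and count any echo that differs. Everything here is hypothetical (the function names, the 0x55 pattern, and the in-process `socket.socketpair()` echo peer), and because both ends live in one process it exercises no real network path.

```python
import socket
import threading


def echo_peer(sock):
    """Hypothetical stand-in for the remote loopback application: echo until EOF."""
    with sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            sock.sendall(data)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return a partial message."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf += chunk
    return bytes(buf)


def loop_test(sock, length=4096, count=1000, pattern=0x55):
    """Send `count` messages of `length` bytes filled with `pattern`; count bad echoes."""
    message = bytes([pattern]) * length
    errors = 0
    for _ in range(count):
        sock.sendall(message)
        if recv_exact(sock, length) != message:
            errors += 1
    return errors


local, remote = socket.socketpair()
peer = threading.Thread(target=echo_peer, args=(remote,))
peer.start()
errors = loop_test(local)
local.close()   # EOF makes echo_peer exit its loop
peer.join()
print(f"{errors} corrupted messages")
```

The length and count defaults mirror the NCL example (4096 bytes, 1000 messages); a pattern like 0x55 (alternating bits) is a typical choice for spotting stuck or flipped bits.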

If this really is a true CRC error, the corruption can occur anywhere on the sending node, the network or on the receiving node.

Volker.

Volker Halle · 05-03-2006

Lisa,

Just for reference, here is a pointer to the DAP protocol specification:

http://ftp.digital.com/pub/DEC/DECnet/PhaseIV/dap_v5_6_0.txt

This is the protocol used between RMS and FAL for remote file access operations in DECnet.

Volker.
