
Why shadow copy ?

 
Wim Van den Wyngaert
Honored Contributor

Why shadow copy ?

Cluster of 2 * 4000 with an HSZ50 controller.
OpenVMS V6.2-1H3.

I shut down node 1 and, 20 seconds later, node 2.
Both ended by dismounting the disks without any files open (except on the system disk).

I rebooted them with 2 seconds between the two B (boot) commands.

All disks are in shadow merge.

Why ?

Wim
Wim
Uwe Zessin
Honored Contributor

Re: Why shadow copy ?

Any pagefile or installed images on the disks?

There have been some, well, 'holes' in the dismount code, if I recall correctly.
.
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Uwe,

Pagefile is open but not on ALL disks.
Were the dismounts too close to one another?

Wim
Wim
Uwe Zessin
Honored Contributor

Re: Why shadow copy ?

If the code worked properly, then there should not be any race conditions.

Do you use any host-based RAID software except volume shadowing
(the striping driver for example)?
.
Volker Halle
Honored Contributor

Re: Why shadow copy ?

Wim,

if the transaction count on the disk (as shown by SHOW DEV D) is 1, it should dismount cleanly. You could add an appropriate SHOW DEV D command to SHUTDOWN.COM before the final DISMOUNT command. I once even added a SHOW DEV/FILES for the disk, executed only IF F$GETDVI(disk,"TRANSCNT") .GT. 1. You need to watch or capture the output of SHUTDOWN.COM to see which disks would still have open files.
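
A minimal sketch of the check described above, as it might sit in a site-specific shutdown procedure just before the final DISMOUNT. The disk name DSA14:, the label name, and the message text are placeholders chosen for illustration, not taken from any actual SHUTDOWN.COM. The F$GETDVI item codes EXISTS, MNT and TRANSCNT and the SHOW DEVICE/FILES command are standard DCL.

```dcl
$ ! Sketch only: DSA14: is a placeholder; adapt the disk name to your site.
$ disk = "DSA14:"
$ ! Skip the check if the device is unknown or not mounted on this node.
$ IF .NOT. F$GETDVI(disk,"EXISTS") THEN GOTO skip_check
$ IF .NOT. F$GETDVI(disk,"MNT") THEN GOTO skip_check
$ trans = F$GETDVI(disk,"TRANSCNT")
$ WRITE SYS$OUTPUT "''disk' transaction count = ''trans'"
$ ! Per the advice above, a count of 1 should dismount cleanly;
$ ! anything higher is worth listing before the DISMOUNT runs.
$ IF trans .GT. 1 THEN SHOW DEVICE/FILES 'disk'
$skip_check:
```

The output only helps if the shutdown log is captured, for example on a console manager or a logged terminal session.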

You cannot cleanly dismount a disk with a page-/swapfile installed. DISMOUNT is a synchronous command, so they can't be 'too close to one another'.

The HSZ50 is connected to a shared SCSI bus, so the disks are 'local' to each system, right ?

Volker.
Robert_Boyd
Respected Contributor

Re: Why shadow copy ?

I would have to say that I often had that kind of experience with V6.2 systems.

Later versions 7.2 and following have been much better. I would say for V6.2 that you need to allow at least 1 minute between them, possibly more. Or you could do the cluster shutdown -- but that always seemed to take forever for the final handshakes to complete.

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Uwe: no host-based RAID.

Volker: I do capture a SHOW DEV/FILES of every disk. Nothing is open except INDEXF.SYS. I don't do a SHOW DEV D yet. Yes, shared SCSI.

Wim
Wim
Volker Halle
Honored Contributor

Re: Why shadow copy ?

Wim,

can you test this in the running system ? Try a DISM DSAx: on one of the shadowsets from both systems and then mount the shadowset again with the same command as in SYSTARTUP_VMS.COM - what happens ?

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Volker,

I have to wait until the merge is finished.

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Volker,

I tried a DISMOUNT DSA14 a few seconds after the merge completed. The command hangs and does not react to Ctrl/Y.
After a few minutes, device DSA14 is in mntverifytimeout on this node but is still normally accessible from the other node. The dismount is still active (or rather, inactive).

Wim
Wim
Volker Halle
Honored Contributor

Re: Why shadow copy ?

Wim,

sorry for that, but this indicates that something is wrong with your shadowing software or your disk configuration.

What's MVTIMEOUT set to ? What does SHOW DEV DSA14 tell on the system, where the dismount is hung ?

Now you need to start troubleshooting the hanging dismount with ANAL/SYS

SDA> SET PROC/ind=
SDA> SHOW PROC/LOCK

waiting lock shown first ?

If not:

SDA> SHOW PROC/CHAN

busy channel ?

SDA> SHOW DEV DSA14

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Don't be sorry.

SDA> show proc/lock

Process index: 0053 Name: SYSMGR_WVW Extended PID: 20600453
---------------------------------------------------------------
Lock data:

Lock id: 270011EF PID: 00010053 Flags: VALBLK CONVERT
Par. id: 01000000 SUBLCKs: 0
LKB: 81408E00 BLKAST: 00000000
PRIORTY: 0000

Granted at NL 00000000-FFFFFFFF

Resource: 45504F5F 24464D4C LMF$_OPE Status:
Length 18 504C412D 534D564E NVMS-ALP
Exec. mode 00000000 00004148 HA......
System 00000000 00000000 ........

Local copy

Process index: 0053 Name: SYSMGR_WVW Extended PID: 20600453
---------------------------------------------------------------
Lock data:

Lock id: 040017E0 PID: 00010053 Flags: SYSTEM
Par. id: 00000000 SUBLCKs: 0
LKB: 813A6C80 BLKAST: 00000000
PRIORTY: 0000

Granted at EX 00000000-FFFFFFFF

Resource: 4153445F 24544D44 DMT$_DSA Status:
Length 11 00000000 003A3431 14:.....
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........

Local copy

SDA> show proc/chan

Process index: 0053 Name: SYSMGR_WVW Extended PID: 20600453
---------------------------------------------------------------


Process active channels
-----------------------

Channel Window Status Device/file accessed
------- ------ ------ --------------------
0010 00000000 Busy DSA14:
0020 812201C0 DSA0:[SYSCOMMON.SYSEXE]DISMOUNT.EXE;2
0030 813BB9C0 DSA0:[SYSCOMMON.SYSLIB]DISMNTSHR.EXE;3 (section file)
0040 00000000 DSA14:
0050 813B4840 DSA0:[SYSCOMMON.SYSEXE]DCL.EXE;1 (section file)
0060 813C1280 DSA0:[SYSCOMMON.SYSLIB]DCLTABLES.EXE;210 (section file)
0070 00000000 Busy DSA14:
0080 00000000 TNA5:
0090 00000000 TNA5:
00A0 81723740 DSA0:[SYSCOMMON.SYSMSG]UCX$MSG.EXE;1
SDA>

mc sysgen show mv
Parameter Name Current Default Min. Max. Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
MVTIMEOUT 3600 3600 1 64000 Seconds D
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

sh dev dsa14/fu

Disk DSA14:, device type Generic SCSI disk, is online, mounted, mount
verification timed out, volume is marked for dismount, file-oriented device,
shareable, available to cluster, error logging is enabled.

Error count 0 Operations completed 280875
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G:R,W
Reference count 3 Default buffer size 512
Total blocks 35556389 Sectors per track 254
Total cylinders 7000 Tracks per cylinder 20

Volume label "DISK14" Relative volume number 0
Cluster size 35 Transaction count 1
Free blocks 7537810 Maximum files allowed 493838
Extend quantity 5 Mount count 1
Mount status Process Cache name "_DSA0:XQPCACHE"
Extent cache size 64 Maximum blocks in extent cache 753781
File ID cache size 64 Blocks currently in extent cache 0
Quota cache size 0 Maximum buffers in FCP cache 2993
Min ret. period 3-00:00:00.00 Max ret. period 7-00:00:00.00
Volume owner UIC [1,201] Vol Prot S:RWCD,O:RWCD,G:RWCD,W:RWCD

Volume Status: do not unload on dismount, write-back caching enabled.
Volume is also mounted on SBAPV2.
Wim
Peter Quodling
Trusted Contributor

Re: Why shadow copy ?

I have seen, even on a 7.3-2 system a situation, where the shutdown doesn't

The other thing is: has everything that needs to be shut down actually been shut down? Some SHOW DEV/FILES commands interspersed in the shutdown code will confirm that things are happening as they should.

q
Leave the Money on the Fridge.
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Peter,

Everything is stopped during the shutdown. Even spooling. Show dev/fi didn't show anything.

Volker : nothing in operator log file. No msg at all.
Wim
Volker Halle
Honored Contributor

Re: Why shadow copy ?

Wim,

DISMOUNT is not waiting for a LOCK (waiting locks would be shown first). Busy channel to DSA14:, so now it's time to do a SDA> SHOW DEV DSA14 and see, why the IO does not continue...

MVTIMEOUT=3600 means 1 hour - I guess you didn't wait that long...

Any error count on DSA14 or any of its members ($ SHOW DEV DSA14: from the node with the hanging DISMOUNT)?

Are all the expected shadow set members of DSA14: still part of the virtual unit (VU)?
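
The two questions above could also be answered from DCL rather than by reading SHOW DEVICE output. This is a rough sketch only: DSA14: comes from the thread, the label names are arbitrary, and it assumes the F$GETDVI item code SHDW_NEXT_MBR_NAME is available on the OpenVMS version in use (not verified for V6.2). It also assumes F$GETDVI still returns on a virtual unit that is in mount verification.

```dcl
$ ! Sketch only: run on the node where the DISMOUNT hangs.
$ vu = "DSA14:"
$ WRITE SYS$OUTPUT "''vu' error count = ", F$GETDVI(vu,"ERRCNT")
$ ! Assumption: SHDW_NEXT_MBR_NAME exists on this VMS version.
$ ! Given a virtual unit it returns the first member; given a member,
$ ! the next one; a null string marks the end of the list.
$ member = F$GETDVI(vu,"SHDW_NEXT_MBR_NAME")
$next_member:
$ IF member .EQS. "" THEN GOTO members_done
$ WRITE SYS$OUTPUT "  member ''member' error count = ", F$GETDVI(member,"ERRCNT")
$ member = F$GETDVI(member,"SHDW_NEXT_MBR_NAME")
$ GOTO next_member
$members_done:
```

If no member lines are printed, the virtual unit has no members left on that node.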

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Volker,

No errors on the node.


SDA> show dev dsa14

I/O data structures
-------------------
DSA14 Generic_DK UCB address: 8147E400

Device status: 08140010 online,supmvmsg,dismount,exfunc_supp
Characteristics: 1C6D4008 dir,fod,shr,avl,mnt,dmt,elg,idv,odv,rnd
01082021 clu,mscp,loc,vrt,scsi

Owner UIC [000001,000004] Operation count 280875 ORB address 8147AC00
PID 00000000 Error count 0 DDB address 81276800
Alloc. lock ID 1D000732 Reference count 3 DDT address 8857E3A0
Alloc. class 0 Online count 0 VCB address 81476F40
Class/Type 01/36 BOFF 00000000 CRB address 81276880
Def. buf. size 512 Byte count 00000000 CDDB address 812E4780
DEVDEPEND 1B5814FE SVAPTE 00000000 SHAD address 81483D40
DEVDEPND2 00000000 DEVSTS 00000004 I/O wait queue 8147E46C
DEVDEPND3 00000000 RWAITCNT 0000
FLCK index 3A
DLCK address 00000000

Shadow Virtual Unit DEVSTS status: 00000004 nocnvrt

IO data structures
-------------------

----- Shadow Descriptor Block (SHAD) 81483D40 -----

Virtual Unit status: 8000 failed

Members 0 Act user IRPs 2 VU UCB 8147E400
Devices 0 SCB LBN 010F4627 Master FL 814840A4
Fcpy Targets 0 Generation Num E5E973A4 Restart FL 814840AC
Mcpy Targets 0 00A49E65
Last Read Index 1 Virtual Unit Id 00000000
Master Index 0 1261000E

----- SHAD Device summary for DSA14 -----

I/O data structures
-------------------

--- Primary Class Driver Data Block (CDDB) 812E4780 ---

Status: 00000000
Controller Flags: 00D0 cf_this,cf_misc,cf_attn

Allocation class 0 CDRP Queue 812E4780 DDB address 81276800
System ID 00000000 Restart Queue 812E47C8 CRB address 81276880
00000000 DAP Count 0 CDDB link 8134B380
Contrl. ID 00000000 Contr. timeout 0 PDT address 00000000
00000000 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 812E49C0
MSCP Cmd status 00000000

*** I/O request queue is empty ***

I/O data structures
-------------------

--- Volume Control Block (VCB) 81476F40 ---

Volume: DISK14 Lock name: DISK14
Status: 20 extfid
Status2: 10 nohighwater
Status3: 00000000
Shadow status: 01 shadmast

Mount count 0 Rel. volume 0 AQB address 81380240
Transactions 1 Max. files 493838 RVT address 8147E400
Free blocks 7537810 Rsvd. files 10 FCB queue 81483A00
Window size 7 Cluster size 35 Cache blk. 814833C0
Vol. lock ID 01000744 Def. extend sz. 5 Shadow mem. FL 81477008
Block. lock ID 0600112D Record size 0 Shadow mem. BL 81477008
Shadow lock ID 00000000

I/O data structures
-------------------

--- Shadow set DSA14 member summary ---

Volume: DISK14

Physical unit Primary path Secondary path Member status
------------- ------------ -------------- -------------

* No mounted members in shadow set *

I/O data structures
-------------------

--- ACP Queue Block (AQB) 81380240 ---

ACP requests are serviced by the eXtended Qio Processor (XQP)

Status: 14 defsys,xqioproc

Mount count 19 ACP type f11v2 Linkage 812DBF80
ACP class 129 Request queue 00000000

*** ACP request queue is empty ***
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

No errors on the other cluster node.
The members are no longer in DSA14.

Wim
Wim
Volker Halle
Honored Contributor

Re: Why shadow copy ?

Wim,

sure, both members have been dropped from the shadowset... That's why it shows up as mntverifytimeout.

The SHAD shows: Act user IRPs 2 - on a 0-member VU ??? This must be a shadowing problem. Do you have the most recent shadowing patch installed (ALPSHAD14_062) ?

You will need to reboot the node with the hanging dismount. As a last resort before that, you may try a DISM/ABORT DSA14:.

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Dismount/abort on both nodes. Both hang.

Show dev dsa14 on the node that last dismounted : still normal.

Shutdown : process of dism goes into RWAST.
Boot blocks. Reboot asked (damned, crash would have been better).

The 2nd node in the mean time has a dismounted dsa14. Operator.log says node1 did it.

Rebooted 2nd node without problems.

No patching was allowed on this node. So, patch not installed.

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

And after the boot no merge on dsa14 !
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

After the reboots I repeated the test of dismounting DSA14. No problem whatsoever.

Wim
Wim
Mike Reznak
Trusted Contributor

Re: Why shadow copy ?

Hi,
no patch installed?

The installation should solve the problem.

Also OpenVMS V7.3-2 with minimerge would help you. But there is probably a reason for not upgrading.

Mike
...and I think to myself, what a wonderful world ;o)
Wim Van den Wyngaert
Honored Contributor

Re: Why shadow copy ?

Reason is that this was a SWIFT cluster. No upgrade without full retest.

But if I get into problems that easily, I wonder how it was tested.

Wim
Wim
Robert Brooks_1
Honored Contributor

Re: Why shadow copy ?

There was a SUBSTANTIAL rewrite of the shadowing code for V7.0 that was backported to V6.2 (released, of course, as part of a SHADOWING patch kit). I STRONGLY encourage you to install the latest shadowing kit for V6.2. I'm not sure which patch kit contains the updated code, but installing the latest kit would be my recommendation.

Note -- I have no idea if this patch kit will address your problem, but I'm not particularly motivated to investigate a problem on quite ancient software.

Also, please attempt to keep straight the distinction between a shadowing copy and merge. Your title refers to a copy operation; your issue is with unwanted merges. These concepts are explained rather well in the shadowing manuals.

-- Rob