Re: Extremely slow backup to LTO Tape after replacement of hot plug (shadowed) internal SCSI disk

MarkOfAus · ‎08-06-2014

Hello all,

On the 28th July, 8am, I hot swapped out a failed SCSI disk that had failed with a "new" one sent by HP. This disk is a member of a shadow set called DSA3:

Immediately the error count on all the drives on the PKA0: increased their error count by one. I reset the error count then did a show device d/rebuild:

DSA1:                   Yes
DSA2:                   Yes
DSA3:                   No
$3$DKA0:                No
$3$DKA400:              No

So I performed a set volume/rebuild on DSA1: and DSA2:

I then walked away, happy with my achievements...

A nightly backup is performed on the DS25 (EMU1) ONLY to the MKC500: device.

Normally, backups to an LTO take just two hours, but the next day, and all subsequent days (I had been away) the backup is now taking upwards of 8 hours.

Magtape $3$MKC500: (EMU1), device type HP Ultrium 2-SCSI, is online, allocated,
    mounted foreign, record-oriented device, file-oriented device, available to
    cluster, error logging is enabled, controller supports compaction
    (compaction enabled), device supports fastskip (per_io).

    Error count                    0    Operations completed          381412841
    Owner process        "BATCH_966"    Owner UIC               [COMCEN,BACKUP]
    Owner process ID        2CC84AF3    Dev Prot            S:RWPL,O:RWPL,G:R,W
    Reference count                4    Default buffer size               32256

    Volume label            "AUG7  "    Relative volume no.                   0
    Record size                    0    Transaction count                     1
    Mount status             Process    Mount count                           1
    ACP process name              ""
    Density                  default    Format                        Normal-11
    Allocation class               3

  Volume status:  odd parity.

I say normally, in that the space or file makeup on the disk(s) has not change dramatically in months, so I have a normal average of around 2.45 hours. A daily backup runs from midnight to approx 2:45 am when all is working correctly.

I must add, the sequence of backup is thus:

DSA2:, DSA1:, DSA3: and then $3$DKA400

Background:

OpenVMS 7.3-2 clustered system.

EMU1 (DS25) is the primary server, EMU2 (DS20) is the disaster server.

DS25 (alloclass 3) with internal SCSI shadowed to a remote (approx 1 kilometre) DS20 (alloclass 4). This is a fibre link via EWA <-> fibre converters.

3 sets of shadowed disks, DSA1:, DSA2: and DSA3: all comprised of 1 disk on DS25 and another on the DS20.

On the DS25, output of show device d:

DSA1:                   Mounted              0  DATA1         136.50GB   124   2
DSA2:                   Mounted              0  DATA2         206.39GB     6   2
DSA3:                   Mounted              0  DATA3          17.70GB  1192   2
$3$DKA0:        (EMU1)  Mounted              0  ALPHASYS       23.48GB  1713   1
$3$DKA100:      (EMU1)  ShadowSetMember      0  (member of DSA1:)
$3$DKA200:      (EMU1)  ShadowSetMember      0  (member of DSA2:)
$3$DKA300:      (EMU1)  ShadowSetMember      0  (member of DSA3:)
$3$DKA400:      (EMU1)  Mounted              0  ADMIN          28.20GB    35   1
$3$DQA0:        (EMU1)  Online               0
$3$DQA1:        (EMU1)  Offline              1
$3$DQB0:        (EMU1)  Offline              1
$3$DQB1:        (EMU1)  Offline              1
$4$DKB0:        (EMU2)  Mounted              0  (remote mount)                 1
$4$DKB100:      (EMU2)  ShadowSetMember      0  (member of DSA1:)
$4$DKB200:      (EMU2)  ShadowSetMember      0  (member of DSA2:)
$4$DKB300:      (EMU2)  ShadowSetMember      0  (member of DSA3:)
$4$DKB400:      (EMU2)  Mounted              0  (remote mount)                 1
$4$DQA0:        (EMU2)  Online               0

This system has been running for years without any glitches, with the only thing done in recent history being the replacement of a failed disk on the DS25 for the DSA3: shadow set.

I attach the analyze/system sho dev dsa1 and dsa2 if that helps.

I also have T4 data for periods before, after and during the exchange of disk. In summation, it shows a continual LOW Direct IO during the backup period over the 8 to 10 hours rather than the high rates normally seen for 2 hours.

Thank you for your assistance,

Mark

(I wasn't going to try my luck adding more attachments, so anything else like T4 I can append to another reply/post)

Volker Halle · ‎08-07-2014

Mark,

what's slowing down the backup now ?

- slower writes to tape ?

- slower reads from disks ?

- slower intersite link ?

If you look at the T4 data for the disks (during backup), do you now see larger than usual queues '[MON.DISK]Qln' on the local or remote mbrs ? Or larger Response times?

What I do to the test read throughput from disks is:

$ SPAWN/NOWAIT BACKUP/PHY/NOCRC/GROUP=0 disk: NLA0:x.x/SAVE then $ MONI DISK and look at the Dirio-Rate

You could do this for the shadowsets and also for individually mounted disks from the local and remote site.

Then try a BACKUP/PHYSICAL from a local disk to tape.

Any resource problems (nonpaged pool) on the node running backup ?

Volker.

MarkOfAus · ‎08-07-2014

Volker,

what's slowing down the backup now ?

 

- slower writes to tape ?

- slower reads from disks ?

- slower intersite link ?

My friend, if I knew that, the problem would be solved.

We back up from the main server (DS25) so I assume (correctly?) that all disk access is local if there are served disks in the shadow set that are local, which there are.

Therefore, I have only T4 data at hand for the DS25.

If you look at the T4 data for the disks (during backup), do you now see larger than usual queues '[MON.DISK]Qln' on the local or remote mbrs ? Or larger Response times?

The mon.disk qlen is abnormally high for the duration of 10 hours the backup is running. I will attach the 26th July T4 (before the disk replacement) and the T4 after the disk replacement, in this case I've picked 5-aug-2014.

You can see for yourself the stark difference. The backup normally runs from midnight till about 3am, as it did on the 26th July. Compare that to the backup on the 5th August, where the Qlen is high from midnight until 10:30am when it finally finishes (or I think it was killed off because of impact on the system - I have to find that out).

The disk oprate is normally a spike for 2-3 hours, but now it's a steady sprawl over 10 hours.

What I do to the test read throughput from disks is:

 

$ SPAWN/NOWAIT BACKUP/PHY/NOCRC/GROUP=0 disk: NLA0:x.x/SAVE then $ MONI DISK and look at the Dirio-Rate

 

You could do this for the shadowsets and also for individually mounted disks from the local and remote site.

I will get back to you after running this (I need to wait for the backup to finish before clobbering the system more...).

Then try a BACKUP/PHYSICAL from a local disk to tape.

 

Any resource problems (nonpaged pool) on the node running backup ?

It would be odd (far too eerily coincidental) that a change in disk would correspond with that.

Anyway, sysgen reports:

Parameter Name           Current    Default     Min.      Max.     Unit  Dynamic
--------------           -------    -------    -------   -------   ----  -------
NPAGEDYN                 39993344    1048576    163840         -1 Bytes
NPAGEVIR                156827648    8388608    163840         -1 Bytes
NPAG_BAP_MIN                40960          0         0         -1 Bytes
NPAG_BAP_MAX               131072          0         0         -1 Bytes
NPAG_BAP_MIN_PA                 0          0         0         -1 Mbytes
NPAG_BAP_MAX_PA              1024         -1         0         -1 Mbytes
NPAG_RING_SIZE               2048       2048         0         -1 Entries
NPAGECALC                       0          1         0          2 Coded-valu
NPAGERAD                        0          0         0         -1 Bytes
NPAG_INTERVAL                  30         30         0         -1 Seconds    D
NPAG_GENTLE                    85         85         0        100 Percent    D
NPAG_AGGRESSIVE                50         50         0        100 Percent    D
SYSGEN>

Backup is running now so here tis:

 

...$COMMON:COMMON.ADMIN> SHOW MEMORY/POOL/FULL
              System Memory Resources on  8-AUG-2014 10:20:22.33

Nonpaged Dynamic Memory      (Lists + Variable)
  Current Size (MB)                38.14   Current Size (Pagelets)     78112
  Initial Size (MB)                38.14   Initial Size (Pagelets)     78112
  Maximum Size (MB)               149.56   Maximum Size (Pagelets)    306304
  Free Space (MB)                  17.96   Space in Use (MB)           20.17
  Largest Var Block (MB)           10.11   Smallest Var Block (bytes)     64
  Number of Free Blocks             9941   Free Blocks LEQU 64 bytes    2017
  Free Blocks on Lookasides         5123   Lookaside Space (MB)         1.85

Bus Addressable Memory       (Lists + Variable)
  Current Size (KB)               128.00   Current Size (Pagelets)       256
  Initial Size (KB)               128.00   Initial Size (Pagelets)       256
  Free Space (KB)                 110.87   Space in Use (KB)           17.12
  Largest Var Block (KB)          104.00   Smallest Var Block (KB)      6.87
  Number of Free Blocks                2   Free Blocks LEQU 64 bytes       0
  Free Blocks on Lookasides            0   Lookaside Space (bytes)         0

Paged Dynamic Memory
  Current Size (MB)                 7.68   Current Size (Pagelets)     15744
  Free Space (MB)                   4.17   Space in Use (MB)            3.50
  Largest Var Block (MB)            4.12   Smallest Var Block (bytes)     16
  Number of Free Blocks              227   Free Blocks LEQU 64 bytes     198

Lock Manager Dynamic Memory
  Current Size (MB)                56.89   Current Size (Pages)         7283
  Free Space (MB)                  50.73   Hits                     38995840
  Space in Use (MB)                 6.16   Misses                          0
  Number of Empty Pages             6228   Expansions                   7283
  Number of Free Packets          207799   Packet Size (bytes)           256

Write Bitmap (WBM) Memory Summary
  Local bitmap count:     0     Local bitmap memory usage (bytes)       0.00
  Master bitmap count:    0     Master bitmap memory usage (bytes)      0.00
...$COMMON:COMMON.ADMIN>

Some timings (this is just indicative - other things impact it, I know):

DSA1: 144GB used, takes 5:25 hours to backup (lots of small files, large files (300 MB each) as well)

DSA2: 73GB used, takes 4:51 hours to backup (lots of small user files on this disk)

DSA3: 16GB used, takes 16minutes - system shared software - UAF/security/rightslist/queues etc.

DKA400: 5GB used - ODS-5 (all the rest of the disks are ODS-2) - Oracle Client Disk. Takes 49 minutes to backup. Non-shadowed. Lots of small files and other detritus Oracle places there.

Thanks for your assistance.

(This forum software is rubbish. I set my included output to Courier new and it removes it!)

(I also attempt to attach the .csv from T4 to which this junk software says:

"The contents of the attachment doesn't match its file type.")

Volker Halle · ‎08-08-2014

Mark,

interesting data ;-)

Let's just look at the BACKUP of the first shadowset DSA2: (local $3$DKA200 and remote $4$DKB200):

Saturday, 26-JUL 00:28-01:46, Opcnt 500 to 1500, then about 150 starting at 01:03 (BACKUP/RECORD phase ?)

Qlen about 5, then about 1 starting at 01:03, Response-time 6-8 ms

Tuesday, 5-AUG 00:30-05:54, Opcnt 80-190, then about about 150 starting at 05:23 (BACKUP/RECORD phase ?)

Qlen about 7 (all the time), then 1 at 05:23, Response-time 80-120 ms, 6 ms after 05:23

Even the BACKUP of DSA3: on 26-JUL (from the REMOTE mbr $4$DKB300, as $3$DKA300 was not present) seems to have performed well: Opcnt about 1000, Qlen=6, Response-time=5-6 ms

Nearly zero IOs on the remote member, except during the BACKUP/RECORD (?) phase. But this phase shows very similar behaviour BEFORE and AFTER replacement of $3$DKA300:

Could the local SCSI adapter/bus have some problems after hot-swapping $3$DKA300: ?

For disk throughput comparison, I've collected some data from a DS25 (V7.3-1, Symbios 895, HP U320 15K RPM 36.4 GB disk). This SCSI adapter should have a max. throughput of 80 MB/sec in LVD mode. 2-mbr shadowsets similar to your config:

$ BACKUP/PHYS/NOCRC/GROUP=0 DSAx: NLA0: x.x/SAVE achieves about 2234 DIRIO/sec -> 70 MB/sec, constant queue length of about 405 (MONI DISK/ITEM=Q/INT=1)

Same results on the 'remote' DS25 with 10K 36.4 GB disk (locally mounted, not a shadowset mbr). The limiting throughput factor seems to be the Symbios 895 in this case.

You could perform a BACKUP/IMAGE DSA2: NLA0:x.x/SAVE and if you also get the similar 'bad performance', try to temporarily DISMOUNT/POLICY=MINI $3$DKA200: (the local mbr) and see if throughput increases ! If so, this would confirm the local SCSI adapter/bus as a suspect.

Maybe with even less impact on your system, you can try the above BACKUP test against $3$DKA400: and also against $4$DKB400: (after first mouting it on EMU1).

Update: you may also want to play with the Read Cost parameter. The remote mbr of DSA2: has a read cost of 501. You may try to reduce that and increase the read cost of the local mbr. This way you could force reads from remote, if that member provides better performance.

Volker.

MarkOfAus · ‎08-08-2014

Hi Volker,

"Mark,

interesting data ;-)

"

In your element, hey, Volker :-) You just love helping me! :-)

I apologise I didn't give you the DIO figures earlier, but the system's been busy.

Anyway, as per your suggestion, here are the results of SPAWN/NOWAIT BACKUP/PHY/NOCRC/GROUP=0 disk: NLA0:x.x/SAVE

DSA1: Approx 76 max.

DSA2: Approx 77 max.

DSA3: Approx 2767 max.

Snap shot of DSA3:

DSA3: DATA3 2767.66 146.75 0.00 2767.66
DSA3: DATA3 2798.66 372.27 0.00 2798.66

DSA3: DATA3 2762.66 433.01 0.00 2798.66

A marked difference!

So here's a summary after running DSA1:, DSA2:, DSA3: and $3$DKA400:

                            OpenVMS Monitor Utility
                              DISK I/O STATISTICS
                                 on node EMU1
                             8-AUG-2014 19:52:06.38

I/O Operation Rate                         CUR        AVE        MIN        MAX

$3$DKA0:      (EMU1)   ALPHASYS           4.00       1.10       0.00      24.66
$3$DKA100:    (EMU1)   DATA1              0.00      36.45       0.00      76.00
$3$DKA200:    (EMU1)   DATA2              0.00       8.28       0.00      76.66
$3$DKA300:    (EMU1)   DATA3              0.66     551.49       0.00    2834.33
$3$DKA400:    (EMU1)   ADMIN             21.33       7.75       0.00      81.00
DSA1:                  DATA1              0.00      36.45       0.00      76.00
DSA2:                  DATA2              0.00       8.28       0.00      76.66
DSA3:                  DATA3              0.66     548.33       0.00    2834.33
$4$DKB100:    (EMU2)   DATA1        R     0.00       0.00       0.00       0.00
$4$DKB200:    (EMU2)   DATA2        R     0.00       0.00       0.00       0.00
$4$DKB300:    (EMU2)   DATA3        R     0.66      11.09       0.00      50.33

"Even the BACKUP of DSA3: on 26-JUL (from the REMOTE mbr $4$DKB300, as $3$DKA300 was not present) seems to have performed well: Opcnt about 1000, Qlen=6, Response-time=5-6 ms

Nearly zero IOs on the remote member, except during the BACKUP/RECORD (?) phase. But this phase shows very similar behaviour BEFORE and AFTER replacement of $3$DKA300:"

Oh yes, I forgot to mention, sorry, that the disk had been dead a week before I had a chance to install the new one, so yes, the backup of 26-jul-2014 would indeed have used $4$dkb300 as the primary source for reading when backing up DSA3:. You have been paying attention. ;-)

You are correct, the backup/record phase is quite intense because these are FULL backups, not incrementals. Again, I forgot to mention this.

"Could the local SCSI adapter/bus have some problems after hot-swapping $3$DKA300: ?"

Ah, thank you for suggesting this. If you are suggesting this after looking at the data, then perhaps I am not as mad as I think. I too believe this must be where the issue lies. See below for more confirmation.

Ok, so I ran the SPAWN/NOWAIT BACKUP/PHY/NOCRC/GROUP=0 disk: NLA0:x.x/SAVE on the remote server, and looky here at the results:

                              DISK I/O STATISTICS
                                 on node EMU2
                             8-AUG-2014 20:15:45.15

I/O Operation Rate                         CUR        AVE        MIN        MAX

$4$DKB0:      (EMU2)   ALPHASYS1          0.00       0.57       0.00      23.00
$4$DKB100:    (EMU2)   DATA1              0.00     328.45       0.00    1148.00
$4$DKB200:    (EMU2)   DATA2              0.00     195.70       0.00    1152.00
$4$DKB300:    (EMU2)   DATA3              1.00      98.69       0.00     942.00
$4$DKB400:    (EMU2)   ADMIN1             0.00      88.00       0.00    1093.33
DSA1:                  DATA1              0.00     328.45       0.00    1148.00
DSA2:                  DATA2              0.00     195.70       0.00    1152.00
DSA3:                  DATA3              0.66      90.13       0.00     942.00
$3$DKA100:    (EMU1)   DATA1        R     0.00       0.00       0.00       0.00
$3$DKA200:    (EMU1)   DATA2        R     0.00       0.00       0.00       0.00
$3$DKA300:    (EMU1)   DATA3        R     0.33      25.90       0.00     194.33

There's just a little bit of difference... :-)

So, as you suggest, this is a controller issue, almost certainly on PKA

(analyze/system clue scsi /summary gives us this: )

OpenVMS (TM) system analyzer


SCSI Summary Configuration:
---------------------------
SPDT      Port  STDT   SCSI-Id  SCDT  SCSI-Lun  Device    UCB       Type    Rev
--------------  --------------  --------------  --------  --------  ------  ----
8218D540  PKD0

821874C0  PKC0
                82184040     5
                                821CC4C0     0  MKC500    821A5800  Ultriu  F63D

82051A40  PKB0

81F53300  PKA0
                81F57F00     0
                                81F58100     0  DKA0      81ED0B80  BF0368  HPB2
                820D9C80     1
                                81EC6840     0  DKA100    82171500  BD3008  HPB4
                82172080     2
                                82172280     0  DKA200    82171A40  BD3008  HPB4
                82172F80     3
                                82173180     0  DKA300    82172700  BF0368  HPB9
                82173E80     4
                                82174080     0  DKA400    82173600  BF0368  HPB9

You're a genius Volker. I didn't think to look at the other system to compare its throughput. I think I might add a wrapper around the batch job for the backup to take out the local disk members of DSA1:, DSA2: and DSA3: and hope to speed up the backup. Today the backup ended at 11:30 so by the middle of next week, it might be finishing after midday - not a recipe for happy users (doing a directory command is horrible when this is happening).

I will start a case with HP.

Cheers

Mark

Volker Halle · ‎08-08-2014

Mark,

why does the DSA3: (local $3$DKA300) backup test perform so well ?

Did you see my update regarding 'read cost' ( $ SET DEV/READ_COST=n $3$DKcn:) ? If you can get more throughput this way, it will be safer, as the 2nd members don't need to be dismounted.

Volker.

MarkOfAus · ‎08-08-2014

"Mark,

why does the DSA3: (local $3$DKA300) backup test perform so well ?"

Beats me. Perhaps it was an anomaly? I will run it again.

"Did you see my update regarding 'read cost' ( $ SET DEV/READ_COST=n $3$DKcn:) ? If you can get more throughput this way, it will be safer, as the 2nd members don't need to be dismounted."

No, I didn't, sorry.

You know, for the life of me, I knew there was a setting to change the "priority" of the disks in a shadow set and I couldn't think of it. Thanks for reminding me of that.

I agree this would be a safer option, given the reason for the shadow set across multiple sites is for data security.

MarkOfAus · ‎08-08-2014

Ran it again on DSA3:

                            OpenVMS Monitor Utility
                              DISK I/O STATISTICS
                                 on node EMU1
                             8-AUG-2014 20:59:31.19

I/O Operation Rate                         CUR        AVE        MIN        MAX

$3$DKA0:      (EMU1)   ALPHASYS           0.00       0.15       0.00       1.66
$3$DKA100:    (EMU1)   DATA1              0.00       0.00       0.00       0.00
$3$DKA200:    (EMU1)   DATA2              0.00       0.00       0.00       0.00
$3$DKA300:    (EMU1)   DATA3           2798.66    2390.29    1499.66    2798.66
$3$DKA400:    (EMU1)   ADMIN              0.00       0.79       0.00      12.00
DSA1:                  DATA1              0.00       0.00       0.00       0.00
DSA2:                  DATA2              0.00       0.00       0.00       0.00
DSA3:                  DATA3           2798.66    2390.24    1499.33    2798.66
$4$DKB100:    (EMU2)   DATA1        R     0.00       0.00       0.00       0.00
$4$DKB200:    (EMU2)   DATA2        R     0.00       0.00       0.00       0.00
$4$DKB300:    (EMU2)   DATA3        R    48.33      42.12      27.00      49.33

Same result. It has to be getting the primary data from the remote node.

I will check.

MarkOfAus · ‎08-08-2014

Yes,

Analyze/system sho dev dsa3:

Device $4$DKB300        ...  Master Member
  Index 0 Status    000000A0 src,valid
  Read Cost 000001F5    Site 00000000   SM Time Out 120 UID 1162012C 00000004
                        UCB  821AC880
Device $3$DKA300
  Index 1 Status    000000A0 src,valid
  Read Cost 00000002    Site 00000000   SM Time Out 120 UID 1161012C 00000003
                        UCB  82172700

That explains the high throughput for DSA3:

Volker Halle · ‎08-08-2014

Mark,

Same result. It has to be getting the primary data from the remote node.

No, I don't think so: the MONITOR data of EMU1 confirms, that the majority of the reads are from $3$DKA300: (the LOCAL member) and only SOME IOs are from $4$DKB300: (the REMOTE member). And the local disk $3$DKA300: is actually the one you've replaced.

A HIGHER read cost should result in LESS IOs to be sent to that shadowset member. $ SHOW SHADOW also shows the Read Cost values, so no need to invoke SDA - although that is my favorite tool ;-)

Volker.

MarkOfAus · ‎08-08-2014

"No, I don't think so: the MONITOR data of EMU1 confirms, that the majority of the reads are from $3$DKA300: (the LOCAL member) and only SOME IOs are from $4$DKB300: (the REMOTE member). And the local disk $3$DKA300: is actually the one you've replaced."

I stand corrected, you are right. It's just the master member of the set because it fell into that role when the $3$DKA300 disk died. My apologies.

Could something as mundane as a io autoconfigure help here?

Or, the solution is to replace all the other disks and get wonderful throughput back again... :-)

Volker Halle · ‎08-08-2014

Mark,

you can change the Read Cost of the shadowset members on the fly.

Maybe start to experiment on EMU2:

EMU2 $ SET DEV/READ=1 $3$DKA200:

EMU2 $ SET DEV/READ=500 $4$DKB200:

Then try the BACKUP/PHY DSA2: test on EMU2. Most of the reads should now be from the EMU1 member and should be 'slow'. If that works as expected, apply the same setting on EMU1 for DSA1: and DSA2: by reducing the Read Cost for the (remote) EMU2 members and increasing it for the (local) EMU1 members.

To reset the read cost values back to their defaults, use $ SET DEV/READ DSA2:

And no, I don't think a SYSMAN IO AUTO will help here. Probably only a reboot or even power-off/on of EMU1. But maybe can can work around this problem by modifying the Read Cost...

Volker.

MarkOfAus · ‎08-08-2014

Volker,

I did as you requested:

SET DEV/READ=1 $3$DKA200:

SET DEV/READ=500 $4$DKB200:

and for good measure:

SET DEV/READ=1 $3$DKA100:

SET DEV/READ=500 $4$DKB100:

Results (as you expected):

                            OpenVMS Monitor Utility
                              DISK I/O STATISTICS
                                 on node EMU2
                             8-AUG-2014 21:50:17.29

I/O Operation Rate                         CUR        AVE        MIN        MAX

$4$DKB0:      (EMU2)   ALPHASYS1          0.00       0.12       0.00       5.00
$4$DKB100:    (EMU2)   DATA1              0.00       0.00       0.00       0.00
$4$DKB200:    (EMU2)   DATA2              0.00       0.00       0.00       0.00
$4$DKB300:    (EMU2)   DATA3              0.33       1.73       0.00       7.00
$4$DKB400:    (EMU2)   ADMIN1             0.00       0.00       0.00       0.00
DSA1:                  DATA1             59.00      37.16       0.00      76.00
DSA2:                  DATA2              0.00      18.73       0.00      76.66
DSA3:                  DATA3              0.33       1.17       0.00       3.66
$3$DKA100:    (EMU1)   DATA1        R    59.00      37.15       0.00      76.00
$3$DKA200:    (EMU1)   DATA2        R     0.00      18.73       0.00      76.66
$3$DKA300:    (EMU1)   DATA3        R     0.33       0.98       0.00       2.33

Curiosity got the better of me:

SET DEV/READ=1 $3$DKA300:

SET DEV/READ=500 $4$DKB300:

Output:

                            OpenVMS Monitor Utility
                              DISK I/O STATISTICS
                                 on node EMU2
                             8-AUG-2014 21:51:47.31

I/O Operation Rate                         CUR        AVE        MIN        MAX

$4$DKB0:      (EMU2)   ALPHASYS1          0.00       0.15       0.00       5.00
$4$DKB100:    (EMU2)   DATA1              0.00       0.00       0.00       0.00
$4$DKB200:    (EMU2)   DATA2              0.00       0.00       0.00       0.00
$4$DKB300:    (EMU2)   DATA3             32.33       5.32       0.00      34.66
$4$DKB400:    (EMU2)   ADMIN1             0.00       0.00       0.00       0.00
DSA1:                  DATA1              0.00      24.77       0.00      76.00
DSA2:                  DATA2              0.00      12.48       0.00      76.66
DSA3:                  DATA3           2340.66     289.70       0.00    2361.00
$3$DKA100:    (EMU1)   DATA1        R     0.00      24.77       0.00      76.00
$3$DKA200:    (EMU1)   DATA2        R     0.00      12.48       0.00      76.66
$3$DKA300:    (EMU1)   DATA3        R  2341.00     289.57       0.00    2361.00

It's like the new disk has stolen all the bandwidth... :-(

Volker Halle · ‎08-08-2014

Mark,

It's like the new disk has stolen all the bandwidth... :-(

Or all the lower SCSI IDs have some problem ? How about the performance of $3$DKA400: ?

The experiments with changing Read Cost have shown, that this would be a valid, quick and safe workaround to decrease your backup runtime, until this problem can really be solved.

Volker.

MarkOfAus · ‎08-08-2014

Volker,

"Or all the lower SCSI IDs have some problem ? How about the performance of $3$DKA400: ?"

Nope. I recall I tested it. It came up as 81. A tad better than 76 (I'm an optimist).

                            OpenVMS Monitor Utility
                              DISK I/O STATISTICS
                                 on node EMU1
                             8-AUG-2014 19:52:06.38

I/O Operation Rate                         CUR        AVE        MIN        MAX

$3$DKA0:      (EMU1)   ALPHASYS           4.00       1.10       0.00      24.66
$3$DKA100:    (EMU1)   DATA1              0.00      36.45       0.00      76.00
$3$DKA200:    (EMU1)   DATA2              0.00       8.28       0.00      76.66
$3$DKA300:    (EMU1)   DATA3              0.66     551.49       0.00    2834.33
$3$DKA400:    (EMU1)   ADMIN             21.33       7.75       0.00      81.00
DSA1:                  DATA1              0.00      36.45       0.00      76.00
DSA2:                  DATA2              0.00       8.28       0.00      76.66
DSA3:                  DATA3              0.66     548.33       0.00    2834.33
$4$DKB100:    (EMU2)   DATA1        R     0.00       0.00       0.00       0.00
$4$DKB200:    (EMU2)   DATA2        R     0.00       0.00       0.00       0.00
$4$DKB300:    (EMU2)   DATA3        R     0.66      11.09       0.00      50.33

I really appreciate all your help, Volker. Thank you.

I will use your suggestion of setting the read cost appropriately to preference the remote disks. That is a great solution/workaround.

(If the moderators and yourself, Volker, don't mind, I will leave this as unsolved and await the resolution of the case from HP to hopefully resolve this conundrum).

Cheers,

Mark.

Just a note for anyone reading this, the command to reset the read cost is:

SET DEVICE/READ=1 DSA1:

You may specify any read cost because the O/S ignores it and resets the read costs for each device in the shadow set to its default.

Volker Halle · ‎08-08-2014

Mark,

now if ALL disks on that SCSI bus EXCEPT the newly replaced one perform badly after the replacement, you might also have to ask yourself: is the replacement disk $3$DKA300 itself causing the problems ? Is that disk of the SAME type as the previous one ?

Volker.

MarkOfAus · ‎08-08-2014

Hi Volker,

"now if ALL disks on that SCSI bus EXCEPT the newly replaced one perform badly after the replacement, you might also have to ask yourself: is the replacement disk $3$DKA300 itself causing the problems ?"

Indeed.

Perhaps, come Monday morning (it's Saturday morning here now), I could take that disk back out of the shadow set and see what happens running some backup tests?

"Is that disk of the SAME type as the previous one ?"

NO. When HP sent me the replacement the box contained a note stating this disk is a comparable replacement to the disk that has failed. It was a different disk. It had different sectors per track and tracks per cylinder but the general capacity listed as 36GB was the same.

Our contract specifies ours are 3R-A3079-AA, but the replacement was not that. I checked this even before I saw the note in the bottom of the box.

What's more the normal procedure for returning the bad disk was skipped; they didn't want it. To quote the person:

"The part is a non returnable therefore we no longer require the part to be returned."

Cheers,

Mark

Dennis Handly · ‎08-09-2014

>I attempt to attach the .csv from T4 to which this junk software says:
>"The contents of the attachment doesn't match its file type."

I also get that attachment error and Excel likes that file fine. Your .zip solution is a good workaround.

Forum problems should be reported here:

http://h30499.www3.hp.com/t5/Community-Feedback-Suggestions/bd-p/community-feedback-suggestions

>If ... Volker, don't mind, I will leave this as unsolved ...

You can still assign Kudos to each helpful answer by clicking on the Kudos star.

Hoff · ‎08-12-2014

What specific sort of disk is it? SHOW DEVICE /FULL $3$DKA300: will usually give some sort of an identity for the device, otherwise there are some SCSI-level tools to gather information.

With V7.3-2, you have access to dissimilar device shadowing (DDS) and dynamic volume expansion (DVE), two features that are worth enabling when you next have some scheduled downtime or can otherwise dismount the shadowsets briefly.

36 GB disks are ancient, too. I've been scrounging new-old-stock 146 GB SCSI disk drives for ~US$40 for a while now. Amazon has available HP 72 GB 15K SCSI disks in what is likely the appropriate "Universal" disk sled for that AlphaServer DS25 for ~US$15, with free shipping. Several different 146 GB 15K HP drives on offer for ~$37 to $40 over there, too.

I'd look to patch to current V7.3-2 (or to upgrade), enable DDS/DVE on the disks, and start a migration to those 72 GB or 146 GB drives, and then let HP know you've upgraded your storage for purposes of the support contract.

MarkOfAus · ‎08-14-2014

The situation was finally resolved.

I placed a support call and after some procrastination by the people at HP that do such diagnosis they came to the conclusion that the tape drive controller needs replacing. What?!?!?!

This was because they saw errors logged in ERRLOG.SYS at the time I hot-swapped the drive. This was DESPITE the fact I told them, and showed them, the evidence I assembled above indicating all BUT the new disk are performing badly.

A HP Engineer came to the site bearing a replacement SCSI controller for the Tape drive.

Before he replaced the card he said the hot-swap was undoubtedly the problem. Other systems with RAID controllers we have are more forgiving of hot-swaps. Even though the hardware supports hot-swap, it seems OpenVMS doesn't necessarily behave well afterwards (I am paraphrasing the engineer).

So, a reboot solved the problem.

Here is the final output from running BACKUP/PHY/NOCRC/GROUP=0 disk: NLA0:x.x/SAVE on all disks in the system after a system reboot:

                             OpenVMS Monitor Utility
                               DISK I/O STATISTICS
                                  on node EMU1
                             14-AUG-2014 00:03:54.70

I/O Operation Rate                         CUR        AVE         MIN        MAX

$3$DKA0:      (EMU1)   ALPHASYS        2403.66     272.33        0.00    2404.00
$3$DKA100:    (EMU1)   DATA1              0.00     429.30        0.00    2519.33
$3$DKA200:    (EMU1)   DATA2              0.00     297.22        0.00    2526.33
$3$DKA300:    (EMU1)   DATA3             34.00     341.23        0.00    2674.33
$3$DKA400:    (EMU1)   ADMIN              0.00     428.48        0.00    3497.00
DSA1:                  DATA1              0.00     429.30        0.00    2519.33
DSA2:                  DATA2              0.00     297.22        0.00    2526.33
DSA3:                  DATA3             34.00     341.14        0.00    2674.33
$4$DKB100:    (EMU2)   DATA1        R     0.00       0.15        0.00      12.00
$4$DKB200:    (EMU2)   DATA2        R     0.00       0.28        0.00       5.00
$4$DKB300:    (EMU2)   DATA3        R    34.00      24.63        0.00      50.33

So, basically, a disk can be hot-swapped into the DS25E (using the on-board motherboard controller) but a reboot needs to be scheduled for everything to get back to normality.

A partial solution, until the reboot can be scheduled, and to reduce the slowness of a backup to tape is to change the read_cost of the disks that make up the shadow set to bias them towards the remote server.

Thank you to Volker for his invaluable help and partial solution/interim work-around.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: Extremely slow backup to LTO Tape after replacement of hot plug (shadowed) internal SCSI disk

Extremely slow backup to LTO Tape after replacement of hot plug (shadowed) internal SCSI disk