
Backup block size

Wim Van den Wyngaert
Honored Contributor

Backup block size

Are there any disadvantages to specifying a block size of 64K when doing backups to DLT*?

And what about remote backups (a backup to a T2T, passing the save set to a convert command)?

And why has the default not been increased from 8K to 64K?

Wim
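For illustration, a minimal sketch of the kind of command being asked about; the disk, tape device, and save-set name are hypothetical, and 65535 (just under 64K) is the figure quoted later in the thread:

$ ! hypothetical full image backup of a disk to DLT with a ~64K block size
$ BACKUP/IMAGE DKA100: MKA500:FULL.BCK/SAVE_SET/BLOCK_SIZE=65535/REWIND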
15 REPLIES
Uwe Zessin
Honored Contributor

Re: Backup block size

For me, the main disadvantage is that I cannot $COPY such a saveset from tape to disk and treat it as a container file. When I tried it the last time, the record size was limited to 32767 bytes (did you know OpenVMS was a 16-bit OS? ;-). So I use 32256.
.
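A sketch of the round trip being described, assuming a hypothetical tape MKA500:, volume label FULL01, and disk paths; 32256 keeps the record length under the 32767-byte limit mentioned above:

$ ! write the save set with a block size that COPY can handle later
$ BACKUP/IMAGE DKA100: MKA500:FULL.BCK/SAVE_SET/BLOCK_SIZE=32256/REWIND/LABEL=FULL01
$ ! later: mount the tape as a labelled volume and copy the save set to disk
$ ! (details such as the MOUNT block size may vary with your setup)
$ MOUNT/BLOCK_SIZE=32256 MKA500: FULL01
$ COPY MKA500:FULL.BCK DKA200:[SAVESETS]
$ DISMOUNT MKA500: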
Karl Rohwedder
Honored Contributor

Re: Backup block size

Wim,

as far as I know, when specifying block sizes greater than 32K it is no longer possible to copy the saveset from tape to disk.

regards Kalle
Ian Miller.
Honored Contributor

Re: Backup block size

I think the default block size is going to be increased in a future version of VMS.

For remote backups you are limited to 32K due to RMS limitations.
____________________
Purely Personal Opinion
David Jones_21
Trusted Contributor

Re: Backup block size

(aside) Do modern tape technologies actually still have interrecord gaps and tape marks or do they coalesce writes into large aggregate structures from which they emulate having tape-like records?
I'm looking for marbles all day long.
Mike McKinney
Occasional Advisor

Re: Backup block size

Wim,

For years I've been specifying /BLOCK=65535/GROUP=20 with great results on DLT. The 65535 was suggested by DEC back in VMS 4.n, and the group size puts 20 blocks together between IRGs (and yes, IRGs still exist, at least on DLT).

There are other changes that you can make, if you are interested, to make backups really fly. Increase the working set, BIOLM, DIOLM, etc. on the BACKUP account and you can really get some I/Os going. Let me know if you want to try it and I'll give you the values that I use.

Mike
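For what it's worth, a sketch of the kind of AUTHORIZE changes being alluded to here; the account name and every number below are purely illustrative placeholders, not Mike's actual values:

$ RUN SYS$SYSTEM:AUTHORIZE
UAF> MODIFY BACKUP_ACCT/WSDEFAULT=16384/WSQUOTA=32768/WSEXTENT=65536/BIOLM=128/DIOLM=128
UAF> EXIT

Note that later replies in this thread argue for capping DIOLM rather than raising it, so treat the DIOLM figure with particular suspicion.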
Uwe Zessin
Honored Contributor

Re: Backup block size

No, /GROUP_SIZE specifies how many blocks are combined in a redundancy group. It writes that many blocks, then creates a recovery block (similar to RAID5) and writes it to tape, too.
.
Ian Miller.
Honored Contributor

Re: Backup block size

"(did you know OpenVMS was a 16-bit OS ? ;-). " not really but parts of the file system where inherited from one (PDP11 RSX).

____________________
Purely Personal Opinion
Uwe Zessin
Honored Contributor

Re: Backup block size

Guess why there was a ";-)" attached.

I know about the RSX history and why you can only use a 15-bit record length on disk.
.
Wim Van den Wyngaert
Honored Contributor

Re: Backup block size

For the youngsters that don't know what the old guys are talking about:
http://www.village.org/pdp11/faq.html
Wim
comarow
Trusted Contributor

Re: Backup block size

I notice there were some references to /group.

The latest reference URL on backup performance suggests setting /group=0.

The entire set of recommendations for backup performance has changed remarkably, in particular the reduction in DIOLM.

This is especially true given the thrashing that occurs in SAN cache with a high DIOLM.

To see the newly documented suggestions, see

http://h71000.www7.hp.com/doc/82FINAL/aa-pv5mj-tk/00/01/117-con.html
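For contrast with the /GROUP=20 advice earlier in the thread, an invocation along the lines comarow describes might look like this; only the /GROUP_SIZE=0 part comes from the cited recommendation, while the device, save-set name, and other qualifiers are illustrative:

$ BACKUP/IMAGE DKA100: MKA500:FULL.BCK/SAVE_SET/BLOCK_SIZE=65535/GROUP_SIZE=0/REWIND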
Rob Young_4
Frequent Advisor

Re: Backup block size


> The entire recommendations for backup
> performance have changed remarkably.
> Especially the reduction diolm.

>This is especially true with thrashing that
>occurs in SAN cache with a high diolm.

>To see the newly documented suggestions, see

>http://h71000.www7.hp.com/doc/82FINAL/aa-pv5mj-tk/00/01/117-con.html

Keith Parris comments at length in c.o.v
about this:

http://tinyurl.com/ddct3

8.2 (and 7.3-2 with FIBRE_SCSI V0400 and above) does back off much more
aggressively than 7.3-1 (and vanilla 7.3-2) in the face of Queue Full
events. Brian Allison's Bootcamp presentation i222 covered this:

Prior to 7.3-1, VMS had a variable maximum I/O queue depth per LUN, in
the range of 3 to 16, based on recent I/O sizes. This caused severe
performance problems, particularly on RAID LUNs (which can have lots of
independent disk-head actuators, and we really want to keep all of them
busy for best throughput) and for large I/O sizes (which could reduce
the allowable queue depth down to 3), because of the very-small maximum
queue depths allowed. (Aside: On MSCP controllers, I've observed queue
depths in the 100s, even 1000s, without problems. But the marketplace
said Proprietary is Bad, Industry Standard is Good. And we're doing our
best to provide all the advantages of the old Proprietary solutions on
the less-expensive Industry Standard foundation, by applying our
engineering skills on the software side.)

In V7.3-1 VMS moved to a per-storage-port scheme where the host didn't
limit I/O queue depths until the storage sub-system asked it to back off
(via a "queue full" response). Upon receiving the "queue full" response
VMS issued no more I/Os to that port until half of the outstanding I/Os
to that port had completed, and then VMS again allowed the queue size to
build up until it got another "queue full".

Unfortunately, due to the large number of commands that can be in-flight
in a SAN, the V7.3-1 / V7.3-2 algorithm was too aggressive:
o Many mount verification messages can result when the same I/O gets a
"queue full" response several times in a row
o Performance suffers badly on the HSG when it has to return "queue
full" responses
o In extreme cases, the HSG can crash if it receives more I/O after
signaling "queue full"

In V8.2 (and 7.3-2 with FIBRE_SCSI V0400 and above), VMS moved to an
algorithm that drains 1/2 of the existing I/O requests and then allows
the queue depth to increase by 1 entry every 5 seconds. It was hoped
that this, combined with HSG ACS 8.8, would solve the problems.

Unfortunately this modified algorithm seems to have been a little too
aggressive in backing off I/O after a "queue full" condition. Currently,
once a "queue full" occurs we throttle traffic to that I/O port forever.
Traffic rates are allowed to gradually increase, but if the I/O load
ever has to throttle back, the re-ramp time is slow and impacts
performance. So FIBRE_SCSI kits are now in the works to pick a better
I/O ramp scheme.

This might explain your symptoms of slower I/O for a period of time.

How can you avoid "queue full" events? One way is to spread the I/Os
across as many controller ports as possible. If I/Os are predominantly
reads, going to a 2-member or 3-member shadowset across multiple
controllers could reduce the I/O load by as much as 1/2 or 2/3 on a
given controller port. Using host-based RAID software to divide the I/Os
across disks in different controllers can help for both reads and writes
equally (forming RAID-0 arrays, or RAID 0+1 arrays in conjunction with
Shadowing). If "queue fulls" occur mostly during Backups, reducing
process quotas for the process running Backup could help. And of course
doing as much caching as possible in the host (by using XFC, RMS Global
Buffers, database caches, etc.) can help by avoiding I/Os to the
controller as much as possible.


---

I believe the fast 7.3-1 algorithm was/is a storage-platform-centric problem (a certain large third-party storage provider doesn't seem to have this issue).

I speculate the monkeying with the algorithm was an outgrowth of problems similar to this (not discounting Keith's excellent analysis, just adding to it):

http://tinyurl.com/auapr

And the root cause analysis of the above
problem:

http://tinyurl.com/bxw4n

Anywho, the root problem that we are seeing is that our mighty ES40
running OpenVMS 7.3-1 is simply OVERWHELMING the ESA12000 with I/O
requests. The VMS Engineering person told me that some of the
performance tweaks in 7.3-1 really make VMS fly when it comes to I/O.
Now our ES40 is demanding data so fast from the HSG80s that eventually
the HSG80s tip over.

This makes perfect sense. The three times that we have been bitten by
this problem, we had *extremely* heavy I/O on the ES40. The first
time, we were running three concurrent backup streams from snapshots
AND running our month-end batch processing. The poor HSG80s could not
handle the load and gave up (which corrupted our Cache data files and
made us restore from tape).

The VMS Engineering person told me that the HSG80s have a total queue
depth (if that is the correct term) of *240* outstanding I/O requests.
After that, the controllers try to tell the host system to slow down
a little. But the ES40 and VMS 7.3-1 are hungry for more data and
finally the HSG80 faints.

First Fix: The guru from VMS Engineering asked me to check the DIOLM
setting on the account that we use to run our backup jobs. Knowing
that the HSG80s have a maximum queue depth of 240, we don't want to
bury the HSG80s any more. In Authorize, I found that DIOLM for our
backup account was set to 32767. Three backup jobs running at the
same time under that account were issuing TONS and TONS of I/O
requests and burying the HSG80s.

So, per his advice, I set the DIOLM for our backup account to "32".
This will give very good backup performance and still not bury the
HSG80s.

Second fix: The VMS Engineering guru told me to install
DEC-AXPVMS-VMS731_MSA1000-V0100 as soon as I can. It fixes a timeout
value for fibre channel read/writes. The value got set to "4 seconds"
in VMS 7.3-1 and this patch changes the timeout value back to "24
seconds". This will help the OS be more tolerant when the HSG80s are
being pokey with returning requested data.

---

Summarizing:

Be at a recently patched up rev of VMS and
you will have a throttling algorithm
in place to prevent IO nastiness.

Keith comments on controller firmware also:

[patched 7.3-2+]
"combined with HSG ACS 8.8, would solve the problems."

I'd check with engineering about 7.3-1
and IO issues and whether you would
be at risk if patched (I don't have a definitive answer).

If you are on an early or minimally patched 7.3-1, guard against
overwhelming certain types of storage back-ends by limiting DIOLM
(a kludge).

Rob
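A sketch of the DIOLM kludge described above; the account name is a placeholder, and 32 is the value quoted from VMS Engineering earlier in this post:

$ RUN SYS$SYSTEM:AUTHORIZE
UAF> MODIFY BACKUP_ACCT/DIOLM=32
UAF> EXIT

The new quota takes effect the next time the account logs in (or its next batch job starts), not for processes already running.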
Ian Miller.
Honored Contributor

Re: Backup block size

See the 26th August entry here
http://www.eight-cubed.com/blog/

for DCL to check for QFUL events. The OpenVMS V8.2 documentation on BACKUP quota recommendations is much improved over previous versions, and its advice can also be applied on previous versions.
____________________
Purely Personal Opinion
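If the FC SDA extension is available on your version, the interactive check looks something like this (FC STDT/ALL is the display mentioned in the reply below; presumably the blog's DCL automates something along these lines):

$ ANALYZE/SYSTEM
SDA> FC STDT/ALL
SDA> EXIT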
Wim Van den Wyngaert
Honored Contributor

Re: Backup block size

But not working on 7.3 (fc stdt/all).

Wim
Ian Miller.
Honored Contributor

Re: Backup block size

I don't know if the FC SDA extension existed for VMS Alpha V7.3

____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: Backup block size

FYI: I found out that Unicenter TNG performance solution 2.1 can produce a graph with tape throughput (custom graph per user).

I increased the working set for backup from 8K pages to 32K pages but performance didn't improve (7.3 on a 4100 with a TZ88).

Wim