Interpreting Library Tools reports

berk basarir
Occasional Contributor

Dear all,

I'm performing some checks on L&TT reports to pinpoint possible problems with LTO drives.

From previous posts, I already know that the write error counters are a good place to start. Is there a guideline for interpreting the read error counters?

Also, can the read/write error rate logs be used to view recent problems, or do they only show counters accumulated in the past?

Thanks in advance,

Bërk
5 REPLIES
Richard Bickers
Trusted Contributor

Re: Interpreting Library Tools reports

Hi Berk,

We have moved away from looking only at error rate counters because they measure only one aspect of the health of the drive.

The best measure for LTO drives is what we call 'capacity loss'. This is the capacity lost when the drive re-writes data that did not go down onto tape well enough. If there is any doubt, the data is re-written.

This takes account of error rate, clogged heads, servo issues, temperature etc. And because it's measured by the read-while-write heads which are affected by cross-talk, it's even more conservative than normal reads.

We moved to LTO reports to make the whole business of measuring drive health much easier. We translate all key parameters (such as capacity loss) into a normalised margin where 100% margin is production quality and 0% margin is equivalent to 'still works fine but only just'. Anything less than zero should be followed up.
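
As a rough illustration of the normalised margin idea, here is a minimal sketch. It assumes a simple linear mapping between two reference points; the formula L&TT actually uses is not given in this thread. The function names and the reference values (1.0% capacity loss for production quality, 5.0% for 'only just works') are hypothetical and chosen only to make the example concrete.

```python
def normalised_margin(value: float, production_value: float, limit_value: float) -> float:
    """Map a raw metric onto a margin scale.

    production_value maps to 100% margin, limit_value maps to 0% margin.
    ASSUMPTION: the mapping is linear; the real L&TT formula is not stated.
    """
    if production_value == limit_value:
        raise ValueError("reference points must differ")
    return 100.0 * (limit_value - value) / (limit_value - production_value)


def needs_follow_up(margin_pct: float) -> bool:
    """Per the post above: anything less than zero should be followed up."""
    return margin_pct < 0.0


# Hypothetical reference points for capacity loss, in percent (not L&TT values).
PRODUCTION_LOSS = 1.0
LIMIT_LOSS = 5.0

for loss in (1.0, 3.0, 5.0, 6.0):
    margin = normalised_margin(loss, PRODUCTION_LOSS, LIMIT_LOSS)
    print(f"capacity loss {loss:.1f}% -> margin {margin:.0f}%, "
          f"follow up: {needs_follow_up(margin)}")
```

The point of the scale is that every parameter can be read the same way: positive margin is acceptable, negative margin needs attention, whatever the underlying unit.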

If you have concerns over drive health you can just look at the Drive Health section of the report (LTT support ticket) and see what the Device Analysis rules say and what the margins are like. The rules analyse the internal fault logs which give a great health history of the drive but are very difficult to navigate manually. This will give the most accurate assessment of drive health from the report. We use it ourselves.

And the best way of actively assessing health is to run the drive assessment test which combines all of the above with an active write/read test on a known tape.

I could talk you through the whole error rate question but it's complex and only part of the picture.

Can you get what you need from the Drive Health section? If not, we'd like to know what the gaps are so we can look into filling them.

If you like, you could zip and attach the report and we'll take a look.

Hope this helps.

Richard (LTT)
It's more interesting when it's gone wrong
berk basarir
Occasional Contributor

Re: Interpreting Library Tools reports

Hi Richard,

Thank you for your reply. Here are the details I've extracted for 3 tape devices. The one that reported the error has the id 0.1.0, shown in column 2. I'll attach the support ticket below as well.


value                              drive (3/0.2.0)   drive (3/0.1.0)   drive (3/0.3.0)

reread C2 (read errors)            4                 4922              27
Datasets read (read errors)        52                49628(*)          380

CCQ retries (write errors)         000C9686          06E0A4C1          3EEC1
CCQ written (write errors)         289678D2          36E6B9F0          11632639
% CCQ retries / CCQ written        0.12              12.5              0.088

rewritten CCQs                     -                 5355              4423
CCQs written                       -                 6066048           6052096
% rewritten CCQs / CCQs written    -                 0.78              0.78
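
For anyone wanting to reproduce the "% CCQ retries / CCQ written" row, here is a minimal sketch. It assumes the CCQ retries and CCQ written counters are hexadecimal, which is consistent with the percentages quoted for all three drives. The function name and the dictionary layout are my own, not anything L&TT provides.

```python
def retry_percentage(retries_hex: str, written_hex: str) -> float:
    """Return CCQ retries as a percentage of CCQs written.

    Both arguments are counter values as hexadecimal strings
    (ASSUMPTION: the report shows these two counters in hex).
    """
    retries = int(retries_hex, 16)
    written = int(written_hex, 16)
    if written == 0:
        raise ValueError("CCQ written counter is zero")
    return 100.0 * retries / written


# Counter values copied from the figures posted above, keyed by drive id:
# (CCQ retries, CCQ written)
counters = {
    "3/0.2.0": ("000C9686", "289678D2"),
    "3/0.1.0": ("06E0A4C1", "36E6B9F0"),
    "3/0.3.0": ("3EEC1", "11632639"),
}

percentages = {
    drive: retry_percentage(retries, written)
    for drive, (retries, written) in counters.items()
}

for drive, pct in percentages.items():
    print(f"drive {drive}: {pct:.3f}% CCQ retries / CCQ written")
```

Drive 3/0.1.0 comes out at roughly 12.5%, about two orders of magnitude above the other two drives.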


Since the tape drive reported no errors with other cartridges, I'm confident this error was caused by a faulty cartridge.

Kind regards,

Bërk
Richard Bickers
Trusted Contributor

Re: Interpreting Library Tools reports

Hi Berk. You may have only attached the header of the ticket.

A ticket is made up of a header and a datafile. Easiest way is to 'save as' into a directory that you can name and you will see the two files. Just zip up both and attach and I'll get everything then.

This will be made easier in the next release!

Richard.
berk basarir
Occasional Contributor

Re: Interpreting Library Tools reports

Hi,

Sorry for the confusion. Here are the formatted columns as well:

read errors: reread C2             4922
read errors: Datasets read         49628
unreadable datasets                1

write errors: CCQ retries          06E0A4C1
write errors: CCQ written          36E6B9F0
% CCQ retries / CCQ written        12.5

rewritten CCQs                     5355
datasets written                   47391
CCQs written                       6066048
% CCQ rewritten / CCQ written      0.78

Kind regards,

Berk
Richard Bickers
Trusted Contributor

Re: Interpreting Library Tools reports

Hi Berk.

This is an interesting ticket. It does show some element of retrying and the odd read error with this particular tape but there's no record of that error being reported to the host. The Device Analysis didn't pick up on anything serious. Did you get backup failures with this tape?

By comparison, this tape has not performed as well as other tapes (there are a few retries). These issues have been picked up in the Cartridge History section of the ticket, which does suggest you discard it.

It may be that the read error was a read-ahead error and therefore could be retried successfully if the data was actually needed.

To me this reinforces the strategy of pulling the ticket and viewing the results as a whole, rather than focussing on one particular metric, which could be misleading.

Are you reacting to specific issues here or trying to develop a process for finding bad tapes?

Cheers, Richard.