Tape Libraries and Drives

SCSI Log Sense -- Error Counter Pages

Brian Eickhoff
Occasional Advisor


I have been evaluating two different HP LTO tape drive models (Ultrium 448 & 960) in addition to two other models from different vendors. Comparing these four drive models, the HP tape drives are the only ones that seem to accumulate a large number of errors in the SCSI error counter pages (hundreds or thousands of errors for a large transfer). Specifically, these Log Sense error counter pages are:

write errors (page code 02h)
parameter codes:
0000h - Write errors corrected w/o delay
0004h - Total number of write retries

read errors (page code 03h)
parameter codes:
0000h - Read errors corrected w/o delay
0004h - Total number of read retries

Since the exact definition of these error counters is not part of the SCSI standard, can someone tell me how they are implemented on the HP Ultrium tape drives? Despite having the high error count, the drives perform as advertised. However, I would feel better about the reliability of these HP LTO drives if I knew what these errors really meant. I have not been able to find an HP SCSI Reference document with answers to any of these questions. Does anyone else have experience working with HP's SCSI Log Sense data? Thanks!
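
For anyone else reading the counters directly rather than through a vendor tool, here is a minimal Python sketch of how the bytes returned by LOG SENSE could be unpacked. It assumes the standard SCSI log page layout (a 4-byte page header followed by parameters that each carry a 2-byte code, a control byte, a length byte and a big-endian value). The function name and the sample buffer are made up for illustration and do not come from a real drive.

```python
import struct


def parse_log_page(buf):
    """Return (page_code, {parameter_code: value}) for one Log Sense page."""
    page_code = buf[0] & 0x3F                          # low 6 bits of byte 0
    (page_length,) = struct.unpack_from(">H", buf, 2)  # bytes 2-3, big-endian
    end = min(4 + page_length, len(buf))               # tolerate short buffers
    params = {}
    offset = 4
    while offset + 4 <= end:
        # Parameter header: code (2 bytes), control byte, value length (1 byte)
        code, _control, length = struct.unpack_from(">HBB", buf, offset)
        value = buf[offset + 4:offset + 4 + length]
        params[code] = int.from_bytes(value, "big")
        offset += 4 + length
    return page_code, params


# Hand-built sample (NOT real drive output): page 02h with parameters 0000h and 0004h
sample = bytes([
    0x02, 0x00, 0x00, 0x10,                          # page header, 16 bytes follow
    0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x01, 0x2C,  # parameter 0000h = 300
    0x00, 0x04, 0x00, 0x04, 0x00, 0x00, 0x04, 0xB0,  # parameter 0004h = 1200
])
page, counters = parse_log_page(sample)
print(hex(page), counters)
```

How the buffer is obtained (pass-through ioctl, sg utilities, and so on) depends on the operating system and is left out here.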
14 REPLIES
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

Hi Brian,

The logs have 7 parameters each and the HP drives only use a subset. You can use HP Library and Tape Tools (LTT) to extract and view the data in a support ticket, though it sounds like you're using a more direct SCSI approach.

Write error counters log page:
0 Errors corrected without substantial delay - not used - are you finding numbers in here?
1 Errors corrected with possible delays - not used - are you finding numbers in here?
2 Total Sum of parameters 3 and 6
3 Total errors corrected - The number of data sets that needed to be physically rewritten through repositioning - only happens if 4m of CCQ rewrites haven't been successful
4 Total times error correction processed - Number of CCQ sets rewritten - a CCQ is a piece of a dataset that can be re-written. It uses more tape but allows the writes to continue streaming. It's like sparing over small sections of media
5 Total data sets written - a dataset is 400 KB for LTO 2, 1.6 MB for LTO 3
6 Total uncorrected errors - The number of data sets that could not be written - even after CCQ re-writes and retries, i.e. a write failure.

The drives use read-while-write to measure the quality of the data on tape as it goes. The usual impact of poor media or dirt is to use more tape by writing extra CCQs. We call this capacity loss. This is the most accurate measure of write quality as it takes account of other factors such as tracking which can also result in CCQs being re-written. This can be calculated by comparing CCQs written with CCQ retries from the LTT support ticket in the 'write error rate log'. Usually <1% but can vary. 5% is fine, 10% is unusual, and it will still work up to 50% (though we consider 20% to be returnable).

Even at 1% you will normally see lots of CCQ re-writes for large transfers. Don't worry!

Read error counters log page:
0 Errors corrected without substantial delay - not used - are you finding numbers in here?
1 Errors corrected with possible delays - not used - are you finding numbers in here?
2 Total Sum of parameters 3 and 6
3 Total errors corrected - The number of data sets that were corrected after a physical read retry
4 Total times error correction processed - Number of times logical (C2) error correction is invoked - i.e. some of the write redundancy is used
5 Total datasets processed (read)
6 Total uncorrected errors - The number of data sets that could not be read after retries - i.e. a read failure.

I wouldn't expect to see too many C2 error corrections but 1 per 100 datasets is reasonable.
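
As a rough illustration of that read-side check, the sketch below turns parameter 4 (C2 corrections invoked) and parameter 5 (datasets read) into a rate per 100 datasets. The function name and the example counts are hypothetical; only the "1 per 100 datasets is reasonable" guideline comes from the description above.

```python
def c2_corrections_per_100_datasets(param_4, param_5):
    """C2 error corrections invoked per 100 datasets read.

    param_4: read page (03h) parameter 4, times C2 correction was invoked
    param_5: read page (03h) parameter 5, total datasets read
    """
    if param_5 <= 0:
        raise ValueError("no datasets read yet")
    return 100.0 * param_4 / param_5


# Example counts are invented for illustration only
rate = c2_corrections_per_100_datasets(150, 20000)
print(f"{rate:.2f} C2 corrections per 100 datasets")
```

A result around 1 or below would match the guideline; anything well above it is a prompt to look closer, not a pass/fail limit.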

It's really hard to compare different vendors' drives with these measures because they are so vendor specific, but I hope the above helps you untangle some of the numbers coming back from the HP drives. We watch these figures very carefully during production and also use them in support (via LTT) to determine drive health. Watch out for 'LTT reports' coming in February (LTT 4.0 SR1), which translates all of this into English!

Good luck with your selection. Be interested to know how you get on.
It's more interesting when it's gone wrong
Brian Eickhoff
Occasional Advisor

Re: SCSI Log Sense -- Error Counter Pages

Hi Richard,

Thank you for the fast response. I will look into computing the retry rate to compare with the typical retry rates that you have given. You may be interested to know that I am finding numbers for parameters 0 and 1 in the read and write log pages. Parameter 0 is always less than parameter 4, but still significantly large. It is rare that I see any numbers in parameter 1, but I have seen a value of "1" which seemed to correspond to the definition of a "possible delay." In other words, there actually was a delay in the throughput when I saw this value. Should I be ignoring these numbers?
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

Hi Brian,

I'll need to check with the firmware team when we're all back in the office about what the numbers mean in the first two parameters. They're most likely derivatives of the other numbers - we don't use them ourselves (i.e. HP support, LTT). Let you know.

I'd stick to the CCQ and C2 counts for judging write and read health for the HP drives. Trouble is you won't be able to make a direct comparison with the other drives because the use of these parameters is not defined to that level. It's whatever the vendor felt constituted 'with delay' and 'without delay'.

The best measure of read and write quality for HP Ultrium drives will be coming with 'LTT reports'. If you're interested we can let you have a beta copy some time in January. We're looking for customer feedback and this may also be an easy way for you to make your measurements.
It's more interesting when it's gone wrong
Brian Eickhoff
Occasional Advisor

Re: SCSI Log Sense -- Error Counter Pages

Hi Richard,

Thanks again for the help. If your firmware team can provide any additional information regarding the implementation of these error counter pages (specifically the first two parameters), it would be greatly appreciated. Thanks!
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

Hi Brian,

I've spoken to the firmware team and they are going to check the details for me. It seems the first two parameters are either the same or derived from the subsequent ones which is why we don't use them in support.

We'll get the definitions for you but suffice to say:
* It is normal to have numbers in there (I hadn't realised...)
* We focus on C2 invocations for reads and CCQ re-writes for writes in support
* In both cases you'd expect <1% per dataset for a top performing drive but it'll still work fine up to 20% and even beyond.
* The format and the way the drives work is designed to work with large numbers of errors (i.e. poor media). Typically the drives operate well within their margin. Exceptions tend to be as a result of misuse or contamination.
* The values from an HP drive will be completely different from another vendor's drive so unfortunately you can't compare like for like.

I'd be interested to see what sort of figures you're getting and will be able to comment on the health of your drive/media (which is what LTT will do automatically for you). Perhaps you could post an example or two.
It's more interesting when it's gone wrong
Brian Eickhoff
Occasional Advisor

Re: SCSI Log Sense -- Error Counter Pages

Hi Richard,

I have attached a few lines from a log file generated by my test program. The data is from a typical write sequence on an HP Ultrium 960 LTO 3 tape drive. The results are very similar on an HP LTO 2 drive, and in total I have seen these error counts on 10 different HP LTO drives.

My test software was developed for use in environmental tests with LTO drives in hope that I would be able to see when drive performance was beginning to deteriorate. Data transfer rates can reveal a lot of information--especially when a drive is able to operate through an external event with a minimal drop in data throughput. However, error counts are another parameter that can be watched to monitor the drive for failures. With the HP drives I have been using, monitoring error rates has been less useful because they seem to accumulate logged errors in even the most ideal operating environments. (Please do not think that I am using damaged hardware--I have seen this behavior out of the box.) This has led me to wonder if these errors are real, or if HP has implemented these error logs vastly different from other LTO manufacturers. (I do understand that you cannot make a direct comparison of these error counts.)

The upper half of my attached data shows the start of a long write transfer. Notice that errors begin to accumulate almost immediately after data has been written to the tape.

The lower half of my attached data is from the same write transfer--now after transferring more than 24 GB of data. I chose to include this portion of the transfer because it shows the appearance of a number in parameter 1 (write errors corrected with possible delay). As defined in the SCSI specification, this happened after a drop in throughput caused by an environmental shock pulse. This leads me to believe that this particular register (Parameter 1) has been implemented in your current firmware revision.

As for parameters 0 and 4, these numbers continue to grow throughout the entire transfer. The unfortunate thing is that at the time of my testing I did not collect data on the number of data sets written. Therefore, I cannot derive any percentages to compare with the acceptable values that you have given me. Maybe you can make an estimate based on the amount of data transferred?

Again, any information you can provide about these error counters would be much appreciated. Your products work as advertised--it was just alarming to see these large numbers when other drives have few to none. Now that I have this error data, it would be nice to have an explanation to account for these differences. Preferably, I would like to have an explanation more technical than "different vendor, different implementation."

I'm also open to the idea that maybe HP is more "honest" about the number of times error correction is invoked and it's really the other drives' firmware that is reporting low values. Again, I understand it's all about the implementation. However, if these numbers are something other than error counts, I would suggest that they be corrected, removed, or openly documented.

Let me know if you have any questions about my attached data.

Thanks again,
Brian
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

Hi Brian,

Very interesting data. You've got a neat test system going there.

I can derive number of datasets from the volume of data. It looks like you're using non-compressing data as the data rate levels out at 80MB/s which is the native rate of our Ultrium 960. Please let me know if this assumption is wrong.

The LTO 3 dataset is 1.6MB so 24.5GB of data equates to 15,300 datasets. From this I'd expect less than 200 CCQ re-writes in a good drive (<~1%), though anything up to 4,000 would be reasonable, but your figures are way higher than that. They're so high I'm getting the feeling there's something weird going on here.

I think it would help me if you could pull an LTT ticket and attach it to this thread. That will give me the full story on your drive and maybe I can work out what's going on here. I'll also go back to the firmware team and see if I'm missing something...

To pull an LTT ticket, install and run LTT from www.hp.com/support/tapetools, then under the support tab hit 'view ticket' and then 'save as' with a name you choose. Please attach the .dat file that starts with your chosen name in the logs directory of the LTT install directory.

You've got me thinking now....
It's more interesting when it's gone wrong
Brian Eickhoff
Occasional Advisor

Re: SCSI Log Sense -- Error Counter Pages

Hi Richard,

Your assumption was correct that I was using non-compressible data, and the transfer rates reflect this. More precisely, I just coded a simple utility to generate a random file (of random bytes) of a given length to be used for the test.

I was responsible for writing the test program that generated the output data as seen in the attachment of my last response. I will admit that there is a possibility of a bug in the test code, but I don't see anything like this with drives from other vendors when running the exact same test. Other drives may generate a few errors, but they are usually very small in number and/or associated with external events. This gives me assurance that I am looking at the correct memory locations because the errors usually occur at logical times on non-HP drives. From the information that you have provided for me, HP adheres to this same log page mapping. So...?

The operating system used for our tests was Solaris 10. I know this may be a rival operating system, but the built-in "st" driver seems to recognize and support HP LTO drives. Our application also requires it. I'm not sure if I will be able to "pull an LTT ticket" because of the operating system we are using, but I will look into it. I can also tell you that we have run HP drives on at least two different processor/SCSI controller configurations and have seen the same results. The drives perform well (in terms of data throughput and overall usage)--they just show a lot of write/read errors in the log sense pages.

In case you were wondering, we see read errors in addition to the write errors. These error counters accumulate in about the same way as the write error counters. Usually there are fewer read errors than write errors.
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

Thanks Brian. Unfortunately we don't support LTT on Solaris which is a problem here.

I've got your data and will review it with the firmware team. I can probably run some similar tests myself and compare.

I agree that it all seems to work with other drives - but you've left me puzzled with the numbers coming back.

I'll have to get back to you.
It's more interesting when it's gone wrong
Richard Bickers
Trusted Contributor
Solution

Re: SCSI Log Sense -- Error Counter Pages

OK, I got it.

I was getting mixed up between datasets and CCQs. The capacity loss calculation is:

Re-written_CCQs/Good_CCQs * 100%

The key point is that in LTO 3, there are 128 CCQs per dataset (different figures for LTO 1 and 2).

So in your example, you'd written 24.5GB which is 15,300 datasets - which is 2M CCQs. In that time you'd re-written 14,000 CCQs which works out as 0.7%. Anything less than 1% is very good.
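
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes, 1.6 MB = 1.6 x 10^6 bytes), and the variable names are my own.

```python
DATASET_BYTES_LTO3 = 1.6e6       # 1.6 MB per LTO 3 dataset (decimal units assumed)
CCQS_PER_DATASET_LTO3 = 128      # CCQs per dataset for LTO 3

bytes_written = 24.5e9           # 24.5 GB transferred in the example
rewritten_ccqs = 14_000          # CCQ re-writes reported (parameter 4)

datasets = bytes_written / DATASET_BYTES_LTO3      # roughly 15,300 datasets
good_ccqs = datasets * CCQS_PER_DATASET_LTO3       # roughly 2 million CCQs
capacity_loss = 100.0 * rewritten_ccqs / good_ccqs  # Re-written_CCQs / Good_CCQs * 100%

print(f"{datasets:,.0f} datasets, {good_ccqs:,.0f} CCQs, "
      f"{capacity_loss:.2f}% capacity loss")
```

The result rounds to the 0.7% quoted above.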

If you want to track capacity loss on the fly you can use:

Param_4/(Param_5 * 128) * 100%

General rule of thumb:
- <1% very good
- 1-5% normal
- 6-10% less normal but don't worry
- 10-20% maybe clean the heads - will probably self-clean anyway
- >20% we start to worry if it stays up here. Think contamination.
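
For on-the-fly monitoring, the formula and the rule of thumb could be wrapped up roughly as follows. This is a sketch, not anything taken from LTT: the function names are hypothetical, and how values falling between the listed bands (for example 5.5%) are treated is my own assumption.

```python
def capacity_loss_percent(param_4, param_5, ccqs_per_dataset=128):
    """Param_4 / (Param_5 * ccqs_per_dataset) * 100%, write page (02h).

    The default of 128 CCQs per dataset applies to LTO 3.
    """
    if param_5 <= 0:
        raise ValueError("no datasets written yet")
    return 100.0 * param_4 / (param_5 * ccqs_per_dataset)


def classify(loss_percent):
    """Map a capacity loss percentage onto the rule-of-thumb bands."""
    if loss_percent < 1:
        return "very good"
    if loss_percent <= 5:
        return "normal"
    if loss_percent <= 10:       # assumption: 5-6% is folded into this band
        return "less normal but don't worry"
    if loss_percent <= 20:
        return "maybe clean the heads"
    return "worry if it stays up here; think contamination"


# Counter values taken from the example in this thread
loss = capacity_loss_percent(14_000, 15_312)
print(f"{loss:.2f}% -> {classify(loss)}")
```

Since a single reading can spike, it probably makes sense to look at the trend across several polls before acting on the upper bands.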

Sorry for the confusion. I got there in the end...

Good luck with the rest of your testing.
It's more interesting when it's gone wrong
Brian Eickhoff
Occasional Advisor

Re: SCSI Log Sense -- Error Counter Pages

One more piece of needed data...

What is the number of CCQs per dataset for LTO 2?

Thanks!
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

No worries. There are 64 CCQs in a dataset for LTO 2. Also true for LTO 1. We made the change for LTO 3 as the datasets are bigger.
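
If the same test code runs against several generations, the per-generation figure can live in a small lookup. A minimal sketch, with a hypothetical helper name; only LTO 1 to 3 are covered because those are the only values given here.

```python
# CCQs per dataset, as stated in this thread
CCQS_PER_DATASET = {1: 64, 2: 64, 3: 128}


def total_ccqs(param_5, lto_generation):
    """Total CCQs behind param_5 datasets written, for LTO generations 1-3."""
    try:
        per_dataset = CCQS_PER_DATASET[lto_generation]
    except KeyError:
        raise ValueError(f"no CCQ figure known for LTO {lto_generation}") from None
    return param_5 * per_dataset


# Same invented counters, different generation: the percentage doubles on LTO 2
for generation in (2, 3):
    loss = 100.0 * 640 / total_ccqs(1000, generation)
    print(f"LTO {generation}: {loss:.2f}% capacity loss")
```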

Cheers,

Richard.
It's more interesting when it's gone wrong
Brian Eickhoff
Occasional Advisor

Re: SCSI Log Sense -- Error Counter Pages

Have you learned anything more about what the numbers represent in Parameter 0? The ones that are "not used"? Just wondering...
Richard Bickers
Trusted Contributor

Re: SCSI Log Sense -- Error Counter Pages

I have. Parameter 0 contains the number of datasets that had a CCQ re-write in - i.e. one or more. That's why Parameter 0 tends to be a subset of Parameter 4.

We're interested in the total number of CCQ re-writes rather than this subset so the value isn't all that useful.

I've got a full run down from the firmware team on all of these parameters now. The rest of them are even more obscure... Shows what happens when a generic SCSI spec is left to interpretation!
It's more interesting when it's gone wrong