Operating System - OpenVMS

HSG80 ema failed disk replacement

 

What is the best practice for replacing a failed disk in an HSG80/EMA environment? Is it necessary to quiesce the channel the drive is on? How does this affect the production data flow, for example when members of a mirror set are on the same channel? I have been replacing disks without doing this, as long as no rebuilding is going on and the drive is in the failedset. I have had no problems.
10 REPLIES
Richard W Hunt
Valued Contributor

Re: HSG80 ema failed disk replacement

Don't know about EMA but I have ESA's where I've done hot-swapping for years.

The idea of a mirror set is that you never allow two members of the same set to be on the same channel if at all possible. For instance, on my ESA, the members of each mirror set are shifted one row AND one column from each other. So DGA1 (mirror set M1) is defined based on DISK10000 and DISK20100. That way, you never lose the mirror.
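
The placement rule above can be checked mechanically. The following is only a minimal sketch, and it rests on an assumption not stated in this thread: that disk names follow the DISKpttll pattern, with p the device-side port (channel), tt the target and ll the LUN. The helper names `port_of` and `shared_ports` are hypothetical, made up for this illustration.

```python
import re
from collections import Counter

# Assumption: disk names follow the DISKpttll pattern, where p is the
# device-side port (channel), tt the target ID, and ll the LUN.
_DISK_NAME = re.compile(r"^DISK(\d)(\d{2})(\d{2})$")


def port_of(disk_name):
    """Return the port (channel) number encoded in a disk name."""
    match = _DISK_NAME.match(disk_name.upper())
    if match is None:
        raise ValueError(f"unrecognized disk name: {disk_name!r}")
    return int(match.group(1))


def shared_ports(members):
    """Return the sorted ports that carry more than one member of a set."""
    counts = Counter(port_of(name) for name in members)
    return sorted(port for port, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Members on ports 1 and 2: no port is shared.
    print(shared_ports(["DISK10000", "DISK20100"]))
    # Both members on port 1: that port is reported.
    print(shared_ports(["DISK10000", "DISK10100"]))
```

An empty result means no two members of the set sit on the same channel, which is the layout described above for M1.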

I concur with the practice of mucking about using the FAILEDSET and SPARESET as ways to reconstitute mirror members on-line.
Sr. Systems Janitor
Uwe Zessin
Honored Contributor

Re: HSG80 ema failed disk replacement

But you don't need to rip a disk drive from a perfectly operating mirror-set and risk data loss! Just add the desired disk drive as a 3rd member to the set:

> set M1 nopolicy
> set M1 membership=3
> set M1 replace=DISK20100
> set M1 policy=best_performance

wait until the copying to DISK20100 has finished

> reduce DISK20000
> add spareset DISK20000 (or whatever...)
.
Jon Pinkley
Honored Contributor

Re: HSG80 ema failed disk replacement

While I agree 100% with Uwe's procedure to keep the auto spare drive from coming into play, especially when the purpose is to make a snapshot (the device being reduced), I don't think anyone has specifically answered Daniel's question. He said a drive had failed and was in the failedset; the question was whether it is OK to pull that failed drive without regard to other activity on the backend SCSI bus the failed drive is on.

Ideally you have another drive you can add to the mirror until there is a time when a stalled transaction won't be an issue; if you have a spareset, this may already have been done. Delete the drive from the failedset, then press the button, wait for the lights to flash, pull the drive and if you can replace quickly, put the replacement drive back in. That's what the button is for. Can you get away without using it? Most of the time the controller will detect the event and retry the operations, though it will probably complain about an unexpected event on the bus. I've replaced a drive without quiescing the bus, but it wasn't intentional; it was carelessness. I was replacing the drive at a low-activity time, so perhaps I was lucky.

As for the question about how quiescing the bus affects the production data flow, I can't say, since I never had activity going on that I was attempting to measure. Quiescing a channel by definition stalls all I/O to the devices on that channel. So at least in the case of a JBOD device that isn't part of a RAID1 or RAID5 set, when a read is issued to that device and the data is not in the HSG80 cache, I can see no possibility but to stall that I/O. It is less clear what will happen if the I/O is to a mirrorset where other members are available and writeback cache is available and enabled for the virtual unit. In that case, the I/O theoretically would not have to be stalled. But I don't know how the HSG80 firmware handles that case, and I don't have one to test on.

Uwe, do you know the answer to the original questions?
it depends

Re: HSG80 ema failed disk replacement

Thanks for the feedback. I suppose that, unlike the Smart Array and MSA environments, older technology such as the HSG80 cannot handle drive removal from a drive shelf as well, hence the need to "stall" activity. I have never had problems NOT quiescing the bus prior to removing a drive on an HSG80, but maybe I should change my procedures in the future!

thanks again!
Jon Pinkley
Honored Contributor

Re: HSG80 ema failed disk replacement

You did ask "What is the best practice for replacing a failed disk in an HSG80/EMA environment."

I doubt that the HSx controllers are more likely to have problems than MSA or Smart Arrays when you just pull a device or put in a new one. Perhaps the HSx controllers are just engineered better to allow you to do it with risk closer to zero by stopping the activity on the bus.

Is it ok to plug a flat tire? Many people do so without any problems and use that as a permanent fix, yet some people would never consider it. It depends a bit on your perspective of what acceptable risk is.

Some people will never plug a SCSI device into an unused (non-shared) SCSI controller while power is on. That is probably best practice, but I do just that when I need to change a SCSI tape drive, and I can't shut the system down. I suppose there is a small chance that it could cause a hardware problem or conceivably even a system crash, although I have never had a problem. I do use antistatic straps when I do it, and I do have the tape drive powered off, but it still isn't "best practice".

Uwe is in a much better position than me to talk about what the risks are.
it depends
Robert Atkinson
Respected Contributor

Re: HSG80 ema failed disk replacement

Daniel, we had to do this a number of times on our HSG's, so I've attached the procedure guide we wrote.

Hope it helps.

Rob.
Ruud Dijt
Advisor

Re: HSG80 ema failed disk replacement

We still have 5 EMAs with dual HSG80s and a lot of disks, so sometimes a disk fails and you have to replace it. We have been running them for 8 years now.

I agree with Jon Pinkley: always quiesce the I/O channel containing that drive and then remove the drive.
But I do not agree with Jon on
"if you can replace quickly, put the replacement drive back in"

If you do this and the HSG controllers are quite busy, they sometimes don't recognize the new disk and you have to do a remove/insert a second time. So I always do this in 2 steps and quiesce the I/O channel 2 times: the first step to remove the old disk, the second step to insert the new one.
By the way, we have at least one spare on each shelf and never place RAID 1 or RAID 5 members on the same shelf. With policy best_performance, the auto replacement never took a spare from a shelf that already held another member.
Jon Pinkley
Honored Contributor

Re: HSG80 ema failed disk replacement

I always do it in two steps, because I was never able to let the old drive spin down and remove it before the bus got rescanned.

So I agree with Ruud Dijt, do it twice, once to remove, once to add. I know that works.
it depends
Robert Atkinson
Respected Contributor

Re: HSG80 ema failed disk replacement

Just to add to this, the official line is that you MUST allow at least 30 seconds for the array to spot and account for the change, both when removing the disk and when replacing it.

So, you should NEVER try to swap the disk as quickly as possible, in a single movement.

Rob.
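
To pull the advice in this thread together, here is a minimal sketch of the two-step swap as an ordered checklist. It is an illustration only: the function name `two_step_swap_plan` is hypothetical, the steps are plain-text operator reminders rather than controller commands, and the 30-second figure is simply the minimum quoted above.

```python
SETTLE_SECONDS = 30  # minimum wait quoted in the thread after each change


def two_step_swap_plan(disk, settle_seconds=SETTLE_SECONDS):
    """Return the ordered operator steps for replacing a failed disk.

    The steps are plain-text reminders, not controller commands.
    """
    if settle_seconds < SETTLE_SECONDS:
        raise ValueError(
            f"settle time must be at least {SETTLE_SECONDS} seconds"
        )
    wait = (
        f"wait at least {settle_seconds} seconds "
        "for the controllers to register the change"
    )
    return [
        f"confirm {disk} is in the failedset and no rebuild is running",
        f"delete {disk} from the failedset",
        f"quiesce the channel that holds {disk}",   # first quiesce: removal
        f"remove {disk} from the shelf",
        wait,
        "quiesce the same channel again",           # second quiesce: insertion
        f"insert the replacement for {disk}",
        wait,
    ]


if __name__ == "__main__":
    # Print the checklist as numbered steps for one example disk.
    for number, step in enumerate(two_step_swap_plan("DISK20100"), start=1):
        print(f"{number}. {step}")
```

Keeping removal and insertion as separate quiesced steps, each followed by its own wait, reflects the experience reported above that a quick single-movement swap is sometimes not recognized by busy controllers.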