ProLiant Servers (ML,DL,SL)

CL380 and CR3500 drive failures

Joey Albert
Advisor

I posted a similar query a few weeks ago. We have a CL380 with a CR3500 controller running RAID 5 with a hot spare (that is, 5 drives live and 1 hot spare dark). We attempted to upgrade the storage from 36 gig drives (which have worked for us for a long time) to 72 giggers. On the first attempt, we built the array and, once it was up and running, began restoring from tape. 80% of the way through, drives 3 and 4 died. With two drives down and only one hot spare, the server crashed. Both failed drives showed RED LIGHTS. We called HP/COMPAQ and, after three days, finally got the two replacement drives.
This weekend, we put in the replacement drives, built the array again and did a restore. All seemed to go well. We let a day or two pass. This morning, I came in and saw that the third drive had died again (this is one of the replacements HP/COMPAQ sent), with our hot spare kicking in. Now we are on borrowed time, hoping that another drive does not fail, or we are up a creek. I tried unseating the failed drive and re-seating it, and it is still dark, indicating no power to it (dead drive?).

I am currently on hold (phone) with HP/COMPAQ (have been for about 30 minutes now) to call for ANOTHER replacement drive that will hopefully get here tomorrow before another drive goes south (which I'm storming the heavens won't happen)...

I find it VERY unusual that three of the HP/COMPAQ drives died. These are:
Part number: 289042-001, Part description: HP SPS-DRV, 72 Gig, Ultra320, 10,000 RPM hard drives. I know that drives fail, but three out of four? Could this be a bad slot (number three) on the CR3500? I know that there are backplane issues with the DL380, the close brother of the CL380, but I don't think the DL380 uses the CR3500 that the CL380 does.

The CR3500 has the latest firmware, and the drives all have the same HPB5 firmware version. We have installed the latest COMPAQ ROMPAQs for the CL380. I do not think this relates to our server OS, which is NetWare 5.1 SP7, but I'll throw it out there anyway.

Anyone have professional conjectures on what is going on?
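
For readers following along, here is a minimal sketch of why a second drive failure takes the array down. It assumes generic RAID 5 behavior (one drive's worth of parity, one hot spare that rebuilds after the first failure). The function names are made up for illustration and nothing here models the CR3500 firmware itself.

```python
def raid5_usable_gb(active_drives: int, drive_gb: int) -> int:
    """RAID 5 spends one drive's worth of capacity on parity."""
    return (active_drives - 1) * drive_gb


def simulate(events, spares=1):
    """Walk a list of 'fail' / 'rebuilt' events and report the array state.

    Hypothetical model: the array tolerates one missing drive. A hot spare
    starts a rebuild after a failure; until 'rebuilt' arrives, the array
    is degraded and a further failure loses it.
    """
    degraded = False
    rebuilding = False
    for event in events:
        if event == "fail":
            if degraded:
                return "array lost"
            degraded = True
            if spares > 0:
                spares -= 1
                rebuilding = True
        elif event == "rebuilt" and rebuilding:
            degraded = False
            rebuilding = False
    return "degraded" if degraded else "healthy"


print(raid5_usable_gb(5, 36))            # five live 36 gig drives
print(raid5_usable_gb(5, 72))            # five live 72 gig drives
print(simulate(["fail", "fail"]))        # second failure before the spare finishes
print(simulate(["fail", "rebuilt"]))     # spare finished rebuilding
```

The point of the model is that the hot spare only helps if its rebuild completes before the next drive drops out.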
7 REPLIES
JohnWRuffo
Honored Contributor

Re: CL380 and CR3500 drive failures

If new drives are failing on the shared storage cage over and over in the same drive slot, I would certainly suspect the backplane board.

You can order the Shared Storage Interconnect Board (Spare Part Number 402583-001) from the Parts Store for like $110.00.
http://partsurfer.hp.com/cgi-bin/spi/main?sel_flg=modinfo&model=PROCL380

Here is a pic of the board I suspect:
http://partsurfer.hp.com/cgi-bin/spi/showphoto?partnumber=402583-001
Enjoy!
__________________________________________
Was the post useful? Click on the white KUDOS! Star.

Do you need help with your HP product?
Try this: http://www.hp.com/support/hpgt
Joey Albert
Advisor

Re: CL380 and CR3500 drive failures

Thanks John.

Not 1 hour later, another drive on another slot (slot 2) failed.

So the first time, two drives on slots 3 and 4 failed.

I replaced those drives with the two replacements that HP/COMPAQ sent, in the exact same slots (3 and 4), and built the array from scratch. This morning the drive in slot 3 failed (kicking in the hot spare). Not an hour later, the drive in slot 2 failed, taking my production server down.

HP is sending me a new drive (which I doubt is bad, because what are the chances that 4 out of 8 drives go bad?), a new power supply for the shared storage cage AND a new interconnect board for the shared storage cage (which looks like it might be the part you linked).

Looks more and more like the backplane/interconnect board, but how weird is that? Now I have my old 36 giggers back in there and it is fine. Of course, with the six 36 giggers (the old drives), 1, 2, and 3 are HOT, 4 is the spare, and 5 and 6 are hot. In the new config with the 72 giggers, 1 through 5 are hot and 6 is the spare.

I checked the specs on both the 36 and 72s and they have identical power consumption ratings. If it is the power supply, then it should fail regardless of whether the 36s or 72s are in.
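
To put a rough number on how unlikely four bad drives out of eight would be by pure chance, here is a back-of-the-envelope sketch. It assumes the drives fail independently, and the 1% per-drive failure chance is a purely hypothetical figure picked for illustration, not anything from HP's specs.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail: chance that at least k of n independent drives fail,
    given each fails with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


P_FAIL = 0.01  # hypothetical per-drive failure chance, for illustration only

chance = prob_at_least(4, 8, P_FAIL)
print(f"Chance of 4 or more failures out of 8: {chance:.2e}")
```

Under those assumptions the result is well under one in a million, which is why a common cause (backplane, power, firmware, or drive compatibility) is a far better bet than four independently bad drives.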

JohnWRuffo
Honored Contributor

Re: CL380 and CR3500 drive failures

Joey:

Yes, I would suspect the 36GB drives will fail soon too. I hope you have a real good backup!
Enjoy!
Joey Albert
Advisor

Re: CL380 and CR3500 drive failures

John:

I doubt they'll fail because these are the exact same drives that we've been using for about 3 years now.

I tried the upgrade a month ago to the 72s and they failed. I put these same 36s back and we've been okay up till last Saturday when I tried the upgrade to the 72s again.

Never had a failure on the 36s on the same backplane/interconnect board. Perhaps because the 4th is dark (hot spare) while, with the new 72s, the 6th was dark? It may be that the actual 4th slot has a problem that could cause the backplane/interconnect board to fail randomly.

It's weird, but I guess it's not really comparing apples to apples since they are not the same drives and they are not configured the same way.

I am almost curious to find out what happens if I configure the 36s to have the sixth slot as the dark spare instead of the fourth slot --- or perhaps the new 72s to have the fourth slot dark hot spare instead of the 6th slot. That's a different story, though.
Joey Albert
Advisor

Re: CL380 and CR3500 drive failures

John:

I found out from HP a couple of things. The drives we were recommended were Ultra320s but, in actuality, only the SCSI-Ultra3s are certified compatible with the CL380.

Additionally, our CR3500 did not have the latest firmware (it had X2Q [27 Sep 00] instead of X2R A [5 Mar 02]).

HP's revision history states:
» version X2R A (5 Mar 02)
Updated the firmware to optimize the negotiation transfer parameters for certain drives.
Files contained in this SoftPaq:
x2r-1.fdi
» version X2Q (27 Sep 00)

It does state that the newer version was "Updated the firmware to optimize the negotiation transfer parameters for certain drives" but does not say which drives those 'certain drives' are (do they include the U320s?).
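
Since the version letters alone are easy to misread, here is a small sketch that compares the two revision lines quoted above by their release dates. The line format is taken from the revision history as pasted in this post. The helper name and regular expression are my own, and this is not an HP tool.

```python
import re
from datetime import datetime

# Matches lines such as "version X2R A (5 Mar 02)" from the pasted history.
REVISION_RE = re.compile(r"version\s+(.+?)\s*\((\d{1,2} \w{3} \d{2})\)", re.IGNORECASE)


def parse_revision(line: str):
    """Return (version name without spaces, release date) for one history line."""
    match = REVISION_RE.search(line)
    if match is None:
        raise ValueError(f"not a revision line: {line!r}")
    name = match.group(1).replace(" ", "").upper()
    released = datetime.strptime(match.group(2), "%d %b %y").date()
    return name, released


installed = parse_revision("version X2Q (27 Sep 00)")
latest = parse_revision("version X2R A (5 Mar 02)")

needs_update = installed[1] < latest[1]
print(f"installed {installed[0]} ({installed[1]}), latest {latest[0]} ({latest[1]})")
print("firmware update available" if needs_update else "firmware is current")
```

Note that the month abbreviations are parsed with the default English locale.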

We are playing it conservatively. HP support (Phil - an outstanding support person) will do a straight swap: Six Ultra320 drives for Six SCSI-Ultr
JohnWRuffo
Honored Contributor

Re: CL380 and CR3500 drive failures

Wow... fantastic info Joey; Thank you for the feedback. Hehe; figures HP will give you the best answer to your troubles.

G'luck with the replacements!
Enjoy!
Joey Albert
Advisor

Re: CL380 and CR3500 drive failures

Everything is great now. After two failed attempts at upgrading to the Ultra320s (you CANNOT get Ultra3 drives anymore, even if your life depended on it), upgrading the firmware on the CR3500 to the latest build made it work perfectly with the Ultra320s (even though they are not spec certified by HP to work with the CL380s). I had the Enterprise Cluster team (escalations) in Canada working on this with me for two weeks. They built an identical server with the same specs and drives, including the Ultra320 drives. They pounded on the drives for a week and saw no failures. They then downgraded the firmware, which led to server lockups but no drive failures. I bit the bullet (it was either that or pay for another cluster server plus install time) and upgraded over the weekend. 72 hours have passed and the drives are holding up. Hallelujah.