
Smart Array 641 Issues (Proliant ML350 G4)

VanRaily
Occasional Contributor

Hello-

 

I'm running an old ML350 G4 server with Windows SBS 2003 and a Smart Array 641 controller on firmware 2.34 B (I know that there is newer firmware but until just recently I've had no problems).  I've experienced the following sequence of events twice now in the past month:

 

Event Log

--------------

19:26 - The device, \Device\Scsi\cpqcissm1, did not respond within the timeout period.

19:27 - Drive Array Physical Drive Status Change. The physical drive in Slot 3, SCSI Port 1 Drive 5 with serial number "XXX", has a new status of 3.

19:27 - Physical Drive on DEVICE ID 5 on Port 1 of Array Controller in slot 3 has failed. Failure Code: 0x07

19:45 - Environment Abnormality Auto Shutdown (EAAS) initiated due to thermal reasons, either resulting from the system overheating, or from the loss of cooling.

 

Smart Array Log

----------------------

19:27:13 - SCSI bus fault occurred on Storage Box box 0, , Port 0 of    Array Controller  in slot 3.    This may result in a "downshift" in transfer rate for one or more hard drives on the bus. 

19:27:13 - Physical Drive on DEVICE ID 5 on Port 1 of    Array Controller  in slot 3 has failed.    Failure Code: 0x07.

19:27:13 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 0 to status code 3.

20:07:59 - The Event Notification driver Cpqcisse.sys of    Array Controller  in slot 3 has started.

20:08:29 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 3 to status code 4.

20:08:29 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 4 to status code 5.

21:25:00 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 5 to status code 0.
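
For anyone who wants to pull the logical drive transitions out of a longer Smart Array log, here is a minimal Python sketch. It assumes the log lines have exactly the layout pasted above (including the irregular spacing); the names `PATTERN` and `status_transitions` are made up for this example, and it deliberately does not try to interpret what each status code means.

```python
import re

# Sample lines copied from the Smart Array log above (whitespace as logged).
LOG = """\
19:27:13 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 0 to status code 3.
20:08:29 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 3 to status code 4.
20:08:29 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 4 to status code 5.
21:25:00 - Logical Drive 2 of    Array Controller  in slot 3 has changed from status code 5 to status code 0.
"""

# Assumed line layout; \s+ absorbs the variable spacing around "Array Controller".
PATTERN = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) - Logical Drive (\d+) of\s+Array Controller\s+in slot (\d+) "
    r"has changed from status code (\d+) to status code (\d+)\.",
    re.MULTILINE,
)


def status_transitions(text):
    """Return (time, logical_drive, slot, old_code, new_code) tuples in log order."""
    return [
        (time, int(drive), int(slot), int(old), int(new))
        for time, drive, slot, old, new in PATTERN.findall(text)
    ]


for row in status_transitions(LOG):
    print(row)
```

Listing the transitions this way makes it easy to see that Logical Drive 2 leaves status code 0 at 19:27:13 and returns to it at 21:25:00.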

 

As indicated by the 19:45 event, the server overheats and then restarts.  It's worth noting that the overheating only happens once in a while, but whenever it does, it accompanies this drive failure.  After it automatically restarts it restores that failed drive and everything is fine.

 

It's also worth noting that I get the cpqcissm1 event ("The device, \Device\Scsi\cpqcissm1, did not respond within the timeout period.") maybe once or twice a week.

 

Now, for a little history:

At the beginning of this year one of my hard drives truly did fail and I replaced it with a "new" one (I say that because it was used but new to the server).  That fixed the failed drive, obviously, but ever since that point the cpqcissm1 events began appearing.  This may be coincidence and not consequence, but it could also be indicative of a bigger problem--a problem that these recent events are revealing. (Also, the original failed drive is not the same one as the one from the recent events).

 

I'm more of a server administrator by necessity than by education, so I've come here for your advice: what do you think is going on?  I'm pretty sure that there's a hardware issue, but judging from these events I don't know whether it's a hard drive, the SCSI controller or the Smart Array controller (or perhaps something else entirely).

 

If it helps, I'll give a quick rundown of the RAID setup:

Logical Drive 1: 3x 72.8 GB, all on firmware HPB3 (it was one of these drives that originally failed)

Logical Drive 2: 3x 146 GB, all on firmware HPB4 (it is one of these drives that fails in the recent events)

 

Thanks for any help you can provide.

 

EDIT: I'm looking into buying a replacement controller just in case but I'm coming across two different part numbers: 291966-B21 and 305414-001.  Which one should I be getting?

7 REPLIES
PGTRI
Honored Contributor

Re: Smart Array 641 Issues (Proliant ML350 G4)

hi,

 

Can you please attach an ACU report from the server? Please update ACU to the latest version first.

 

regards,

How to Say Thank You? Just click the KUDOS!
VanRaily
Occasional Contributor

Re: Smart Array 641 Issues (Proliant ML350 G4)

Will do.  In the meantime I'd like to get a spare controller anyway, so do you know which is the correct part number (291966-B21 or 305414-001, or does either work)?

PGTRI
Honored Contributor

Re: Smart Array 641 Issues (Proliant ML350 G4)

hi,

 

The correct spare should be:

 

305414-001 Smart Array 641 Ultra320 Controller with 64MB cache - 64-bit, 133MHz, PCI-X PC board - Does not include a cache module or a battery

 

regards,

VanRaily
Occasional Contributor

Re: Smart Array 641 Issues (Proliant ML350 G4)

Thanks again.  This might be a dumb question, but I can just use the cache module from my existing card, correct?  There aren't any issues with transferring the module between controllers?

PGTRI
Honored Contributor

Re: Smart Array 641 Issues (Proliant ML350 G4)

hi,

 

Sure, you can use the cache module from the old controller, provided it is not defective.

 

regards,

VanRaily
Occasional Contributor

Re: Smart Array 641 Issues (Proliant ML350 G4)

I have that ACU diagnostic report ready, but before I publish it I need to ask: is there anything sensitive in it that I should remove first?  I gave it a quick look and nothing seems like it needs to be hidden, but I want to make sure first.

 

Also, in other news, I looked at those cpqcissm event log entries again, and the period in which the controller goes "unresponsive" is also the time that we run a backup of our largest database (>12 GB).  I've confirmed this by moving the backup time by about an hour and a half; the cpqcissm events followed to that new time as well.  I don't think that this is a coincidence.  Could an extended period (about 15 minutes) of I/O-heavy operations make the controller slow enough to respond that it triggers this event?
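
The check I did by eye can be sketched in a few lines of Python: given the timestamps of the cpqcissm timeout events and the backup start time, list the events that fall inside the backup window. This is only an illustration. The function name `events_in_backup_window` is made up, the 15-minute default comes from the rough backup duration mentioned above, and the dates and the backup start time in the sample data are placeholders, not values from my logs.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def events_in_backup_window(event_times, backup_start, duration_minutes=15):
    """Return the event timestamps that fall within the backup window (inclusive)."""
    start = datetime.strptime(backup_start, FMT)
    end = start + timedelta(minutes=duration_minutes)
    return [t for t in event_times if start <= datetime.strptime(t, FMT) <= end]


# Placeholder data for illustration only (not taken from the real event log).
events = ["2000-01-01 19:26", "2000-01-02 03:10"]
hits = events_in_backup_window(events, "2000-01-01 19:20")
print(hits)
```

If nearly every timeout event lands inside the window, and keeps doing so after the backup is rescheduled, that supports the idea that the backup load is what triggers the event.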

gregersenj
HPE Pro

Re: Smart Array 641 Issues (Proliant ML350 G4)

There's nothing sensitive in the ADU report.

 

Do I understand correctly that the problem occurred after the disk replacement?

If so, it could be a bad spare drive.

 

The System Management Homepage is your best friend.

On the SMH you can read the drive statistics in an easy-to-read format.

Check all the drives, but pay particular attention to Drive ID 5 and the previously failing drive.

 

I recommend upgrading the BIOS, firmware, and drivers.

The newer array controller firmware includes fixes for bus downshift problems, among others.

 

Also, consider reseating the previously failing drive.

 

BR

/jag