ProLiant Servers (ML,DL,SL)

Data Loss After PFA Alert!

John Townley
Occasional Visitor

A curious thing happened to us yesterday...

One of our clusters has several RAID 1 and RAID 5 arrays connected to a RAID Array 4100.

One of the disks in the RAID 5 array was reporting a PFA alert. The drive statistics confirmed this, showing 2892 hard read errors and 1 recovered write error.

The event log on the NT cluster repeatedly reported "The device, \Device\ScsiPort3, did not respond within the timeout period."

However, before we could replace the disk, the Oracle database on the cluster reported an OS error 21 and took datafile 67 offline. The cluster service also reported "Cluster disk resource 'Disk L' did not respond to a SCSI inquiry command." and "Cluster resource 'Disk L' failed."

My guess is that, because of the disk errors, NT was unable to access the RAID array for a short period, and during that time Oracle timed out and failed.

The Oracle error is the upsetting part, as we have had to do media recovery on a 150 GB Oracle database.

We replaced the 36 GB disk with a new one, and the array has rebuilt OK.

Finally then, my question... Why, in the situation described above, have we effectively lost data? The reason we used a RAID 5 array was to provide fault tolerance in the event of a disk failure. Does this imply something else is wrong (other than the duff disk)?
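
To be clear about what I was expecting from RAID 5: as I understand it, each stripe carries a parity block that is the XOR of its data blocks, so the contents of any one missing disk can be recomputed from the survivors. Here is a minimal Python sketch of that idea. The block contents and the helper name xor_blocks are made up for illustration, and this is not how the controller firmware is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk RAID 5 set: three data blocks plus parity.
data = [b"ORA1", b"ORA2", b"ORA3"]
parity = xor_blocks(data)

# Simulate losing the disk that holds the second data block.
survivors = [data[0], data[2], parity]

# XOR of the surviving blocks and the parity gives back the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

So on paper a single bad disk should cost us nothing, which is why I suspect the timeouts rather than the parity scheme itself.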

Smoke me a kipper skipper I'll be home in time for tea