Operating System - HP-UX

Database corruption with failed drive and lvsync

 
Greg OBarr
Regular Advisor

I had a disaster two weeks ago that resulted in database corruption and data loss. We were trying to attach another disk tray to an FC RAID controller online (no power-down, with the database running on RAID on other disks on the same controller). The trouble was that each disk tray has an ID that determines the IDs its disks present to the RAID controller, and the ID of the new tray was the same as that of one of the existing trays (we didn't know to check this).
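The check that was missed here amounts to confirming that no two trays on the same controller share an ID before the new one is cabled in. A minimal sketch of that idea, assuming the tray IDs have already been read off the enclosures by hand (the function name and the example IDs are hypothetical, not taken from any vendor tool):

```python
from collections import Counter


def find_duplicate_ids(tray_ids):
    """Return the sorted tray IDs that appear more than once.

    tray_ids is a hypothetical list of enclosure IDs, one per tray,
    including the tray that is about to be attached.
    """
    counts = Counter(tray_ids)
    return sorted(tray_id for tray_id, n in counts.items() if n > 1)


# Illustrative values only: three existing trays plus a new tray set to ID 1.
existing = [0, 1, 2]
new_tray = 1
conflicts = find_duplicate_ids(existing + [new_tray])
if conflicts:
    print("Do not attach: duplicate tray IDs", conflicts)
```

An empty result means every tray ID is unique; anything else should be fixed on the new tray before it is connected.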

OK, that was the first stupid mistake. The database was running on mirrored logical volumes, and when the new disk tray was brought online on the one RAID controller, the controller freaked out and its volume went offline. The database (Oracle) continued to run without issue, reading and writing to the other side of the mirror.

When this happened, I immediately powered the RAID controller down, took out the new disk tray, called the manufacturer, and found out about the ID conflict. So I left the new disk tray disconnected and powered up the controller and disks again, as they were originally. One of the disks in the RAID was marked "bad" and was automatically replaced by the online spare, and the volume started rebuilding from parity.

I don't recall doing it, but I must have run lvsync while the RAID was rebuilding. Apparently, something in this combination caused data corruption. The database was still running throughout this process. Strangely, I see some messages in the syslog file about SCSI write errors and it looks like the FC connection was going up and down while the lvsync was running.

Anyway, I ended up with corruption in some Oracle data files and some archive logs were corrupt, so I couldn't restore back to the current time from backup.


Has anybody ever seen this before? What did I do wrong (aside from getting up that morning)?

See attached section of the syslog file during the time this occurred.
Navid Hussain_1
Advisor

Re: Database corruption with failed drive and lvsync

Hi,

It is preferable to schedule downtime when adding a JBOD, to avoid any problems. Then check from ODE that the new addition is visible.

"Remember - Precaution is always better than cure"

Granted, the ID conflict was there. Even so, after powering down the new JBOD, you should not run any commands while the rebuild operation is in progress. Once the rebuild has finished, you need to take other steps...

Anyway, we learn from mistakes...

Cheers ..

NH

Greg OBarr
Regular Advisor

Re: Database corruption with failed drive and lvsync

Yes, definitely a lesson learned here. If the IDs hadn't been in conflict, everything would have been smooth though. I turned a situation where I was looking for NO downtime into a considerable amount of downtime.

I was trying to piece together how this data corruption occurred across the mirror. I'm thinking that when the drive tray was plugged in on the RAID on one side of the MirrorDisk/UX mirror, it corrupted the data on the RAID volume there. About 30 seconds elapsed before the RAID controller alarm went off and the RAID was disabled there. I think that during that window Oracle and/or the OS could have read many data blocks from that volume and then written them back to both sides of the mirror, corrupting data on both mirror copies. Then there is the fact that the RAID controller was rebooted and started rebuilding while the mirror was resyncing, possibly with corrupt data on the stale drive.
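The read-then-write-back theory above can be illustrated with a toy model. This is only a sketch of the hypothesis, not of how LVM or the RAID controller actually behave: the class, the block contents, and the choice of which side services the read are all assumptions made for illustration.

```python
class MirroredVolume:
    """Toy two-way mirror: a read is served by one copy, a write goes to all copies."""

    def __init__(self, blocks):
        self.copies = [list(blocks), list(blocks)]

    def read(self, block, side):
        # Assumption: the mirror layer trusts whichever copy it reads from.
        return self.copies[side][block]

    def write(self, block, data):
        # A mirrored write updates every copy with the same data.
        for copy in self.copies:
            copy[block] = data


def simulate():
    vol = MirroredVolume(["good"] * 4)

    # Hypothesis: the ID conflict silently damages block 2 on side 0
    # during the window before the controller takes the volume offline.
    vol.copies[0][2] = "garbage"

    # The application reads block 2 from the damaged side, modifies it,
    # and writes it back; the write lands on both sides of the mirror.
    data = vol.read(2, side=0)
    vol.write(2, data + "+update")
    return vol.copies


side0, side1 = simulate()
print(side0[2], side1[2])
```

In this model the undamaged copy ends up holding the bad data too, which is the point of the theory: once a corrupt block has been read and rewritten, the mirror no longer has a clean copy of it to fall back on.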