04-18-2018 04:26 PM
RAID 5 on Proliant DL380 G7 - Double HDD failure
Could someone please help me? This situation is quite critical for me.
I'm an IT enthusiast, but all my knowledge comes from self-learning since 1999 rather than from formal training. The company I work for is a subsidiary of a global company based in Belgium, and our ICT department is based there. Because of my knowledge, I provide local IT support together with my manager, who is also an advanced IT user. We both have admin rights, but not the authority to manage our local server or Meraki switches, for example.
Long story short: I've been accused of causing data loss (only local, fortunately, since our local machine synchronises with a datacentre based in Belgium, thank God!). This has cut our work output by about 50% for 3 days, as most of our files are located on the server HDDs.
This is the sequence of events:
- 8 am: as I arrive at work, the office is still empty. One member of our workshop team reports that he cannot access any files. I go to check the server, and he follows me.
- I see that, out of 8 drives, one (I'll refer to it as drive A) has a static amber light and another (B) is completely off, with no LEDs lit. (At this point I should have taken a photograph, and stupidly I didn't.) The remaining 6 drives display a static green light only.
- I hot-swap the drive with the static amber light (A), plugging it into an empty slot. Nothing happens; the LEDs do not light up. Our ICT had sent replacement drives earlier, so I replace the failed drive with a new one (the last spare we had). Again nothing happens; no LEDs light up.
- I hot-swap the offline drive (B), exchanging it with a healthy drive (C) in slot 3, just to rule out any problem with the connector.
- Drives A, B, and C are now offline. All others display static green.
- I call our ICT department to report the problem, and they ask me to plug a keyboard and a monitor into the server so I can restart it.
- The O/S does not respond; it cannot be restarted by pressing F12.
- ICT asks me to do a hard restart (i.e. hold the power button for a few seconds).
- Now everything looks OK: all drives are flashing green simultaneously, but we still can't access any data.
- My manager decides to call our managed services supplier and asks them to send one of their technicians. The technician sees that the RAID 5 configuration is a complete mess because I hot-swapped the drives earlier.
- At this point I realize I've made a big mistake: I wrongly assumed that the drives would synchronize regardless of their position, and I was also unaware of the RAID 5 configuration. It was stupid, I know.
- I get blamed for the local data loss because I hot-swapped more than one drive. I'm immediately told that, because of this, they cannot rebuild the array and recover the data.
- ICT creates a virtual server so we can stream data from the datacentre, but this takes 3 days, causing some delays to our work.
- I start my own investigation and send a formal email to ICT explaining that I made a mistake whilst trying to help; that is a fact. However, my actions did not cause any data loss, because 2 drives had already failed overnight at the same time, and a RAID 5 array can only survive the failure of a single drive. Whatever I did afterwards, it was already too late.
- ICT tells me that this is not true, and that hot-swapping more than one drive caused the array misconfiguration and the consequent data loss. But they cannot technically prove it; they only have my statement and the opinion of our managed services supplier's technician.
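To illustrate why I believe two simultaneous failures were already fatal, here is a minimal Python sketch of how single-parity RAID 5 recovers data. It is only a model under simplifying assumptions, not how the Smart Array controller is actually implemented: it handles one stripe, keeps the parity block in a fixed last position (real RAID 5 rotates parity across the drives), and the names `xor_blocks`, `make_stripe` and `rebuild` are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return one stripe: the data blocks followed by their XOR parity block.

    Simplification: parity always sits last; real RAID 5 rotates it per stripe.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe):
    """Rebuild a stripe in which failed drives are marked as None.

    Returns the repaired stripe, or None when more than one block is
    missing, because a single parity block cannot recover two unknowns.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        return None
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # The XOR of all surviving blocks equals the single missing block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Hypothetical 8-drive array: 7 data blocks plus 1 parity block per stripe.
data = [bytes([i] * 4) for i in range(1, 8)]
stripe = make_stripe(data)

one_failed = list(stripe)
one_failed[0] = None                    # like drive A failing on its own
print(rebuild(one_failed) == stripe)    # a single failure is recoverable

two_failed = list(stripe)
two_failed[0] = None                    # drive A
two_failed[1] = None                    # drive B failing at the same time
print(rebuild(two_failed))              # nothing can be rebuilt
```

In this model, the stripe is unrecoverable as soon as the second block goes missing, before any drive is moved between slots. It says nothing about what the controller recorded, which is why I still need the actual logs.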
A few notes:
- On the day before the failure, I noticed degraded performance shortly before we finished for the day, but I didn't check the server as we were too busy, and it is not really my job to monitor the server; I just try to help as required.
- The machine is quite old: it was handed to us 5 years ago, and it was already second-hand.
- We've been sent 3 spare drives since 2016, and the machine had already experienced 2 drive failures (separate events). This event is the 3rd failure.
- My manager has previously alerted ICT to the server's unreliability, its performance degradation, and its frequent HDD failures.
- I'm on a forum; I don't need to lie here, and there's no way I would waste my time writing this massive text if I knew I was at fault. Nevertheless, I'm aware of my mistake. Lesson learned: it's not my job. I love IT, but I won't touch any other machines they put here.
- My manager fortunately believes me; he's actually fed up with the ICT team because they've ignored his warnings about the machine's reliability.
In summary, I would like to know whether there is a way to see a log with the date and time of both HDD failures. The machine has been replaced and it will soon return to Belgium, but I would like to prove my point before it goes.
This is because, despite my email, they keep insisting it is my fault. As far as I'm aware, there is no consequence for my actions because these cannot be proved; it is basically my word against theirs. Hence a log would definitely help.
Thanks in advance.