The Legend of the ‘Punctured Stripe’

Where I work, we have a dedicated IIS web server we'll call IISServ. IISServ is configured with a RAID 1 array and a single non-RAID disk for extra log storage. A while back, IISServ was moved down to a new spot in its rack. Magically, after the move, the computer started reporting I/O and CRC errors on drive C...

I called Dell and told them about the errors, and they had me run a diagnostic on the whole RAID controller. The diagnostic reported errors on disk 0, the first disk in the RAID 1 array. Dell shipped me a new disk and I swapped it out while thinking about how easy the whole process was.

A few days later we received more I/O and read errors that were canceling our shadow copy and remote backup process for IISServ. After I ran some updates, IISServ failed to reboot properly and hung somewhere while booting into the OS. I removed the RAID 1 drive that had not just been replaced, and the computer booted normally. After the computer booted, I re-inserted the bad drive. I called Dell again and they had me run another diagnostic on the RAID array. This time errors appeared on the second drive. Dell sent me another good drive, and I hot-swapped it into IISServ.

So, at this point I have two new drives in the computer. After a day or two I was still receiving errors and the system state backup was still failing. I ran a chkdsk. Chkdsk reported some errors, so I scheduled some downtime to do a chkdsk /r. When I ran chkdsk again after the chkdsk /r, a few sectors had been flagged as ‘bad’. Great news, right? STILL the errors persisted, and I assumed the only problem left could be the RAID controller itself. The server is under warranty, so back to Dell I went again.

Dell had me run the diagnostic for the controller again, and this time BOTH of the new drives in the RAID 1 showed errors! Dell had me run a reporting tool and send them some log information from the RAID controller. They scanned the logs and found that disks 0 and 1 (the two in the RAID 1) were both reporting bad sectors in the exact same places. Dell forwarded me up to their manager, or specialist, or whatever they have, and he explained his version of the problem. He called it a ‘Punctured Stripe’, which nobody in my office, including me, had ever heard of. He explained that the error was on the logical layer of the array and not on the physical disks, and that consistency checks are the preventative maintenance for this. He went on to explain that there was no application or process for repairing it, and informed me that the only way to fix the problem was to RECREATE THE ARRAY from scratch. That’s right, destroy the entire array and all the data and start over. I guess I should have run more consistency checks, but even then, he admitted, this problem can still occur.
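To get my head around how two brand-new disks could report bad sectors in the exact same places, here is a tiny Python sketch of how I understand the idea. It is a toy model, not Dell's controller logic: the `rebuild_mirror` function, the 8-block disk, and the bad block numbers are all made up for illustration. The assumption is that when a rebuild cannot read a block from the surviving mirror member, the same block ends up marked bad on the replacement disk.

```python
def rebuild_mirror(source_blocks, unreadable):
    """Toy model of a RAID 1 rebuild (hypothetical, not real controller logic).

    Copies every block from the surviving disk to a replacement disk.
    Any block that cannot be read from the source has no good copy
    anywhere, so it is marked bad on the target as well.
    """
    target = []
    marked_bad = set()
    for lba, data in enumerate(source_blocks):
        if lba in unreadable:
            target.append(None)      # nothing valid to copy
            marked_bad.add(lba)      # the bad spot is carried over
        else:
            target.append(data)
    return target, marked_bad


# Hypothetical 8-block mirror: disk 1 has two unreadable blocks.
disk1 = [f"block-{i}" for i in range(8)]
disk1_bad = {2, 5}

# First swap: disk 0 is replaced and rebuilt from disk 1.
new_disk0, bad0 = rebuild_mirror(disk1, disk1_bad)

# Second swap: disk 1 is replaced and rebuilt from the new disk 0.
new_disk1, bad1 = rebuild_mirror(new_disk0, bad0)

# Both brand-new disks now report the same bad blocks.
print(sorted(bad0 & bad1))
```

In this toy model, swapping hardware never clears the bad spots, because each rebuild copies them from whichever disk is left. That at least matches what I saw: new drives, same errors.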

So I did what any good tech does when he thinks he hears a bunch of baloney: I Googled it. “Punctured Stripe” in quotes came up with about 4 results. One guy was ranting about how people use RAID 1 as their backup solution, which it is not intended to be. Another search found someone explaining that RAID 1 is only effective at protecting against complete drive hardware failure, and is susceptible to passing corrupt data between drives. I even found someone posting under the thread title ‘RAID1 is useless’, ranting about having the same problem we have. I also went so far as to call someone who does hard drive recovery and ask him if he had ever heard of a ‘Punctured Stripe’. He said he had never heard of it, but he had heard of corrupt data being replicated on the logical layer of a RAID 1, effectively ruining the entire array and requiring it to be recreated.

So here I am, stuck with a busy web server that cannot back up the system state with shadow copy and cannot be repaired without destroying the entire RAID array the OS sits on. I am trying to figure out how this happened, how we could have prevented it, and what good RAID 1 is if corrupt data can ruin the array so easily. From the looks of it, RAID 1 isn’t such a great redundancy solution!

Update 6/14/07 - Found that Dell released a utility for dealing with this problem.