Tips and Tricks
A place to store useful info I don't want to lose




ANATOMY OF A DISK HARDWARE FAILURE

By John L. Joseph
Diskeeper Development Section, Executive Software

INSIGHT eLetter - Volume 7, Issue 10, November 2002

Last week, I went through some hard times and I thought I would share them (and the final solution) with you. My main development system, running Windows XP, used an IBM DTLA-307045 IDE/ATA 46.1GB disk drive, manufactured October 2000, as the master on IDE controller 0. It had several partitions: the first partition (C:, ~6GB) was an NTFS system and boot volume* with 4096-byte clusters, and the second partition (D:, ~10GB) was an NTFS boot volume with 16K-byte clusters. (I decided to go with 16K clusters on the D: drive since Windows XP supports defragmentation of volumes with cluster sizes up to 64K bytes.)

As is my practice (and as I have recommended in our newsletter), I used the D: volume as my primary boot, and used the C: volume as a backup boot. The rest of the disk was a volume containing my development tools and source. On Wednesday, I came in and flipped my system on. Windows XP took an awfully long time to go through its boot sequence, and sat at the screen with the pulsing progress bar for about 10 more minutes than usual. Then a blue-background screen appeared and told me "Inaccessible boot device". Gulp.

I shooed everyone out of my office and began the diagnosis. Yes, the drive was installed. Yes, the BIOS saw it. Yes, the BIOS disk geometry parameters were right. Reboot. "Inaccessible boot device". This was not good. Of course, I have a backup boot, so I booted to it. It took a long time, but it came up! The drive with my sources on it was accessible, so I copied the recently modified files to the server and began looking further. Any attempt to access the D: drive was met with a four-pulse buzzing sound and a locked-up system for about 10 minutes, then I could use the C: boot again.

In fact, the sound was just like this: http://www.geocities.com/bontemps4/drive_noise.mp3

At this point I couldn't afford to dawdle over the disk drive; I had to have a workable development environment operational on that machine right away to get the salvaged code into the day's build. I spent the rest of the day reinstalling software onto a new drive that wasn't having hardware problems.

Later (much later!) the next day I had enough time to look into the problems on the drive. I connected the drive to a totally different machine (as a slave drive, not the system drive) and got the same symptoms. The D: drive simply would not mount, and every file on it was apparently lost. But I don't give up that easily. I copied a tool called DSKPROBE to the machine I was working on and set to work finding out exactly why this particular partition wouldn't mount.

DSKPROBE.EXE is a marvelous little tool provided by Microsoft and I got my copy from the Windows 2000 support tools:
http://support.microsoft.com/default.aspx?scid=kb;en-us;301423
It's also available in the Windows NT4 Resource Kit. The neat thing about DSKPROBE is that it can read "raw" drives; accessing the "unmountable" drive presented no problems.

The boot sector of an NTFS volume contains pointers to the MFT and the MFTMirr. Both of these areas on the volume contain NTFS metadata, and I knew that the problem had to be in the metadata, because otherwise the drive would mount. I actually suspected that the problem was in the $Volume file (NTFS metadata file 3), because that's a key file for determining the volume parameters.

But, being methodical, I first made sure that the first 1024 sectors on the volume could be read; they could. This meant a key piece of data needed for booting XP was accessible. Then I tried reading the first 1024 sectors of the MFT (by using the MFT pointer in the boot sector). Again, all accessible. Then I tried reading 1024 sectors of the disk starting at the MFTMirr location. BANG! It failed. (And I had to sit through another ten minutes of the drive making that dreadful noise.)
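For readers who want to find the same two boot-sector pointers without DSKPROBE, here is a minimal Python sketch that pulls them out of a raw 512-byte NTFS boot sector. The field offsets used (bytes per sector at 0x0B, sectors per cluster at 0x0D, MFT and MFTMirr starting clusters at 0x30 and 0x38) follow the commonly documented NTFS boot sector layout. The class and function names are made up for this illustration, the sample values are synthetic rather than taken from the drive in this story, and the sketch assumes the sectors-per-cluster byte is a plain count.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class NtfsBootInfo:
    """Hypothetical container for the few boot-sector fields used here."""
    bytes_per_sector: int
    sectors_per_cluster: int
    mft_lcn: int       # starting cluster of the MFT
    mftmirr_lcn: int   # starting cluster of the MFTMirr

    @property
    def mft_first_sector(self) -> int:
        # Sector number relative to the start of the volume.
        return self.mft_lcn * self.sectors_per_cluster

    @property
    def mftmirr_first_sector(self) -> int:
        return self.mftmirr_lcn * self.sectors_per_cluster


def parse_ntfs_boot_sector(sector: bytes) -> NtfsBootInfo:
    """Extract the MFT and MFTMirr pointers from a raw NTFS boot sector."""
    if len(sector) < 512:
        raise ValueError("need at least 512 bytes of boot sector")
    if sector[3:11] != b"NTFS    ":
        raise ValueError("OEM ID is not 'NTFS    '; not an NTFS boot sector")
    # Little-endian: 16-bit bytes per sector, then 8-bit sectors per cluster.
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<HB", sector, 0x0B)
    # Two consecutive 64-bit logical cluster numbers.
    mft_lcn, mftmirr_lcn = struct.unpack_from("<QQ", sector, 0x30)
    return NtfsBootInfo(bytes_per_sector, sectors_per_cluster, mft_lcn, mftmirr_lcn)


def build_demo_boot_sector() -> bytes:
    """Build a synthetic boot sector (made-up values, 16K clusters)."""
    sector = bytearray(512)
    sector[3:11] = b"NTFS    "
    struct.pack_into("<HB", sector, 0x0B, 512, 32)    # 32 * 512 = 16K clusters
    struct.pack_into("<QQ", sector, 0x30, 786432, 2)  # made-up MFT / MFTMirr LCNs
    return bytes(sector)


if __name__ == "__main__":
    info = parse_ntfs_boot_sector(build_demo_boot_sector())
    print("MFT starts at volume sector", info.mft_first_sector)
    print("MFTMirr starts at volume sector", info.mftmirr_first_sector)
```

On a real system the 512 bytes would come from reading the first sector of the volume with raw-device access, which is exactly what DSKPROBE does for you.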

When I got control of the machine back, I started narrowing down which sector was causing the problem by using a "bracketing" technique. You know: I read in 512 sectors at the same starting address and the problem didn't occur. Then I read in 750 sectors and it did. So I read in 600 and it didn't. And so forth. I finally ended up with sector 632 being the culprit. Any attempt to read sector 632 resulted in the error. Reading sector 631 or 633 didn't produce a problem.

So I looked at the contents of sectors 631 and 633 and saw that they contained pieces of the volume bitmap. (The NTFS file $Bitmap contains a bitmap of the usage of clusters on the volume.) So I read in sector 631, changed the buffer's contents to have all bits turned on (to indicate that all clusters in that range are in use), and told DSKPROBE to write the buffer to sector 632 instead. Poof. It wrote without error. Then I exited DSKPROBE and went back to the Disk Management tool and asked it to rescan the disks.
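The "bracketing" described above is essentially a binary search on how many sectors can be read from a fixed starting address before the read fails. A minimal Python sketch of the idea follows. It runs against a simulated disk, and the names find_first_bad_sector, make_fake_reader and read_ok are hypothetical, not part of DSKPROBE or any real tool.

```python
from typing import Callable, Iterable, Optional


def find_first_bad_sector(
    read_ok: Callable[[int, int], bool], start: int, count: int
) -> Optional[int]:
    """Return the lowest sector in [start, start + count) that breaks a read.

    read_ok(start, n) must report whether reading n sectors beginning at
    `start` succeeds. Returns None if the whole range reads cleanly.
    """
    if read_ok(start, count):
        return None
    good = 0      # reading this many sectors is known to succeed
    bad = count   # reading this many sectors is known to fail
    while bad - good > 1:
        mid = (good + bad) // 2
        if read_ok(start, mid):
            good = mid
        else:
            bad = mid
    # Reading `bad` sectors fails but `bad - 1` succeeds, so the last
    # sector of the failing read is the culprit.
    return start + bad - 1


def make_fake_reader(bad_sectors: Iterable[int]) -> Callable[[int, int], bool]:
    """Simulated disk: a read fails if it touches any sector in bad_sectors."""
    bad = set(bad_sectors)

    def read_ok(start: int, count: int) -> bool:
        return not any(start <= s < start + count for s in bad)

    return read_ok


if __name__ == "__main__":
    reader = make_fake_reader({632})
    print(find_first_bad_sector(reader, 0, 1024))
```

For a 1024-sector range this needs roughly ten more reads after the first failing one, which matters when every failed read locks the machine up for about ten minutes.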

Sure enough, the problem drive now mounted without problems, and all files on it were accessible.

This story went down this way because I kept my cool, preserved the evidence (the "bad" disk), and figured out what I was looking at each step of the way. If this problem had happened during the installation of a piece of software, or while running a word processing program, it would have been very easy to claim that the installation or the word processor had "caused" the problem. And if I'd panicked and reinstalled everything, the vexatious sector would have gotten overwritten (just as I overwrote it with DSKPROBE) and the blame would have been laid incorrectly on the shoulders of the installation program or the word processor.

So why did this happen? I guess I should have been watching the Internet a little more closely, or I might have picked up this article: http://www.tech-report.com/onearticle.x/3035  Apparently I'm not the only one experiencing failures on this model of drive, and it looks like I was lucky to have the drive survive two years; others didn't get that much life out of theirs. Regardless, the key point is that I was able to recover everything and isolate the real problem primarily by keeping my cool.

And I have yet another postscript for those loyal fans who wrote in to tell me that Windows XP is immune from 1023-cylinder problems. See this Knowledge Base article:
http://support.microsoft.com/default.aspx?scid=kb;en-us;282191

* (definitions taken from the Windows 2000 online glossary)

Boot Partition: The partition that contains the Windows 2000 operating system and its support files.  The boot partition can be, but does not have to be, the same as the system partition.

Boot Volume: The volume that contains the Windows 2000 operating system and its support files. The boot volume can be, but does not have to be, the same as the system volume.

System Partition: The partition that contains the hardware-specific files needed to load Windows 2000 (for example, Ntldr, Osloader, Boot.ini, Ntdetect.com). The system partition can be, but does not have to be, the same as the boot partition.

System Volume: The volume that contains the hardware-specific files needed to load Windows 2000. The system volume can be, but does not have to be, the same volume as the boot volume.



© Copyright 2002 Eric Hartwell.
Last update: 12/3/2002; 9:26:02 PM.



"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
— Sherlock Holmes to Dr. Watson in "The Adventure of the Copper Beeches" by Arthur Conan Doyle. 


"I like deadlines," cartoonist Scott Adams once said. "I especially like the whooshing sound they make as they fly by."


"There is nothing like that feeling of spending days and days banging your head against a wall trying to solve a programming problem then suddenly finding that one tiny obscure and seemingly unrelated piece of the puzzle that unlocks the solution. Oh yeah!"

- Chris Maunder, CodeProject Newsletter 28 Jan 2002


"Management at eSnipe, which is me, is also feeling the pain of the 2002 bear market. So rather than pout about it, I bought some stuff on eBay that I really didn’t need, but made me feel better."

- Tom Campbell, president of eSnipe