Coming Home to Windows Home Server, Part 5
Mar 19, 2008 12:00 PM, Eric B. Rux
WHS File-Corruption Problem: Why the Long Wait for the Fix?
Nearly three months have passed since Microsoft announced that Windows Home Server (WHS) had a file-corruption problem. Microsoft has dutifully kept us in the loop as it learns more about the problem and has mentioned that we might see a fix in June (at the earliest)--a full 6 months after the company acknowledged the bug. You might be asking, “What the heck is taking so long!?”
When I first heard about the problem, I wasn’t concerned. Our friends in Redmond deal with these kinds of problems all the time; how long could it take? How complicated could the fix be?
To find out the reason for the long wait, I searched online for some details and later even contacted Microsoft’s WHS PR firm to get some answers. I knew that WHS was built upon Microsoft Small Business Server (SBS) 2003, and SBS was built on top of Windows Server 2003. Neither of these platforms suffers from the file-corruption problem, so why is WHS vulnerable?
My first clue occurred when I logged on to the server via Remote Desktop. Instead of seeing a blank desktop, I was immediately faced with a stern warning: “Many standard Windows Server administration tools available from this desktop can break Windows Home Server. Read the Release Documentation before you use any tool on this desktop and proceed with caution.”
At first, I didn’t believe the warning. Because this product is aimed at everyday home users, I assumed this was Microsoft’s way of reducing the number of support calls it received. So, instead of logging on to the server, I reverted to using the console for safely performing administrative tasks.
Why can’t we use standard administrative tools such as Disk Manager in WHS? Again, the answer lies in WHS’s target customer: the average home user. When it comes to disks and disk management, Microsoft wanted to introduce a different paradigm from classic RAID. WHS needed to let users add hard disks and remove older, smaller disks, all while presenting to that user one set of network shares that appeared to grow and grow. WHS also needed a way for users to protect important files in case of a hard disk failure.
To accomplish this feat, Microsoft came up with a new technology called Windows Home Server Drive Extender. This technology takes care of the important “behind the scenes” work that happens when a user adds or removes an internal or external disk drive. WHS servers with single hard disks have it easy: The disk’s first 20GB is partitioned out for the OS, and the rest is used for data storage. This data-storage area is called the primary data partition, and WHS simply stores files there.
As soon as the user adds a second hard disk, the story gets a little more interesting. The files that were located on the primary data partition become small 4KB “pointers” (Microsoft calls them “tombstones”), and the system copies the actual files to the additional drive. For added protection from a hard disk failure, you can enable Protected Storage to ensure that your important files reside on two physical disks. The file will appear to live in the share on the primary data partition, but in reality the system stores the file on a separate disk or disks. This sleight of hand gives the appearance that the primary data partition just keeps growing larger and larger.
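To make the tombstone idea a little more concrete, here is a toy model in Python. It is only a sketch of the behavior described in this article, not Microsoft's implementation: the names `Disk` and `ToyPool`, the "least-full disk first" placement rule, and the fallback to the primary disk for a second protected copy are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    name: str
    files: dict = field(default_factory=dict)  # file name -> contents


class ToyPool:
    """Toy model of tombstone redirection; every name here is hypothetical."""

    def __init__(self, primary):
        self.primary = primary   # primary data partition
        self.extra = []          # disks added after the first one
        self.tombstones = {}     # file name -> names of disks holding the data

    def _disk(self, name):
        return next(d for d in [self.primary, *self.extra] if d.name == name)

    def write(self, name, data, protected=False):
        if not self.extra:
            # Single-disk server: the file simply lives on the primary partition.
            self.primary.files[name] = data
            return
        # Drop any older copies so a rewrite never leaves stale data behind.
        for disk in [self.primary, *self.extra]:
            disk.files.pop(name, None)
        # Assumed placement rule: prefer the added disk holding the fewest files.
        candidates = sorted(self.extra, key=lambda d: len(d.files))
        if protected and len(candidates) < 2:
            candidates.append(self.primary)  # assumption: second copy on primary
        targets = candidates[:2 if protected else 1]
        for disk in targets:
            disk.files[name] = data
        # The tombstone records where the real data now lives.
        self.tombstones[name] = [d.name for d in targets]

    def add_disk(self, disk):
        self.extra.append(disk)
        # Files that lived on the primary partition are moved and become tombstones.
        for name in list(self.primary.files):
            if name not in self.tombstones:
                self.write(name, self.primary.files.pop(name))

    def read(self, name):
        # A reader asks the pool for the file and is redirected by the tombstone.
        if name in self.tombstones:
            return self._disk(self.tombstones[name][0]).files[name]
        return self.primary.files[name]
```

In this toy, a file written to a one-disk pool sits on the primary disk; after `add_disk` it moves to the new disk and only a tombstone entry remains, yet `read` returns the same contents. The real redirection has to coordinate with NTFS and the rest of the OS, which this sketch does not attempt to model.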
If all this sounds complicated, that’s because it is. Microsoft has a 20-page technical brief that thoroughly breaks down the technology. But even at this basic level, you can see that there must be a lot going on in the inner workings of the server to ensure that it sets the tombstone correctly and copies the actual data to the correct drive. This “redirection mechanism,” as Microsoft calls it, has the bug that’s causing the data corruption. The Microsoft article “When certain programs are used to edit or transfer files that are stored on a Windows Home Server-based computer that has more than one hard drive, the files may become corrupted” states the following: “A bug has been discovered in the redirection mechanism which, in certain cases, depending on application use patterns, timing, and workload, may cause interactions between NTFS, the Memory Manager, and the Cache Manager to get out of sync.” In other words, if the conditions are just right, something goes haywire while the system is updating the tombstone and the actual file. Instead of good, clean data being written to the file, the file ends up corrupted.
For this reason, it’s very important that you don’t use tools such as Disk Manager to try to manipulate the hard disks. It’s also important to use caution with Windows Explorer when you’re logged on to the server. To be safe, be sure to access the files via the network share (e.g., \\server1\public, \\server1\software). This way, you’ll connect to the tombstone files, and WHS Drive Extender will take care of everything in the background for you.
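If you script against your server, a small guard can help you stick to that advice. The following is a minimal sketch in Python that only checks whether a path is written as a network share (UNC) path rather than a local drive path. The function name `is_network_share_path` and the local path used in the usage lines are hypothetical, and the server name server1 is simply the example from this article.

```python
from pathlib import PureWindowsPath


def is_network_share_path(path: str) -> bool:
    """Return True if the path is a UNC path (server and share), not a drive letter."""
    # For a UNC path, PureWindowsPath reports the server and share as the "drive",
    # which begins with two backslashes; a local path reports something like "D:".
    drive = PureWindowsPath(path).drive
    return drive.startswith("\\\\")


# Usage: the first path goes through the share, the second touches a local disk.
print(is_network_share_path(r"\\server1\public\notes.txt"))   # True
print(is_network_share_path(r"D:\shares\public\notes.txt"))   # False
```

The check is purely textual, so it runs on any OS and never touches the server; it says nothing about whether the share actually exists.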
Now that you know a little bit about how this “redirection mechanism” works, you’ll have a better understanding of why the fix is taking so long. But you don’t have to be happy about it.