Thursday, April 9th:
Easter time. Which means a lot of crap food, annoying company (my family is no fun at times like these) and no servers as far as I can see. I pack up my trusty MacBook Pro and head out into the world (a 2 hour trip by bus, train, subway and bus again). I arrive safely (rats, didn't get mugged today either, I soo want my MacBook to get stolen... not.) and quickly reserve a place at a table for my laptop, a Bluetooth mouse, enough power to keep my laptop running for days and a comfy chair. I boot it up into Windows 7, connect to the WLAN and make sure Hamachi (private VPN) is running. Oh joy, all my machines at home are online and I am back in business again.
I connect by Remote Desktop and start working on my "workstation" as I always do, repacking stuff, downloading new stuff etc. Sure I miss the humming noise of my server and there is a definite lag when I do stuff, but it's better than nothing. I spend the evening playing Team Fortress 2 with my nephew and talking to the family.
Friday, April 10th:
Aaah, sweet morning. Beautiful weather (I hate it, too many reflections on the screen!), it's really spring weather. As on most days I start by booting up the laptop (I never turn off the machines at home) and logging on to my server at home. Everything seems fine, the workstation has finished doing what it was supposed to and I am thinking about downloading a few games to play (legally owned, of course!).
After a few hours everything stops working all of a sudden. I get disconnected and I notice that after a short time all my nodes disconnect from Hamachi. My website isn't accessible anymore and everything seems dead.
I don't panic tho, it has happened before. I've worked out a theory that computers are indeed alive and have emotions, because more than once, after months of flawless operation, the network has decided to die as soon as I leave the apartment for a few days. When I am at work it always works (it knows I only leave for a few hours), but when I leave on other occasions it gets worried and apparently gets a digital stroke or something.
So I tell my sister's husband that I need to make a detour for a few hours to get home and restart (read: talk softly to the server, calm it down and make promises it will get taken care of (bribing it with a new RAM upgrade etc works too)) the system. He says that he needs to get his Jaguar out on the street anyway, so he offers to drive me to my apartment and back again. Great! We head off and about one hour later I am at home. After a quick ping I realise that for some reason my Linux router had a stroke. This is quite surprising since it's been running flawlessly for several years now. (Footnote: it's a Pentium 4 3GHz with dual gigabit cards and one dual channel fibre channel adapter, running Smoothwall; it handles my routing and DNS stuff.) But I reboot the machine (takes about 10 seconds) and everything is working again. Ping works and my machines are connected again. Yay!
So we drive back, I reconnect and things seem to be working again. The rest of the day is quite uneventful, I spend more time talking to the family, playing Team Fortress 2 with my nephew and even spend a few hours with GTA4 (the MacBook is amazingly powerful when it comes to games!). During all this time I keep my workstation at home busy with boring stuff such as packing up new stuff, creating par2-files etc.
Later in the evening a TBG member (The Beta Group, another beta community site for those of you that are unaware of it) told me on MSN that he couldn't access the FTP. After a quick look it appeared that his home folder was inaccessible for some strange reason. The folder was there, but Windows refused to access it. Ah well, just another quick run of chkdsk I guessed, sometimes Windows gets confused with folders and it's usually fixed by chkdsk. So I ran it, but Windows said it needed exclusive access to the drive and that it could only be run during boot up (hello Microsoft, ever heard of DISMOUNTING VOLUMES!?). So I did what I've done a lot of times before. I set the machine to reboot. The server runs itself, so it reboots, scans the drive, fixes it, and then kickstarts all the services back into action again. Shouldn't take more than 10-15 minutes tops. The server node on Hamachi should light up when the server is running again, so I wait.
20 minutes later still nothing.... (long chkdsk...)
After 30 minutes I start to wonder what's wrong... I ping the network and the router is still responding, so the network didn't die on me. Ah well, nothing I can do except wait... By this time it's dinner time so I leave the laptop running.
About one hour later when I return I see that the server still isn't responding. Hamachi doesn't light up, and I can't access the server by running Remote Desktop. I still have connectivity with my workstation, and it can ping the server, but not connect to it.
Ah well, as long as I can access the workstation I'll be fine. Sure my email server will be down, and all my files will be out of reach. But I can still keep my workstation downloading stuff and pack up stuff for my return. All of a sudden the workstation goes offline. So do my gameserver and gaming machine. I forgot that my server is handling all the DHCP requests, so when it's offline the other machines' leases eventually expire and they drop off the network.
No machines at all! The server must have gotten stuck on some stupid service or something. I am not going to spend 3+ hours getting home and another 3+ hours getting back (it's a holiday, remember, public transportation is very troublesome at these times). But I'll live, I got my laptop and I got a few games loaded, so I spend the rest of the days using only that. Still hoping for a miracle tho (watching for that little green icon in Hamachi lighting up. Of course it never happens.)
Monday, April 13th:
I finally get home. It's been a stressful few days (aren't these events supposed to be stress free and enjoyable? No, everyone is running around at warp speed buying this and that and preparing for everything... capitalism at its finest...).
My shoes have not touched the ground yet and I am half way across the apartment heading for the server. First thing I notice is an error message... "Could not write to E$, please run chkdsk" (or something like that, the E drive is btw the drive the TBG user folders are stored on). Ah well, another one of those situations where Windows gets ahead of itself and does an improper shutdown. At this point I just wished Windows (this is after all Windows Server 2008 - a server product!) had some kind of feature where it forced itself to shut down if a service got stuck. I.e. "wait for the service to shut down - kill the service forcefully after x minutes of failed clean shutdown" etc. Then it would have eventually rebooted anyway, and I could always have checked for any errors in the event log. But no, there it sat waiting. No wonder it never rebooted... stupid popup window...
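For what it's worth, the "ask nicely, wait x minutes, then kill" behaviour wished for above is easy to sketch for an ordinary process. This is only an illustration in Python, not how Windows Server actually shuts down its services; the function name stop_with_timeout and the 5 second grace period are made up for the example.

```python
import subprocess
import sys


def stop_with_timeout(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Ask a process to exit, then kill it if it has not exited in time.

    Returns "clean" if the first request was enough, "forced" otherwise.
    """
    # Polite request: SIGTERM on POSIX. On Windows terminate() is already a hard kill.
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "clean"
    except subprocess.TimeoutExpired:
        # Waited long enough: stop asking and force it.
        proc.kill()
        proc.wait()
        return "forced"


if __name__ == "__main__":
    # A stand-in "service" that would otherwise sleep for a minute.
    service = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_timeout(service))
```

The point is simply that the shutdown always finishes within a bounded time, instead of sitting behind a popup for three days.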
I clicked OK and after a few minutes the drive started to rattle again and the server rebooted. Yay! During the bootup it ran chkdsk and started to scan the E drive. It found a few errors but I wasn't worried since I had backups of everything (backups spread among other drives in case one drive failed), so I could easily see what it fixed and replace it with the backed up file. It cleared that drive, but then it started to scan another drive... and another... and it found quite a few errors... hmmm, sure I had not run chkdsk for a while, but I knew everything else worked since I accessed most stuff on a daily basis (through the backup app that synced the backups).
The scan took several hours, but finally it was finished, it rebooted and got into Windows. It started up all the services and I was ready to go (or so I thought). I checked the TBG user folders to see if they worked again, but I couldn't find the folders. I couldn't even find the drive. I noticed that several drives had swapped places (the contents of E: being on G: etc) and I even noticed that some drives had the same labels (i.e. two drives named "(Data) Misc" where there is only supposed to be one drive with that name). (Note: My server has over 20 hard drives connected.) After some looking around I noticed what I feared the most...
ALL DATA IS GONE!
Most folders are scrambled, 10 of the hard drives show that they are almost empty (whereas most drives are supposed to be 80-90% full), and some drives contain just a few restored folders. There are even some drives with mixed folders (folders from several drives all mixed on one drive). I realise that Windows must have for some reason tried to restore OLD data. Since some folders were mixed and labels were duplicated it must have tried to restore the contents I had before (I sometimes move all data from one drive to another if I need more space etc). But all the important data is gone! Over 10TB of data is missing! All my beta stuff. All my music, images, PDF documents, PC and Mac applications, emulators, heck, even the temp hard drives are "gone" and replaced by something else.
F _ _ _ !!!!
Since I made backups of the drives onto other drives there was almost always a backup of one drive stored on another drive. I.e. E: could be backed up to R:, and R: could be backed up to S: etc etc. If one single drive failed I could always restore the drive by using the backups. But now everything was gone, all the original data and all the backups.
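To spell out why this scheme survives a dead drive but not a server-wide accident, here is a toy model in Python. The drive letters beyond E:, R: and S: and the idea that the chain loops back to E: are assumptions made purely for the example, not my real layout.

```python
# Hypothetical layout: each drive's backup lives on another drive in the same server.
# E -> R and R -> S are from the example above; S -> E is assumed to close the loop.
BACKUP_OF = {"E": "R", "R": "S", "S": "E"}


def recoverable(drive: str, failed: set[str]) -> bool:
    """A drive's data survives if the drive itself or the drive holding its backup is intact."""
    return drive not in failed or BACKUP_OF[drive] not in failed


def lost_drives(failed: set[str]) -> list[str]:
    """Drives whose data is gone: the original and its backup copy both failed."""
    return sorted(d for d in BACKUP_OF if not recoverable(d, failed))


if __name__ == "__main__":
    print(lost_drives({"E"}))            # one dead drive
    print(lost_drives({"E", "R", "S"}))  # the whole server hit at once
```

With one failed drive nothing is lost, but when every drive in the chain is damaged by the same event, every original and every backup goes with it. The copies were never independent of each other.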
At this time I was thinking up ways to reduce Microsoft HQ into a handful of rubble. I started up the recovery software and started scanning the drives with the most important data.
And I have been doing this all day today as well.
It seems the data is still there. Problem is that the recovery software (I use GetDataBack which has served me well in the past) sees a lot of files. A LOT of them. But it doesn't see the filenames. So I got 100000+ files, all named "". Oh joy!
I am currently looking into trying to restore the previous file table (on NTFS that is the MFT, the master file table, not a FAT) and see if I can get the old folder structure back. I am also considering restoring the important files by hand. Since I add PAR2 files to all my folders I could simply restore everything on a drive, run the PAR2 checker and let it rename all the files for me. Then I can simply rename the folder and put it back into its place.
But the hard reality is that most of the data I will never recover. Fortunately I took backups of some stuff onto external drives (some games, the TBG server files, personal betas etc) so I didn't lose that data. But it's still hard, since I've spent years collecting some of it. And now it's gone only because Microsoft made a dodgy app that didn't even ask me what to do. It just assumed it did the right thing and fixed it, without even allowing me to restore back to the previous "state".
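The reason PAR2 can put names back on nameless files is that a PAR2 set records an MD5 hash together with the original filename for every file it protects. A rough sketch of the same idea in Python, without parsing PAR2 at all: it assumes I already have a dictionary called known that maps hashes to original names (where that comes from is left open), and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_names(recovered_dir: Path, known: dict[str, str]) -> dict[str, str]:
    """Rename recovered files whose hash appears in `known` (hash -> original name).

    Returns {old name: new name} for every file that was renamed.
    Files with unknown hashes are left alone.
    """
    renamed = {}
    for path in sorted(recovered_dir.iterdir()):
        if not path.is_file():
            continue
        original = known.get(md5_of(path))
        if original is None:
            continue
        target = recovered_dir / original
        if target.exists():
            continue  # never overwrite; also skips files that already have the right name
        path.rename(target)
        renamed[path.name] = original
    return renamed
```

This only helps for files that were recovered bit for bit. A damaged file hashes to something else and stays unnamed, which is where the actual PAR2 repair data would have to take over.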
I am going to spend a few more days on this problem and see if I can come up with some solution. After all the data is still there, just "hidden".
Lessons learned: (and people, REALLY pay attention to this if you got vital data you really don't want to lose...)
- Never ever trust Microsoft
- Never trust an antiquated file system (NTFS) to safely store your data. It lacks several important features that could have prevented this catastrophe, such as proper journaling (with ext4 and ZFS I can even choose at what journal point I want to restore the file system) and data protection and redundancy when writing to system areas.
- NEVER USE RAID. It doesn't protect your data, it only adds to redundancy in case hardware fails. But it never protects the data itself from corruption like this. I used RAID before and had a big data loss back then too, most of which I salvaged (it's gone now tho!)
- ALWAYS sync your data to an external drive, and keep it offline until you need to re-sync! I did this with some of my data and it's safe. Unfortunately mirroring 30TB of data is quite expensive which is why I went with the backup solution I used... I never expected the entire system to fail, just a drive or two occasionally.
I am still going to work on this problem... I already purchased 4 new 1TB drives to mirror all the data I've saved, and I have also gotten new external eSATA cases for them, all with independent power supplies (read my project page on my website and you'll know why this is important for me). It will cost me my other leg to afford it, but perhaps it will be another step towards preventing this kind of failure. I will set up a separate machine for my email server, and rebuild and reinstall the current server as a pure file server (perhaps I'll run Linux on it, perhaps a very streamlined Windows 2008).
I will restore the TBG files as soon as I can (check the TBG forums for updates) and I'll update this log with my progress.
If anyone has any ideas on how to restore "lost" files on a drive let me know. But from what I can see Windows has either corrupted or lost the current file table (the index of what is stored where on the drive) and tried to fix it by "recovering" files, which of course actually meant deleting the indexes off the drive. I will only touch the drives that are completely OK and I'll mirror them (doing that now). All redundant drives will be disconnected from the server, and I will most likely disconnect the "damaged" drives as well, preventing Windows from doing any more harm to them until I either find a way to restore the data, or until I give up and reformat them.
This has been a very long post (my longest so far), but perhaps you understand why. Unless some other catastrophe lands on me I'll keep the future logs a bit shorter.