Migration Party ... Take 2 - 3 Oct 2005
Mike Gnozzio

So, one important detail that Evan left out of his last message was that, after Ben and I restored the user database (by booting off the OS X CD and resetting some passwords), we noticed that one of Ursula's 3 disks wasn't doing any work, even though the System Monitor said that we were all systems go.

Accordingly, we thought we would confirm that Ursula's screech of death was a disk problem by physically removing one of the active disks. Here's the logic: Ursula sports 3 disks, connected to a hardware RAID controller and configured to use RAID 5. Now, if one disk in a 3-disk RAID 5 array goes down, the system can hobble along, reconstructing the missing info from the data on the other two disks. If, however, 2 disks are down, all the system can do is throw its hands in the air and give up. (There's a small illustrative sketch of how that reconstruction works at the end of this post.) So, we figured that if we removed one of the good disks, the system would become momentarily unusable, we'd have a positive diagnosis that the inactive disk was the problem, and we could then package it up and ship it off to Apple to be replaced.

Sure enough, when we removed one of the good disks, the system became unusable. However, much to our surprise, when we popped the good disk back in, the system remained unusable. In retrospect, this makes sense, but at 2 AM, we didn't realize what exactly was going on.

The original cause of the alarm actually had nothing to do with a bad disk or the user database getting deleted. While we were working in single user mode, Ben accidentally popped the hatch on one of Ursula's disks. The disk was never removed, so we didn't think this was a problem, but in fact, when a disk hatch is triggered, an XServe prepares the disk for removal, effectively powering down that disk. Naturally, this is no different from actually removing the disk.

So, when we closed the hatch, OS X began the process of rebuilding the missing disk (both as a measure to put the old data back in the event that we had actually replaced it, and to bring the new disk up to speed on what had changed while it was gone). The alarm was supposed to let us know that we were in a critical state where, if one more disk went down, all data would be lost. Therefore, removing a second disk was a really bad idea.

When the RAID controller detected that disk 2 was missing, the XServe realized that there was no way for it to continue reconstructing data on disk 1, and so it just gave up -- as expected. Unfortunately, and much to our surprise, it also flagged disk 2 for reconstruction when it next appeared -- a task it will never be able to complete, since disk 1 isn't in a usable state, having been interrupted in its own reconstruction.

Before we removed the disk, Evan made copies of the scripts we wrote yesterday, just in case something like this happened. Sadly, we don't know where those scripts are anymore. Evan thinks he might have put them in a separate place on Ursula, where we can no longer get to them.

On a more positive note, we now know how to turn off the alarm (and will do so shortly). We also have a much better idea now of how the migration should work.

To make a long story short, the last 24 hours basically never happened, and we need to try migrating things a second time. So, come to the WSO meeting this week and we'll talk about when and how we're going to migrate. Things are always easier the second time around, so take 2 will actually be fun!
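
For the curious, here's a minimal sketch (in Python, purely illustrative -- the block contents are made up and this is not what the hardware RAID controller actually runs) of why a 3-disk RAID 5 array can rebuild one missing disk but not two: each stripe stores a parity block that is the XOR of the two data blocks, so any single missing block can be recomputed from the other two.

    # Purely illustrative: one RAID 5 stripe on a 3-disk array, with two data
    # blocks and one parity block (parity = data1 XOR data2). On a real array
    # the parity block rotates across the disks stripe by stripe.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    data1 = b"user database..."   # made-up block living on disk 1
    data2 = b"home directories"   # made-up block living on disk 2
    disks = {1: data1, 2: data2, 3: xor_blocks(data1, data2)}  # disk 3: parity

    def read_stripe(disks, missing):
        """Return the stripe's data, rebuilding at most one missing disk."""
        if len(missing) >= 2:
            # With two disks gone, parity covers only one unknown, so the
            # array gives up -- exactly what Ursula did.
            raise RuntimeError("two disks down: stripe is unrecoverable")
        a = disks[1] if 1 not in missing else xor_blocks(disks[2], disks[3])
        b = disks[2] if 2 not in missing else xor_blocks(disks[1], disks[3])
        return a + b

    print(read_stripe(disks, set()))    # healthy array
    print(read_stripe(disks, {1}))      # degraded: disk 1 rebuilt from disk 2 + parity
    print(read_stripe(disks, {1, 2}))   # raises: all data lost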