Migration Party ... Take 2 - 3 Oct 2005
Mike Gnozzio

So, one important detail that Evan left out of his last message was that, after Ben and I restored the user database (by booting off the OS X CD and resetting some passwords), we noticed that one of Ursula's 3 disks wasn't doing any work, even though the System Monitor said that we were all systems go.

Accordingly, we thought we would confirm that Ursula's screech of death was a disk problem by physically removing one of the active disks. Here's the logic: Ursula sports 3 disks, connected to a hardware RAID controller and configured to use RAID 5. Now, if one disk in a 3-disk RAID 5 array goes down, the system can hobble along, reconstructing the missing info from the data on the other two disks. If, however, 2 disks are down, all the system can do is throw its hands in the air and give up. (There's a small illustrative sketch of how that reconstruction works at the end of this post.) So, we figured that if we removed one of the good disks, the system would become momentarily unusable, we'd have a positive diagnosis that the inactive disk was the problem, and we could then package it up and ship it off to Apple to be replaced.

Sure enough, when we removed one of the good disks, the system became unusable. However, much to our surprise, when we popped the good disk back in, the system remained unusable. In retrospect, this makes sense, but at 2 AM, we didn't realize what exactly was going on.

The original cause of the alarm actually had nothing to do with a bad disk or the user database getting deleted. While we were working in single user mode, Ben accidentally popped the hatch on one of Ursula's disks. The disk was never removed, so we didn't think this was a problem, but in fact, when a disk hatch is triggered, an XServe prepares the disk for removal, effectively powering down that disk. Naturally, this is no different from actually removing the disk.

So, when we closed the hatch, OS X began the process of rebuilding the missing disk (both as a measure to put the old data back in the event that we had actually replaced it, and to bring the new disk up to speed on what had changed while it was gone). The alarm was supposed to let us know that we were in a critical state where, if one more disk went down, all data would be lost. Therefore, removing a second disk was a really bad idea.

When the RAID controller detected that disk 2 was missing, the XServe realized that there was no way for it to continue reconstructing data on disk 1, and so it just gave up -- as expected. Unfortunately, and much to our surprise, it also flagged disk 2 for reconstruction when it next appeared -- a task it will never be able to complete, since disk 1 isn't in a usable state, having been interrupted in its own reconstruction.

Before we removed the disk, Evan made copies of the scripts we wrote yesterday, just in case something like this happened. Sadly, we don't know where those scripts are anymore. Evan thinks he might have put them in a separate place on Ursula, where we can no longer get to them.

On a more positive note, we now know how to turn off the alarm (and will do so shortly). We also have a much better idea now of how the migration should work.

To make a long story short, the last 24 hours basically never happened, and we need to try migrating things a second time. So, come to the WSO meeting this week and we'll talk about when and how we're going to migrate. Things are always easier the second time around, so take 2 will actually be fun!
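
For the curious, here's a minimal sketch (in Python, purely illustrative -- the block contents are made up and this is not what the hardware RAID controller actually runs) of why a 3-disk RAID 5 array can rebuild one missing disk but not two: each stripe stores a parity block that is the XOR of the two data blocks, so any single missing block can be recomputed from the other two.

    # Purely illustrative: one RAID 5 stripe on a 3-disk array, with two data
    # blocks and one parity block (parity = data1 XOR data2). On a real array
    # the parity block rotates across the disks stripe by stripe.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    data1 = b"user database..."   # made-up block living on disk 1
    data2 = b"home directories"   # made-up block living on disk 2
    disks = {1: data1, 2: data2, 3: xor_blocks(data1, data2)}  # disk 3: parity

    def read_stripe(disks, missing):
        """Return the stripe's data, rebuilding at most one missing disk."""
        if len(missing) >= 2:
            # With two disks gone, parity covers only one unknown, so the
            # array gives up -- exactly what Ursula did.
            raise RuntimeError("two disks down: stripe is unrecoverable")
        a = disks[1] if 1 not in missing else xor_blocks(disks[2], disks[3])
        b = disks[2] if 2 not in missing else xor_blocks(disks[1], disks[3])
        return a + b

    print(read_stripe(disks, set()))    # healthy array
    print(read_stripe(disks, {1}))      # degraded: disk 1 rebuilt from disk 2 + parity
    print(read_stripe(disks, {1, 2}))   # raises: all data lost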