Thursday, February 26, 2015

FreeBSD From the Trenches: ZFS, and How to Make a Foot Cannon

This month's story comes to us from Glen Barber, UNIX Systems Administrator.

The ZFS filesystem is well regarded for its robustness and extensive feature set.

Its robustness can be haunting, however, if a mistake is made.  I learned this the hard way through a seemingly innocent typo, a mistake I certainly will not soon repeat.

We use ZFS almost exclusively in the FreeBSD cluster.  I say "almost" because there is one remaining machine that does not use ZFS; it is simply too underpowered to handle it.

All machines are installed in a netboot environment while logged in at the serial console, providing the utilities necessary for extremely customizable installations.  Most of the installations I have performed on machines in the FreeBSD.org cluster have been pseudo-scripted, with subtle differences depending on the machine, such as whether the disks are da(4) or ada(4), the number of disks, how much space to allocate for swap, the number of ZFS pools, and so on.

For the most part, a basic installation would be done with a very simple sh(1) script that looks something like:

# for i in $(sysctl -n kern.disks); do \
  gpart create -s gpt $i; [...]; done
Nothing too fancy at all.

Most times I would copy/paste from an installation script I've used for years, other times I would manually type the commands.  It really depended on the desired end configuration.

When I installed the FreeBSD Foundation's new server, I typed the commands manually.  You might ask, "Why did you do it this way?"  To this day, I cannot answer that question.  But if I didn't, this story would be far less interesting.

The machine was installed like this, almost verbatim:
# for i in $(sysctl -n kern.disks); do \
  gpart create -s gpt /dev/${i}; \
  gpart add -t freebsd-boot -s 512k -i 1 /dev/${i}; \
  gpart bootcode -b /boot/pmbr \
  -p /boot/gptzfsboot -i 1 /dev/${i}; \
  gpart add -t freebsd-swap -s 16G -i 2 /dev/${i}; \
  gpart add -t freebsd-zfs -i 3 /dev/${i}; \
  done
# zpool create zroot mirror /dev/ada0 /dev/ada1
# for i in tmp var var/tmp var/log \
  var/db usr usr/local usr/home; do \
  zfs create -o atime=off zroot/${i}; \
  done
This creates the GPT partition scheme for all available hard disks, writes the partition layout to the disks, writes the GPT boot code to the first partition on each disk, and allocates the swap space and ZFS space.  Then it creates the ZFS pool named 'zroot' configured as a mirror, and creates the ZFS datasets in the new pool.

The problem is not obvious unless you are looking for it specifically: instead of using the 'freebsd-zfs' GPT partitions, which are /dev/ada0p3 and /dev/ada1p3, I created the pool on the full disks (/dev/ada0 and /dev/ada1).

Simple enough to fix, right?  Destroy the 'zroot' pool, destroy the GPT partition layout to be safe, and create it again with the correct arguments to 'zpool create'.

So, that's what I did.
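
For the curious, the fix looked something like this -- a sketch rather than the exact history, with the GPT recreation elided since it matched the loop above:

# zpool destroy zroot
# gpart destroy -F ada0
# gpart destroy -F ada1
# [recreate the GPT layout as before]
# zpool create zroot mirror /dev/ada0p3 /dev/ada1p3
Note the pool now sits on the 'freebsd-zfs' partitions, as it should have from the start.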

Luckily I wasn't ready to put this machine into production yet.  I still wanted to do some basic stress testing on the machine before moving anything critical to it.

Fast forward about a month.

After being satisfied that the machine did not have any obvious stability problems, such as faulty RAM, and after having lowered the relevant TTL entries in DNS, I decided to do one more upgrade on the machine before beginning the independent service migrations to the new machine.

This is where things started to go wrong.  Fast.

The source-based upgrade finished, and I rebooted the machine.  In another terminal attached to the serial console, I watched the machine proceed through the normal reboot routine: killing running services, syncing buffers, and so on.

After the machine completed POST routines, everything went dark.  The machine did not respond to serial console input, and as far as I could tell, this was not due to a change caused by the update.

I should note that, by nature, I am a paranoid sysadmin.  This is a good quality, in my opinion, because I habitually go out of my way to make sure any situation is recoverable if something goes wrong.  Suspecting I had done something wrong, I immediately began reviewing the history recorded while logged in at the console.  Nothing looked suspicious.  This upgrade should have "just worked."

I remotely power-cycled the machine, and booted into our netboot environment to investigate further.

I immediately knew something went wrong after importing the 'zroot' pool into a temporary location, and seeing several tell-tale signs.  For starters, /etc/rc.conf had a timestamp that predated the machine even being shipped to the colocation facility.  More confusingly, /usr/obj was empty, as if the 'buildworld/buildkernel'-style upgrade that took place less than an hour prior had never happened.

Then panic ensued.  The machine didn't panic -- I did.

Everything was gone.

Every configuration change since the initial install, every jail that was created, every package that was installed.  All of it.  Just gone.

While investigating, I sent a heads-up to the other cluster administrators in case there was an issue that affected other installations.  As the investigation progressed, Peter realized he had seen this exact behavior in the past, and provided an example scenario in which it could occur.

It was exactly what I had done - used the raw disks for the ZFS pool instead of the 'freebsd-zfs' GPT partitions.

So, what's the problem?

The problem is that 'zpool destroy' does not implicitly delete pool metadata from the disks, so as far as ZFS was concerned, there were two different ZFS pools, both named 'zroot', which confused the boot blocks just enough to import the wrong pool at boot.  Well, it didn't just import the wrong pool, it imported an empty pool.

Worse yet, because I had allocated the partitions in the order 'freebsd-boot', 'freebsd-swap', 'freebsd-zfs', and because 'freebsd-swap' was 16GB, the swap partition had more than enough space to hold on to the metadata from the pool I did not want to exist.  There was no way to force one pool to be chosen over the other and, worse, no way to tell which pool the loader would choose.
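
In hindsight, zdb(8) offers a way to spot this kind of leftover metadata: it can dump any ZFS labels present on a device.  This is a hypothetical check I did not run at the time, but something like the following would have revealed labels on both the raw disk and the partition:

# zdb -l /dev/ada0
# zdb -l /dev/ada0p3
If both commands print valid pool labels, there are two pools' worth of metadata competing on the same disk.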

The only good news at this point was that the machine was not yet in production.

How do you fix this, then?

Peter had a suggestion, since he had run into this before: reboot the machine into the netboot environment, and try to force the correct pool to be imported by forcibly removing all device entries for the disks and retrying the ZFS pool import.  This would be done by running:
# rm -f /dev/gptid/* /dev/diskid/* /dev/ada?
# zpool import -o altroot=/tmp/zroot zroot
Unfortunately, the wrong pool was imported again, most likely (though unconfirmed) because of the large amount of swap allocated on the disks.
# zpool status
     NAME        STATE    READ WRITE CKSUM
     zroot       ONLINE 
       mirror-0  ONLINE 
         ada0    ONLINE 
         ada1    ONLINE
Then I realized the partition table was also corrupt.

After several attempts to coerce the correct pool to import, I became increasingly uncomfortable with leaving the machine in this condition.  At this point, there was only one solution - wipe the disks and start over.

Ultimately, despite disliking the solution, that is what I did to correct the problem.  At the time, I was unaware of the 'labelclear' command to zpool(8), which would have wiped the ZFS pool metadata from the disks, but at that point I was not going to take any chances either way.
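
For anyone who hits this in the future, clearing the stale labels would look something like the following.  This is a sketch rather than what I ran; the -f flag forces the clear on a device that still looks like part of a pool:

# zpool labelclear -f /dev/ada0
# zpool labelclear -f /dev/ada1
Running it against every device that ever held a label, raw disks and partitions alike, is the cautious approach.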

The takeaway is that no matter how innocent a mistake may appear at first, when it involves metadata stored on disk devices, it will surely come back to haunt you sooner or later.

Wednesday, February 25, 2015

SCALE 13x Trip Report: Michael Dexter

The Foundation recently sponsored Michael Dexter to attend SCALE 13x. Michael provides the following trip report:

SCALE 13x was the 13th Southern California Linux Expo and took place February 19th through 20th in Los Angeles, California. Despite its name, this year's event showed sincere outreach to the BSD community, as demonstrated by two booths and several BSD-related talks. The first booth featured FreeBSD, the FreeBSD Foundation, FreeNAS, PC-BSD and pfSense, while the second featured OpenBSD and NetBSD. Both booths were filled with familiar faces, including Dru Lavigne, Denise Ebery, Matt Olander, James Nixon, David Maxwell, Brooke and Seth and two toddlers!

The FreeBSD Booth Crew (photo courtesy of iXsystems)

The variety of booth visitors was very familiar for SCALE: a mix of students, consultants, open source developers and military/aerospace contractors. I heard lots of "I got started on FreeBSD" and "I use FreeNAS," plus the occasional "When can we have a military-certified BSD so we can stop using Linux?" That last one is something I have heard at every SCALE I have attended, and it is representative of the region. Hats off to the SCALE organizers for attracting such a diverse audience.

The BSD-related talks included David Maxwell's newly-released pipecut, which he debuted at MeetBSD (https://code.google.com/p/pipecut/); Brooks Davis' talk on the BERI CPU that he is working on with Robert Watson; Dru Lavigne's talk on new FreeNAS 9.3 features; and my talk on FreeBSD Virtualization Options. There were also many overlapping talks, such as those on various system containers and embedded systems, and of course Brendan Gregg's talk on systems performance. Brendan kindly updated the Netflix statistics that I was already going to address, and both Bryan Smith and Randal Schwartz had great user questions. It truly was a pleasure to speak at SCALE, and my sincerest thanks to Brendan for live-tweeting my talk.

Impressively, some SCALE speakers were in their teens, and the overall outreach to kids was great, including an evening kids-only event. The BSD Certification Group scheduled a BSDA exam, but alas it was poorly attended. I humbly invite you to take the BSDA exam if you have not done so already, and ask that you help spread the word whenever you get a chance.

In a community where we often preach to the converted, I find SCALE to be a very receptive venue for outreach and encourage you to attend and consider submitting a BSD-related talk to SCALE 14x. Special thanks to Gareth Greenaway for reaching out to the BSD community and for the great attitude demonstrated by his team of volunteers. Finally, I would like to thank the FreeBSD Foundation for covering my air travel and O'Reilly Media for allowing me to share a room with one of their amazing team members.