freebsd-update is a great tool. It reduces my workload. It’s hard to use it wrong, but I have managed to do so. So have other people. Yesterday, three of us encountered the same issue. We did it wrong. We didn’t do it wrong yesterday. We did it wrong months ago, and now it’s come back to haunt us.
NOTE: if you ever see Undefined symbol “ssh_explicit_bzero” on a FreeBSD server, you probably have a mismatched kernel and userland after a partially completed freebsd-update attempt. Solution: search below for UNAME_r.
Fortunately, after all this trouble and strife, we’ve found ways to improve FreeBSD so these situations are brought to your attention. I’ve also created two Nagios plugins to watch out for ugly situations.
Terminology
In this post, I will talk about jails and basejails. I am using the jail feature of FreeBSD, and ezjail is a tool for administering jails. When I refer to a basejail, I mean the basejail concept used by ezjail. In short, a basejail is a set of common directories and files shared on a read-only basis amongst all the jails on the system via nullfs.
When I refer to host, I am referring to the server which is hosting jails, as opposed to the jails themselves, which can also be referred to as a server.
How the problem first appeared
On Wednesday night, after a security announcement, I upgraded four systems at the same time. When done, I rebooted them. I always do this. Three systems came back. One did not. By “came back”, I mean I could ssh in. On the one that did not, which I’ll refer to as the i386 host, every service except sshd was up and running. The other three hosts seemed OK, but sshd in every jail (yes, every jail) was broken.
NYI to the rescue
I filed a ticket with NYI who investigated and reported back with this result:
# /etc/rc.d/sshd start
Performing sanity check on sshd configuration.
/usr/sbin/sshd: Undefined symbol "ssh_explicit_bzero"
/etc/rc.d/sshd: WARNING: failed precmd routine for sshd
That is why sshd was not running on the i386 server.
Time to search.
Other cases
Ouch. Searching, I found this reference to that error, but the solution made me think there was a problem with freebsd-update. It turns out there wasn’t, but an easily reproduced situation causes it to do the wrong thing while it thinks it is doing the right thing.
Tracking down the issue
It was getting late, and all other services were running. I left it and went to bed. Overnight, Erwin Lansing reported the same problem. This meant I really had to open a ticket. A few hours later, a similar ticket was opened.
I searched my email archive and found one of those reboot emails. Those emails look like this:
Subject: Reboot: gelt.example.org 2015-01-15 21:55 +0000 System Events
Auto-Submitted: auto-generated
MIME-Version: 1.0 (mime-construct 1.11)
Message-Id: <20150115215522.EB6571707F@gelt.example.org>
Date: Thu, 15 Jan 2015 21:55:22 +0000 (UTC)
From: logcheck@gelt.example.org (Logcheck system account)

System Events
=-=-=-=-=-=-=
Jan 15 21:48:19 gelt shutdown: reboot by dan:
Jan 15 21:48:21 gelt bacula-fd: Shutting down Bacula service: gelt-fd ...
Jan 15 21:48:21 gelt kernel: .
Jan 15 21:48:21 gelt kernel: .
Jan 15 21:48:22 gelt nrpe[1113]: Caught SIGTERM - shutting down...
Jan 15 21:48:22 gelt nrpe[1113]: Daemon shutdown
Jan 15 21:48:22 gelt openvpn[1109]: event_wait : Interrupted system call (code=4)
Jan 15 21:48:22 gelt openvpn[1109]: ERROR: FreeBSD route delete command failed: external program exited with error status: 77
Jan 15 21:48:22 gelt openvpn[1109]: /sbin/ifconfig tun0 destroy
Jan 15 21:48:22 gelt kernel: tun0: link state changed to DOWN
Jan 15 21:48:22 gelt openvpn[1109]: FreeBSD 'destroy tun interface' failed (non-critical): external program exited with error status: 1
Jan 15 21:48:22 gelt openvpn[1109]: SIGTERM[hard,] received, process exiting
Jan 15 21:48:25 gelt kernel: .
Jan 15 21:48:26 gelt ntpd[840]: ntpd exiting on signal 15
Jan 15 21:48:27 gelt syslogd: exiting on signal 15
Jan 15 21:50:18 gelt syslogd: kernel boot file is /boot/kernel/kernel
Jan 15 21:50:18 gelt kernel: Waiting (max 60 seconds) for system process `vnlru' to stop...done
Jan 15 21:50:18 gelt kernel: Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Jan 15 21:50:18 gelt kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
Jan 15 21:50:18 gelt kernel: Syncing disks, vnodes remaining...6 6 6 4 2 0 0 done
Jan 15 21:50:18 gelt kernel: All buffers synced.
Jan 15 21:50:18 gelt kernel: Uptime: 4m9s
Jan 15 21:50:18 gelt kernel: GEOM_MIRROR: Device gm0: provider mirror/gm0 destroyed.
Jan 15 21:50:18 gelt kernel: GEOM_MIRROR: Device gm0 destroyed.
Jan 15 21:50:18 gelt kernel: usbus0: Controller shutdown
Jan 15 21:50:18 gelt kernel: uhub0: at usbus0, port 1, addr 1 (disconnected)
Jan 15 21:50:18 gelt kernel: usbus0: Controller shutdown complete
Jan 15 21:50:18 gelt kernel: usbus1: Controller shutdown
Jan 15 21:50:18 gelt kernel: uhub1: at usbus1, port 1, addr 1 (disconnected)
Jan 15 21:50:18 gelt kernel: usbus1: Controller shutdown complete
Jan 15 21:50:18 gelt kernel: usbus2: Controller shutdown
Jan 15 21:50:18 gelt kernel: uhub2: at usbus2, port 1, addr 1 (disconnected)
Jan 15 21:50:18 gelt kernel: usbus2: Controller shutdown complete
Jan 15 21:50:18 gelt kernel: usbus3: Controller shutdown
Jan 15 21:50:18 gelt kernel: uhub3: at usbus3, port 1, addr 1 (disconnected)
Jan 15 21:50:18 gelt kernel: usbus3: Controller shutdown complete
Jan 15 21:50:18 gelt kernel: usbus4: Controller shutdown
Jan 15 21:50:18 gelt kernel: uhub4: at usbus4, port 1, addr 1 (disconnected)
Jan 15 21:50:18 gelt kernel: ugen4.2: <vendor 0x0409> at usbus4 (disconnected)
Jan 15 21:50:18 gelt kernel: uhub5: at uhub4, port 1, addr 2 (disconnected)
Jan 15 21:50:18 gelt kernel: ugen4.3: <Peppercon AG> at usbus4 (disconnected)
Jan 15 21:50:18 gelt kernel: ukbd0: at uhub5, port 3, addr 3 (disconnected)
Jan 15 21:50:18 gelt kernel: ums0: at uhub5, port 3, addr 3 (disconnected)
Jan 15 21:50:18 gelt kernel: usbus4: Controller shutdown complete
Jan 15 21:50:18 gelt kernel: Rebooting...
Jan 15 21:50:18 gelt kernel: Copyright (c) 1992-2014 The FreeBSD Project.
Jan 15 21:50:18 gelt kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jan 15 21:50:18 gelt kernel: The Regents of the University of California. All rights reserved.
Jan 15 21:50:18 gelt kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Jan 15 21:50:18 gelt kernel: FreeBSD 9.3-RELEASE-p5 #0: Mon Nov 3 22:02:57 UTC 2014
Jan 15 21:50:18 gelt kernel: root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC i386
Jan 15 21:50:18 gelt kernel: gcc version 4.2.1 20070831 patched [FreeBSD]
Jan 15 21:50:18 gelt kernel: can't re-use a leaf (geom_label)!
Jan 15 21:50:18 gelt kernel: can't re-use a leaf (geom_part_gpt)!
Jan 15 21:50:18 gelt kernel: module_register: module g_label already exists!
Jan 15 21:50:18 gelt kernel: Module g_label failed to register: 17
Jan 15 21:50:18 gelt kernel: module_register: module g_part_gpt already exists!
Jan 15 21:50:18 gelt kernel: Module g_part_gpt failed to register: 17
Jan 15 21:50:18 gelt kernel: CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2394.05-MHz 686-class CPU)
Jan 15 21:50:18 gelt kernel: Origin = "GenuineIntel" Id = 0xf41 Family = 0xf Model = 0x4 Stepping = 1
Jan 15 21:50:18 gelt kernel: Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Jan 15 21:50:18 gelt kernel: Features2=0x441d<SSE3,DTES64,MON,DS_CPL,CNXT-ID,xTPR>
Line 59 tells us the version. FreeBSD 9.3-RELEASE-p5 #0: Mon Nov 3 22:02:57 UTC 2014.
Lines 62-63 relate to something else I have encountered and fixed before.
With this information, Allan Jude and Glen Barber both tried to reproduce the issue, and failed.
We determined that the basejail (yes, we were using ezjail) on one of the servers was FreeBSD 9.2, one version behind the host:
# mount -t zfs system/usr/jails.KEEP/basejail.OLD /mnt
# cd /mnt/bin
# file sh
sh: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), for FreeBSD 9.2, stripped
This explained why sshd was not running on the jails on that host. Those jails had the same problem which the i386 server had.
I think this eventually led Allan Jude to suggest checking the host which had a failed sshd. We ran the same command shown above, and it proved that the userland on that server was also FreeBSD 9.2.
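The check we did by eye can be scripted. What follows is a minimal sketch, not part of any FreeBSD tool: the function name elf_freebsd_release is made up for this example, and it assumes file describes the binary with a "for FreeBSD X.Y" phrase, as it does in the output above.

```shell
#!/bin/sh
# Hypothetical helper: pull the "for FreeBSD X.Y" release out of the
# description that file(1) prints for a binary such as /bin/sh.
# Prints nothing if the description does not contain that phrase.
elf_freebsd_release() {
    printf '%s\n' "$1" | sed -n 's/.*for FreeBSD \([0-9][0-9.]*\).*/\1/p'
}

# Example use on a live system (not run here):
#   elf_freebsd_release "$(file -b /bin/sh)"
```

Taking the description as an argument, rather than calling file inside the function, means the same helper works for the host's /bin/sh and for a copy mounted somewhere else.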
The theory
The theory at this point: the previous freebsd-update from 9.2 to 9.3 was not completed. There are multiple steps and not all of them were done. Thus, we had a 9.3 kernel, but a 9.2 userland. This is a classic situation.
Thus, when freebsd-update ran last night, it took the running kernel’s version from uname and thought it was upgrading from 9.3 to the latest 9.3. It wasn’t: the userland was still 9.2. This is how the problem arose.
The proof
Allan then attempted to duplicate my dead freebsd-update. He was successful. Then he tried to update that failed system and was again successful. For me, the solution was this update:
env UNAME_r=9.2-RELEASE freebsd-update upgrade -r 9.3-RELEASE
freebsd-update install
freebsd-update install
freebsd-update install
NOTE: The above will fix your host if you’ve got a mismatched kernel and userland. It’s worked for me a few times.
That got the failed host back to a fully functional state. Elapsed time, about 24 hours.
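For other version pairs, the same recipe should apply with different release names. Here is a small sketch that only prints the commands for review rather than running them. The function name and its two arguments are my own invention, and it assumes you have already confirmed the real userland release.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the recovery commands for a
# host whose userland is still at $1 while the kernel is already at $2.
# Setting UNAME_r makes uname -r report that release, which is what
# freebsd-update reads to decide where it is upgrading from.
print_recovery_commands() {
    userland_release=$1   # e.g. 9.2-RELEASE
    target_release=$2     # e.g. 9.3-RELEASE
    echo "env UNAME_r=${userland_release} freebsd-update upgrade -r ${target_release}"
    # Three install passes, matching the sequence used above.
    for pass in 1 2 3; do
        echo "freebsd-update install"
    done
}

# Example: print_recovery_commands 9.2-RELEASE 9.3-RELEASE
```

Printing instead of executing is deliberate: you want to read these commands before running them on a host that is already in a bad state.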
Improvements for FreeBSD
Given that three people reported the same problem on the same day, consensus was we had to improve this. Glen Barber contributed a periodic script to check for this situation.
Allan Jude has submitted patches for freebsd-update which will alert you to this situation.
Improvements for my Nagios
After looking at Glen’s script, I realized that uname on FreeBSD 9.3 and later has flags for both the userland and the kernel version:
-K      Write the FreeBSD version of the kernel.
-U      Write the FreeBSD version of the user environment.
That made writing my Nagios plugins very easy. Just compare the output of the two calls. If I ever fail to complete the freebsd-update between versions again, Nagios will tell me.
The key to that plugin is:
KERNEL=`uname -K`
USERLAND=`uname -U`
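To show how little is needed around those two calls, here is a sketch of the comparison. It is not the actual plugin: the function name and the message wording are assumptions, and I have assumed a mismatch should be a WARNING rather than a CRITICAL.

```shell
#!/bin/sh
# Sketch of a Nagios-style check; not the actual plugin.
# Takes the two version strings as arguments so the comparison can be
# exercised on any system.
# Exit codes follow the Nagios convention: 0 = OK, 1 = WARNING.
check_kernel_userland() {
    kernel=$1
    userland=$2
    if [ "$kernel" = "$userland" ]; then
        echo "OK: kernel and userland are both at ${kernel}"
        return 0
    fi
    echo "WARNING: kernel (${kernel}) and userland (${userland}) differ"
    return 1
}

# On FreeBSD 9.3 or later this would be called as:
#   check_kernel_userland "$(uname -K)" "$(uname -U)"; exit $?
```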
But host versus jail!
You’ll remember that we discovered the basejail was on a different version. If you’re doing this intentionally, that’s OK. But if you’re not, you really want to know about it. It is this situation which prevented the jails from starting sshd. That’s why I created the check_host_basejail (same URL as the previous one) Nagios plugin.
The key to this plugin is:
HOST=`/usr/bin/file -b /bin/sh`
JAIL=`/usr/bin/file -b /usr/jails/basejail/bin/sh`
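Here is a sketch of the comparison that could sit around those two values. The real check_host_basejail plugin may differ; the function name and the OK message are my assumptions, and the WARNING text is modelled on the plugin output shown later in this post.

```shell
#!/bin/sh
# Sketch only; the real check_host_basejail plugin may differ.
# Compares the file(1) descriptions of two copies of /bin/sh.
# Exit codes follow the Nagios convention: 0 = OK, 1 = WARNING.
compare_host_basejail() {
    host=$1
    jail=$2
    if [ "$host" = "$jail" ]; then
        echo "OK: host and basejail are in sync: ${host}"
        return 0
    fi
    echo "WARNING: Host and basejail are NOT in sync: host = ${host} basejail = ${jail}"
    return 1
}

# With the basejail path used above, a call might look like:
#   compare_host_basejail "$(/usr/bin/file -b /bin/sh)" \
#       "$(/usr/bin/file -b /usr/jails/basejail/bin/sh)"
```

Comparing the whole description is crude, but it means any difference at all between the two binaries gets reported.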
Here’s what I found:
That second host, running an 8.2 jail on a 9.3 host, that’s intentional. Long story. Stuff which won’t be upgraded…
Basejails can be affected too
I had a problem on the knew (yes, that’s the right name) server:
WARNING: Host and basejail are NOT in sync: host = ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), for FreeBSD 9.3, stripped basejail = ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), for FreeBSD 9.1, stripped
The solution for this was:
# zfs snapshot -r system/usr/jails@BeforeUpgradingBaseJailFrom9.1To9.3
# ezjail-admin update -s 9.1-RELEASE -U
We specify -s 9.1-RELEASE because the basejail is on 9.1-RELEASE. Without it, freebsd-update would use the version supplied by uname, which is the host’s version and not what we want here.
After that, the host and basejail were in sync:
[dan@knew:/etc] $ file /bin/sh
/bin/sh: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), for FreeBSD 9.3, stripped
[dan@knew:/etc] $ file /usr/jails/basejail/bin/sh
/usr/jails/basejail/bin/sh: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked (uses shared libs), for FreeBSD 9.3, stripped
[dan@knew:/etc] $
I hit this situation today at $WORK, but it was 9.1 /bin/sh and 9.3 userland.
try this:
rm -rfv /var/empty && mkdir /var/empty
then restart ssh