The knew server is ‘frozen’ again. This has been happening at about 0301 UTC each night. See my Twitter feed for background.
In this post I will include details as I progress through the data. The server in question is knew (yes, that’s the hostname).
dtrace hotkernel
I left this running in an ssh session and pressed control-C this morning:
[root@knew:/usr/share/dtrace/toolkit] # ./hotkernel >> /var/tmp/hotkernel
^C
ssh login loop
It was suggested I try an ssh login loop. I created this script and left it running overnight:
while : ; do ssh knew w; sleep 5; done
I left that running in a tmux session on my laptop. This morning, it looks like this:
dan pts/2 dent.int.unixathome.org Tue06PM 7:51 /usr/sbin/dtrace -n
root v0 - Tue07PM 7:11 -csh (csh)
3:04AM up 12:44, 3 users, load averages: 2.32, 1.43, 1.21
USER TTY FROM LOGIN@ IDLE WHAT
dan pts/0 tmux(71172).%0 2:59AM 5 sleep 1
dan pts/2 dent.int.unixathome.org Tue06PM 7:52 /usr/sbin/dtrace -n
root v0 - Tue07PM 7:11 -csh (csh)
3:04AM up 12:44, 3 users, load averages: 2.45, 1.48, 1.23
USER TTY FROM LOGIN@ IDLE WHAT
dan pts/0 tmux(71172).%0 2:59AM 5 sleep 1
dan pts/2 dent.int.unixathome.org Tue06PM 7:52 /usr/sbin/dtrace -n
root v0 - Tue07PM 7:11 -csh (csh)
3:04AM up 12:44, 3 users, load averages: 3.06, 1.62, 1.28
USER TTY FROM LOGIN@ IDLE WHAT
dan pts/0 tmux(71172).%0 2:59AM 5 sleep 1
dan pts/2 dent.int.unixathome.org Tue06PM 7:52 /usr/sbin/dtrace -n
root v0 - Tue07PM 7:11 -csh (csh)
3:05AM up 12:44, 3 users, load averages: 2.89, 1.61, 1.28
USER TTY FROM LOGIN@ IDLE WHAT
dan pts/0 tmux(71172).%0 2:59AM 6 sleep 1
dan pts/2 dent.int.unixathome.org Tue06PM 7:52 /usr/sbin/dtrace -n
root v0 - Tue07PM 7:11 -csh (csh)
Checking my laptop via ps auwwx for ssh, I found (output trimmed for clarity):
$ ps auwx | grep ssh
dan 13096 0.0 0.0 2465364 2948 s012 S+ 1:59PM 0:00.06 ssh -A knew.int.unixathome.org
That’s not the session from above. That’s my ssh session for dtrace, below.
I see no evidence of the ssh-loop script still running on my laptop, despite it having been started in a tmux session.
The last email notice from the frozen server indicates the last login was at 03:01:57 UTC.
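Since the loop output ended up being my main record of when logins stopped, a variant that timestamps every attempt and appends to a log file might be easier to read the next morning. This is only a sketch, not what I ran: the function name probe_once, the log path, and the timeout value in the usage comment are all assumptions for illustration.

```sh
#!/bin/sh
# Hypothetical helper: run one probe command, append its output to a log,
# then append a UTC timestamp and the result.
# The log path is an assumption; override it by setting LOG.
LOG="${LOG:-/tmp/ssh-probe.log}"

probe_once() {
    # "$@" is the command to run, for example: ssh knew w
    ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
    rc=0
    "$@" >>"$LOG" 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        status=ok
    else
        status="failed($rc)"
    fi
    printf '%s %s\n' "$ts" "$status" >>"$LOG"
}

# Usage (runs until interrupted):
#   while :; do
#       probe_once ssh -o ConnectTimeout=10 -o BatchMode=yes knew w
#       sleep 5
#   done
```

ConnectTimeout keeps a hung connection attempt from stalling the loop indefinitely, and BatchMode stops ssh from waiting on a password prompt. The last "ok" line in the log then gives the time of the last successful login.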
Nagios status
Picking a jail at random, Nagios is reporting CHECK_NRPE: Error – Could not complete SSL handshake for services which are checked remotely via nrpe on the frozen server.
ping is OK.
ssh status
Attempting to ssh to one of the jails:
$ telnet ansible 22
Trying 10.55.0.113...
Connected to ansible.int.unixathome.org.
Escape character is '^]'.
SetSockOpt: Invalid argument
Connection closed by foreign host.
$ ssh -vvvv ansible
OpenSSH_6.9p1, LibreSSL 2.1.8
debug1: Reading configuration data /Users/dan/.ssh/config
debug1: /Users/dan/.ssh/config line 328: Applying options for ansible
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 20: Applying options for *
debug1: /etc/ssh/ssh_config line 102: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to ansible.int.unixathome.org [10.55.0.113] port 22.
debug1: Connection established.
debug1: identity file /Users/dan/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Broken pipe
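The telnet test shows the TCP connection being accepted and then dropped before any SSH banner arrives. A healthy sshd sends an identification line beginning with SSH- as soon as the connection opens, so looking at the first line received is a quick way to tell "port open" apart from "sshd actually talking". Here is a rough sketch of that check. The name classify_banner is made up for this post, and the nc invocation in the comment is an assumption about what is installed on the client.

```sh
#!/bin/sh
# Hypothetical helper: classify the first line read from port 22.
classify_banner() {
    case "$1" in
        SSH-*) echo "banner received: $1" ;;
        "")    echo "no banner" ;;           # connect failed, or peer closed early
        *)     echo "unexpected data: $1" ;;
    esac
}

# Usage (assumes nc is available; -w 5 gives up after 5 idle seconds):
#   classify_banner "$(nc -w 5 ansible 22 </dev/null | head -n 1 | tr -d '\r')"
```

On the frozen host I would expect this to report "no banner", which matches what telnet and ssh -vvvv show.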
bacula-sd is active
The bacula-sd daemon in a jail is answering. Nagios reports:
TCP OK - 0.001 second response time on bacula-sd-01.int.unixathome.org port 9103
There are two bacula-sd daemons running on this host, each in its own jail, and both are still answering on port 9103.
bacula-fd is active
There are several bacula-fd daemons running in jails on this host. They are active and answering. Here is the Nagios report:
TCP OK - 0.001 second response time on dbclone.int.unixathome.org port 9102
Debugger invoked
I pressed CTRL-ALT-ESC on the console to invoke the debugger. Then I invoked doadump.
See Twitter post for screen shot.
no dump
There is no dump to be obtained.
errno 6 (ENXIO, "Device not configured") means the dump did not work. The cause: there is no dumpdev specified in /etc/rc.conf.
After I reboot the server, I will add that directive.
Ironically, I verified that dumpdev was not specified in /etc/rc.conf by restoring from a backup. That backup is resident on the frozen server. I can restore a backup from the frozen server, but I can’t look at the actual file on the server. FYI, the file was restored from the frozen server to another server.
I admire how Bacula continues to work while everything else on the box is frozen.
Getting dumps to work
I recompiled my kernel to include:
include GENERIC
ident KNEW
options KDB
options DDB
options BREAK_TO_DEBUGGER
options WITNESS
This server uses an encrypted swap partition. We tried, and failed, to figure out why it would not work:
$ sudo savecore /tank_data/crash /dev/mirror/swap.eli
savecore: error reading last dump header at offset 8589926400 in /dev/mirror/swap.eli: Invalid argument
savecore: no dumps found
Solution: amend /etc/fstab and reboot:
$ cat /etc/fstab
#system/rootfs / zfs rw,noatime 0 0
#/dev/mirror/swap.eli none swap sw 0 0
/dev/mirror/swap none swap sw 0 0
But before rebooting, issue this command: sudo sysrc dumpdev="/dev/mirror/swap"
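Given how long it took to notice that dumpdev was missing, a quick sanity check on the value before rebooting seems worthwhile. The sketch below is hypothetical: dumpdev_ok is a name I invented, and rejecting .eli devices is based only on my experience above with /dev/mirror/swap.eli, not on any general rule.

```sh
#!/bin/sh
# Hypothetical check: succeed only for a dumpdev value that is set,
# is not "NO", and does not name a geli-encrypted (.eli) provider.
dumpdev_ok() {
    case "$1" in
        ""|NO|no) return 1 ;;   # unset or explicitly disabled
        *.eli)    return 1 ;;   # assumption: encrypted swap did not work for me
        *)        return 0 ;;
    esac
}

# On the server (sysrc -n prints just the value):
#   dumpdev_ok "$(sysrc -n dumpdev)" && echo "dumpdev looks usable"
```

After the reboot, dumpon -l should list the dump device the kernel is actually configured to use.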
Now I can dump (see photo).
I can also savecore:
$ sudo savecore /tank_data/crash
savecore: reboot
savecore: writing core to /tank_data/crash/vmcore.2
Let’s see what happens tomorrow morning!