Dec 14, 2016
 

The knew server is ‘frozen’ again. This has been happening every night at about 0301 UTC. See my Twitter feed for background.

In this post I will include details as I progress through the data. The server in question is knew (yes, that’s the hostname).

dtrace hotkernel

I left this running in an ssh session and pressed control-C this morning:

[root@knew:/usr/share/dtrace/toolkit] #  ./hotkernel >> /var/tmp/hotkernel
^C

ssh login loop

It was suggested I try an ssh-login loop. I created this script and left it running overnight:

while : ; do ssh knew w; sleep 5; done

I left that running in a tmux session on my laptop. This morning, it looks like this:

dan        pts/2    dent.int.unixathome.org  Tue06PM  7:51 /usr/sbin/dtrace -n
root       v0       -                        Tue07PM  7:11 -csh (csh)
 3:04AM  up 12:44, 3 users, load averages: 2.32, 1.43, 1.21
USER       TTY      FROM                      LOGIN@  IDLE WHAT
dan        pts/0    tmux(71172).%0            2:59AM     5 sleep 1
dan        pts/2    dent.int.unixathome.org  Tue06PM  7:52 /usr/sbin/dtrace -n
root       v0       -                        Tue07PM  7:11 -csh (csh)
 3:04AM  up 12:44, 3 users, load averages: 2.45, 1.48, 1.23
USER       TTY      FROM                      LOGIN@  IDLE WHAT
dan        pts/0    tmux(71172).%0            2:59AM     5 sleep 1
dan        pts/2    dent.int.unixathome.org  Tue06PM  7:52 /usr/sbin/dtrace -n
root       v0       -                        Tue07PM  7:11 -csh (csh)
 3:04AM  up 12:44, 3 users, load averages: 3.06, 1.62, 1.28
USER       TTY      FROM                      LOGIN@  IDLE WHAT
dan        pts/0    tmux(71172).%0            2:59AM     5 sleep 1
dan        pts/2    dent.int.unixathome.org  Tue06PM  7:52 /usr/sbin/dtrace -n
root       v0       -                        Tue07PM  7:11 -csh (csh)
 3:05AM  up 12:44, 3 users, load averages: 2.89, 1.61, 1.28
USER       TTY      FROM                      LOGIN@  IDLE WHAT
dan        pts/0    tmux(71172).%0            2:59AM     6 sleep 1
dan        pts/2    dent.int.unixathome.org  Tue06PM  7:52 /usr/sbin/dtrace -n
root       v0       -                        Tue07PM  7:11 -csh (csh)

Checking my laptop for ssh processes with ps auwx, I found (output trimmed for clarity):

$ ps auwx | grep ssh
dan              13096   0.0  0.0  2465364   2948 s012  S+    1:59PM   0:00.06 ssh -A knew.int.unixathome.org

That’s not the session from above. That’s my ssh session for dtrace, below.

I see no evidence of the ssh-loop script still running on my laptop, despite it having been started in a tmux session.

The last email notice from the frozen server indicates the last login was at 03:01:57 UTC.
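Since w only shows times to the minute and the loop left no record of when it stopped, a variant that timestamps every attempt and gives up on a stalled connection might be easier to read after the fact. This is only a sketch, not what I ran: the probe function name and the SSH override variable are made up for illustration, and it assumes key-based login so that BatchMode can refuse to prompt.

```shell
#!/bin/sh
# Hypothetical helper: one timestamped login attempt against a host.
# SSH can be overridden for testing (an assumption of this sketch).
probe() {
    host=$1
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # BatchMode: never prompt. ConnectTimeout: limit how long ssh waits
    # while connecting. ServerAliveInterval/ServerAliveCountMax: drop a
    # session that stops responding after it is established.
    if out=$(${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=10 \
             -o ServerAliveInterval=5 -o ServerAliveCountMax=2 \
             "$host" w 2>&1); then
        status=OK
    else
        status=FAIL
    fi
    printf '%s %s %s\n' "$ts" "$status" "$host"
    # Keep whatever the remote side (or ssh itself) printed.
    [ -n "$out" ] && printf '%s\n' "$out"
    return 0
}
```

It could be driven the same way as the original loop, for example: while : ; do probe knew >> "$HOME/knew-probe.log"; sleep 5; done. The log file name is arbitrary.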

Nagios status

Picking a jail at random, Nagios is reporting CHECK_NRPE: Error – Could not complete SSL handshake for services which are checked remotely via nrpe on the frozen server.

ping is OK.

ssh status

Attempting to ssh to one of the jails:

$ telnet ansible 22
Trying 10.55.0.113...
Connected to ansible.int.unixathome.org.
Escape character is '^]'.
SetSockOpt: Invalid argument
Connection closed by foreign host.
$ ssh -vvvv ansible
OpenSSH_6.9p1, LibreSSL 2.1.8
debug1: Reading configuration data /Users/dan/.ssh/config
debug1: /Users/dan/.ssh/config line 328: Applying options for ansible
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 20: Applying options for *
debug1: /etc/ssh/ssh_config line 102: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to ansible.int.unixathome.org [10.55.0.113] port 22.
debug1: Connection established.
debug1: identity file /Users/dan/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/dan/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Broken pipe
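
For context: a healthy sshd normally announces itself with an identification line beginning with SSH- (for example SSH-2.0-...) as soon as the TCP connection opens. No such line appears in the telnet session above. A rough sketch of checking for it from a script follows; is_ssh_banner is a hypothetical helper name, and the nc invocation in the comments is an assumption about what is installed, since nc behaves differently from one implementation to the next.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given line looks like an SSH
# identification string ("SSH-protoversion-softwareversion").
is_ssh_banner() {
    case "$1" in
        SSH-*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Possible use on a live network (assumes nc is available; not run here):
#   banner=$(nc -w 5 ansible 22 </dev/null | head -n 1)
#   if is_ssh_banner "$banner"; then
#       echo "sshd answered: $banner"
#   else
#       echo "no SSH banner, got: $banner"
#   fi
```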

bacula-sd is active

The bacula-sd daemon in a jail is answering. Nagios reports:

TCP OK - 0.001 second response time on bacula-sd-01.int.unixathome.org port 9103 

Two bacula-sd daemons run on this host, each in its own jail, and both are still answering on port 9103.

bacula-fd is active

There are several bacula-fd daemons running in jails on this host. They are active and answering. Here is the Nagios report:

TCP OK - 0.001 second response time on dbclone.int.unixathome.org port 9102 

Debugger invoked

I pressed CTRL-ALT-ESC on the console to invoke the debugger. Then I invoked doadump.

See the Twitter post for a screenshot.

no dump

There is no dump to be obtained.

The dump failed with errno 6, ENXIO (“Device not configured”). The cause: no dumpdev is specified in /etc/rc.conf.

After I reboot the server, I will add that directive.

Ironically, I verified that dumpdev was not specified in /etc/rc.conf by restoring from a backup. That backup is resident on the frozen server. I can restore a backup from the frozen server, but I can’t look at the actual file on the server. FYI, the file was restored from the frozen server to another server.

I admire how Bacula continues to work while everything else on the box is frozen.

Getting dumps to work

I recompiled my kernel to include:

include GENERIC
ident KNEW

options KDB
options DDB
options BREAK_TO_DEBUGGER
options WITNESS

This server uses an encrypted swap partition. We tried, and failed, to figure out why savecore would not work with it:

$ sudo savecore /tank_data/crash /dev/mirror/swap.eli
savecore: error reading last dump header at offset 8589926400 in /dev/mirror/swap.eli: Invalid argument
savecore: no dumps found

Solution: amend /etc/fstab to use the unencrypted swap device, then reboot:

$ cat /etc/fstab
#system/rootfs        /    zfs  rw,noatime 0 0
#/dev/mirror/swap.eli none swap sw         0 0
/dev/mirror/swap none swap sw         0 0

But before rebooting, issue this command: sudo sysrc dumpdev="/dev/mirror/swap"
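
Given that the first dump attempt failed because no dump device was configured, a quick sanity check on the setting before the next freeze seems worthwhile. The sketch below is an assumption on my part rather than something from the steps above: dumpdev_ok is a hypothetical helper, and it treats an empty value or NO as “crash dumps disabled”, which is how I understand the rc.conf convention.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a dumpdev value names a device (or AUTO),
# fails if it is empty or NO (dumps disabled, per the rc.conf convention).
dumpdev_ok() {
    case "$1" in
        ""|[Nn][Oo]) return 1 ;;
        *)           return 0 ;;
    esac
}

# Possible use on the FreeBSD host (not run here):
#   dumpdev_ok "$(sysrc -n dumpdev)" || echo "dumpdev is not set in rc.conf"
#   dumpon -l    # after the reboot: show the dump device actually in use
```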

Now I can dump (see photo).

I can also savecore:

$ sudo savecore /tank_data/crash 
savecore: reboot
savecore: writing core to /tank_data/crash/vmcore.2

Let’s see what happens tomorrow morning!
