Complete network shutdown – wtf?

This afternoon, I was running an Ansible script to update the snmpd configuration settings on nagios03 (which replaced nagios02 – which was intentionally destroyed earlier today).

After the script completed, I could not access the ngaios3 website, nor could I ssh into the host. My first thoughts: I must have messed up the ssh configuration, I’ll fix it through the console.

I can’t get to anything. Anywhere. WIFI now? No, it’s me that’s down.

From the console in the basement, both servers were nearly powered off. I powered up the hosts.

I had just started hunting around, when they shutdown again.

Getting the UPS back online

It was difficult to get the UPS back online. This is what the screen looked like when I got it back online. When I was trying to get the UPS to power off, that battery was empty. 0%.

When I pressed the power button for 10 seconds, the shutdown would start, then stop and bring me back to this screen.

Eventually, with the power disconnected, I then disconnected both the primary and the second battery sources. That allowed the UPS to shutdown. During this two-minute-off period, I pressed the power button several times to drain the capacitor. I could hear the click-click-click of something each time. I stopped once pressing the button did not initiate that click.

I reconnected the batteries and applied power. The console panel lit up, as shown above.

I tried to take the UPS online by pressing the power button. It would start the power on sequence, and come back to this screen. I tried again. Same results. Perhaps on the third or fourth power on, the UPS came online.

I powered on the servers (I had to do this because I had powered them off from the HALT status earlier on).

Everything came back online.

Logs

With the hosts back online, I ssh’d in to check the logs. This is a regular shutdown -h now. Battery low? There was no power outage.

Jan  3 18:16:42 r730-03 upsmon[1826]: UPS ups02@bast.int.unixathome.org battery is low
Jan  3 18:17:02 r730-03 upsmon[1826]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress
Jan  3 18:17:02 r730-03 upsmon[1826]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Jan  3 18:17:02 r730-03 upsmon[1826]: Executing automatic power-fail shutdown
Jan  3 18:17:02 r730-03 upsmon[1826]: Auto logout and shutdown proceeding
Jan  3 18:17:08 r730-03 shutdown[27388]: halt by root: 
Jan  3 18:17:11 r730-03 kernel:  bacula-sd-04
Jan  3 18:17:34 r730-03 upsd[47256]: Data for UPS [ups02] is stale - check driver
Jan  3 18:17:47 r730-03 kernel:  cliff1
Jan  3 18:17:50 r730-03 kernel:  dbclone
Jan  3 18:17:57 r730-03 kernel:  empty
Jan  3 18:18:01 r730-03 kernel:  fileserver
Jan  3 18:18:05 r730-03 kernel: 
Jan  3 18:18:05 r730-03 kernel: epair5a: promiscuous mode disabled
Jan  3 18:18:05 r730-03 kernel: ix0: link state changed to DOWN
Jan  3 18:18:05 r730-03 kernel: ix0: link state changed to UP
Jan  3 18:18:05 r730-03 rtsold[1531]: <rtsock_input_ifannounce> interface epair5a removed
Jan  3 18:18:05 r730-03 kernel: epair5a: link state changed to DOWN
Jan  3 18:18:05 r730-03 kernel: epair5b: link state changed to DOWN
Jan  3 18:18:05 r730-03 rtsold[1531]: <rtsock_input_ifannounce> interface epair5b removed
Jan  3 18:18:07 r730-03 kernel:  graylog
Jan  3 18:18:09 r730-03 kernel:  tm
Jan  3 18:18:13 r730-03 kernel:  dns2
Jan  3 18:18:16 r730-03 kernel: .
Jan  3 18:18:16 r730-03 syslogd: last message repeated 1 times
Jan  3 18:18:16 r730-03 nrpe[1919]: Caught SIGTERM - shutting down...
Jan  3 18:18:16 r730-03 kernel: .
Jan  3 18:18:16 r730-03 nrpe[1919]: Daemon shutdown
Jan  3 18:18:16 r730-03 bacula-fd[1898]: Shutting down Bacula service: r730-03-fd ...
Jan  3 18:18:16 r730-03 kernel: .
Jan  3 18:18:16 r730-03 root[29252]: /etc/rc.shutdown: WARNING: $powerdxx_enable is not set properly - see rc.conf(5).
Jan  3 18:18:16 r730-03 ntpd[1869]: ntpd exiting on signal 15 (Terminated)
Jan  3 18:18:16 r730-03 kernel: .
Jan  3 18:18:16 r730-03 upsd[47256]: mainloop: Interrupted system call
Jan  3 18:18:20 r730-03 kernel: Waiting (max 60 seconds) for system process `vnlru' to stop... done
Jan  3 18:18:20 r730-03 kernel: Waiting (max 60 seconds) for system process `syncer' to stop... 
Jan  3 18:18:20 r730-03 kernel: 
Jan  3 18:18:20 r730-03 kernel: Syncing disks, vnodes remaining... 0 
Jan  3 18:18:21 r730-03 kernel: 0 
Jan  3 18:18:28 r730-03 syslogd: last message repeated 7 times
Jan  3 18:18:28 r730-03 kernel: done
Jan  3 18:18:28 r730-03 kernel: All buffers synced.
Jan  3 18:18:28 r730-03 kernel: GEOM_ELI: Device mirror/swap.eli destroyed.
Jan  3 18:18:28 r730-03 kernel: GEOM_ELI: Detached mirror/swap.eli on last close.

The metrics

These screen shots are directly from LibreNMS.

The input voltage was OK, the battery voltage dropped.

I can’t interpret this one.

The time remaining on battery takes a plunge.

The power consumed by devices on the UPS seems steady.

There’s a bit of an uptick on the percentage load, but not unusual.

Frequency is fine

That looks to be a straight line down towards zero time remaining on battery.

Current didn’t vary.

Huge battery drain there… All the way to zero in less than an hour

What happened?

From the graphs:

  1. the batteries drained down to near zero
  2. load didn’t vary
  3. run time hit low
  4. software initiated a shutdown

However, after power on, the console indicates the battery is at about 86%, not close to a shutdown level.

Battery is now close to 90%.

There’s no way the batteries could have gone from zero to 90% in the time given (it’s now about 2200).

I’m not sure. I know why the shutdown occurred (“low battery”) but not why that condition was reported.

EDIT: I told one of the battery cells went bad. I should run my volt meter over each one to verify.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top