This afternoon, I was running an Ansible script to update the snmpd configuration settings on nagios03 (which replaced nagios02 – which was intentionally destroyed earlier today).
After the script completed, I could not access the ngaios3 website, nor could I ssh into the host. My first thoughts: I must have messed up the ssh configuration, I’ll fix it through the console.
I can’t get to anything. Anywhere. WIFI now? No, it’s me that’s down.
From the console in the basement, both servers were nearly powered off. I powered up the hosts.
I had just started hunting around, when they shutdown again.
Getting the UPS back online
It was difficult to get the UPS back online. This is what the screen looked like when I got it back online. When I was trying to get the UPS to power off, that battery was empty. 0%.
When I pressed the power button for 10 seconds, the shutdown would start, then stop and bring me back to this screen.
Eventually, with the power disconnected, I then disconnected both the primary and the second battery sources. That allowed the UPS to shutdown. During this two-minute-off period, I pressed the power button several times to drain the capacitor. I could hear the click-click-click of something each time. I stopped once pressing the button did not initiate that click.
I reconnected the batteries and applied power. The console panel lit up, as shown above.
I tried to take the UPS online by pressing the power button. It would start the power on sequence, and come back to this screen. I tried again. Same results. Perhaps on the third or fourth power on, the UPS came online.
I powered on the servers (I had to do this because I had powered them off from the HALT status earlier on).
Everything came back online.
Logs
With the hosts back online, I ssh’d in to check the logs. This is a regular shutdown -h now. Battery low? There was no power outage.
Jan 3 18:16:42 r730-03 upsmon[1826]: UPS ups02@bast.int.unixathome.org battery is low Jan 3 18:17:02 r730-03 upsmon[1826]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress Jan 3 18:17:02 r730-03 upsmon[1826]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED Jan 3 18:17:02 r730-03 upsmon[1826]: Executing automatic power-fail shutdown Jan 3 18:17:02 r730-03 upsmon[1826]: Auto logout and shutdown proceeding Jan 3 18:17:08 r730-03 shutdown[27388]: halt by root: Jan 3 18:17:11 r730-03 kernel: bacula-sd-04 Jan 3 18:17:34 r730-03 upsd[47256]: Data for UPS [ups02] is stale - check driver Jan 3 18:17:47 r730-03 kernel: cliff1 Jan 3 18:17:50 r730-03 kernel: dbclone Jan 3 18:17:57 r730-03 kernel: empty Jan 3 18:18:01 r730-03 kernel: fileserver Jan 3 18:18:05 r730-03 kernel: Jan 3 18:18:05 r730-03 kernel: epair5a: promiscuous mode disabled Jan 3 18:18:05 r730-03 kernel: ix0: link state changed to DOWN Jan 3 18:18:05 r730-03 kernel: ix0: link state changed to UP Jan 3 18:18:05 r730-03 rtsold[1531]: <rtsock_input_ifannounce> interface epair5a removed Jan 3 18:18:05 r730-03 kernel: epair5a: link state changed to DOWN Jan 3 18:18:05 r730-03 kernel: epair5b: link state changed to DOWN Jan 3 18:18:05 r730-03 rtsold[1531]: <rtsock_input_ifannounce> interface epair5b removed Jan 3 18:18:07 r730-03 kernel: graylog Jan 3 18:18:09 r730-03 kernel: tm Jan 3 18:18:13 r730-03 kernel: dns2 Jan 3 18:18:16 r730-03 kernel: . Jan 3 18:18:16 r730-03 syslogd: last message repeated 1 times Jan 3 18:18:16 r730-03 nrpe[1919]: Caught SIGTERM - shutting down... Jan 3 18:18:16 r730-03 kernel: . Jan 3 18:18:16 r730-03 nrpe[1919]: Daemon shutdown Jan 3 18:18:16 r730-03 bacula-fd[1898]: Shutting down Bacula service: r730-03-fd ... Jan 3 18:18:16 r730-03 kernel: . Jan 3 18:18:16 r730-03 root[29252]: /etc/rc.shutdown: WARNING: $powerdxx_enable is not set properly - see rc.conf(5). Jan 3 18:18:16 r730-03 ntpd[1869]: ntpd exiting on signal 15 (Terminated) Jan 3 18:18:16 r730-03 kernel: . Jan 3 18:18:16 r730-03 upsd[47256]: mainloop: Interrupted system call Jan 3 18:18:20 r730-03 kernel: Waiting (max 60 seconds) for system process `vnlru' to stop... done Jan 3 18:18:20 r730-03 kernel: Waiting (max 60 seconds) for system process `syncer' to stop... Jan 3 18:18:20 r730-03 kernel: Jan 3 18:18:20 r730-03 kernel: Syncing disks, vnodes remaining... 0 Jan 3 18:18:21 r730-03 kernel: 0 Jan 3 18:18:28 r730-03 syslogd: last message repeated 7 times Jan 3 18:18:28 r730-03 kernel: done Jan 3 18:18:28 r730-03 kernel: All buffers synced. Jan 3 18:18:28 r730-03 kernel: GEOM_ELI: Device mirror/swap.eli destroyed. Jan 3 18:18:28 r730-03 kernel: GEOM_ELI: Detached mirror/swap.eli on last close.
The metrics
These screen shots are directly from LibreNMS.
The input voltage was OK, the battery voltage dropped.
I can’t interpret this one.
The time remaining on battery takes a plunge.
The power consumed by devices on the UPS seems steady.
There’s a bit of an uptick on the percentage load, but not unusual.
Frequency is fine
That looks to be a straight line down towards zero time remaining on battery.
Current didn’t vary.
Huge battery drain there… All the way to zero in less than an hour
What happened?
From the graphs:
- the batteries drained down to near zero
- load didn’t vary
- run time hit low
- software initiated a shutdown
However, after power on, the console indicates the battery is at about 86%, not close to a shutdown level.
Battery is now close to 90%.
There’s no way the batteries could have gone from zero to 90% in the time given (it’s now about 2200).
I’m not sure. I know why the shutdown occurred (“low battery”) but not why that condition was reported.
EDIT: I told one of the battery cells went bad. I should run my volt meter over each one to verify.