Unintended shutdown of the #homelab

Today was battery replacement day at my place. The UPS batteries were replaced without downtime.

That doesn’t mean there wasn’t downtime.

I am not documenting this in real-time. This is post-event analysis.

First, I wanted to change the battery replacement date, so I ran acptest on the pfSense server, to which the UPS is connected to.

[2.4.5-RELEASE][admin@bast.int.unixathome.org]/root: apctest


2020-08-15 14:51:30 apctest 3.14.14 (31 May 2016) freebsd
Checking configuration ...
sharenet.type = Network & ShareUPS Disabled
cable.type = USB Cable
mode.type = USB UPS Driver
Setting up the port ...
Doing prep_device() ...

You are using a USB cable type, so I'm entering USB test mode
Hello, this is the apcupsd Cable Test program.
This part of apctest is for testing USB UPSes.

Getting UPS capabilities...SUCCESS

Please select the function you want to perform.

1)  Test kill UPS power
2)  Perform self-test
3)  Read last self-test result
4)  View/Change battery date
5)  View manufacturing date
6)  View/Change alarm behavior
7)  View/Change sensitivity
8)  View/Change low transfer voltage
9)  View/Change high transfer voltage
10) Perform battery calibration
11) Test alarm
12) View/Change self-test interval
 Q) Quit

Select function number: 4

Current battery date: 01/01/2017
Enter new battery date (MM/DD/YYYY), blank to quit: 08/15/2020

Writing new date...SUCCESS
Waiting for change to take effect...SUCCESS
Current battery date: 08/15/2020

So far so good. I checked the battery charge level via LibreNMS. It looked good.

Now, in hindsight, let’s go look at the graphs. I started with the graphs for another server. It looks like the power shutdown occurred sometime after 1455 and before 1456 UTC (just before 11 AM local time).

Availability graph showing server going away at about 1455 UTC
Availability graph showing server going away at about 1455 UTC

Looking in the logs, I see:

Aug 15 14:51:30 slocum apcupsd[2665]: Communications with UPS lost.
Aug 15 14:51:37 slocum apcupsd[2665]: Communications with UPS restored.
Aug 15 14:52:18 slocum apcupsd[2665]: UPS Self Test switch to battery.
Aug 15 14:52:22 slocum apcupsd[2665]: User logins prohibited
Aug 15 14:52:25 slocum apcupsd[2665]: Running on UPS batteries.
Aug 15 14:52:31 slocum apcupsd[2665]: Communications with UPS lost.
Aug 15 14:52:31 slocum apcupsd[2665]: Battery power exhausted.
Aug 15 14:52:31 slocum apcupsd[2665]: Initiating system shutdown!
Aug 15 14:52:35 slocum shutdown[59710]: halt by root: apcupsd initiated shutdown
Aug 15 14:52:35 slocum root[59770]: jail
Aug 15 14:52:35 slocum bacula-fd[61967]: Shutting down Bacula service: slocum-fd ...
Aug 15 14:52:35 slocum kernel: .
Aug 15 14:52:53 slocum apcupsd[2665]: Shutdown because loss of comm with NIS master while on batteries.
Aug 15 14:52:53 slocum apcupsd[2665]: Remote shutdown requested.
Aug 15 14:52:53 slocum apcupsd[2665]: Initiating system shutdown!
Aug 15 14:54:05 slocum rc.shutdown[63254]: 90 second watchdog timeout expired. Shutdown terminated.
Aug 15 14:54:05 slocum init[1]: /etc/rc.shutdown terminated abnormally, going to single user mode
Aug 15 14:54:05 slocum syslogd: exiting on signal 15

Based on that, let’s look at the UPS graphs for around that time.

Battery charge, 100%
Battery charge, 100%

Yep, right up until 1450, it showed 100% battery.

Based on that, I proceeded with the test.

1)  Test kill UPS power
2)  Perform self-test
3)  Read last self-test result
4)  View/Change battery date
5)  View manufacturing date
6)  View/Change alarm behavior
7)  View/Change sensitivity
8)  View/Change low transfer voltage
9)  View/Change high transfer voltage
10) Perform battery calibration
11) Test alarm
12) View/Change self-test interval
 Q) Quit

Select function number: 10

This test instructs the UPS to perform a battery calibration
operation and reports the result when the process completes.
The battery level must be at 100% and the load must be at least
10% to begin this test.

Battery level is 100% -- OK
Load level is 50% -- OK
Clearing previous self test result...CLEARED

The battery calibration should automatically end
when the battery level drops below about 25%.
This process can take minutes or hours, depending on
the size of your UPS and the load attached.

Initiating battery calibration...INITIATED

Waiting for calibration to complete...
To abort the calibration, press ENTER.


Broadcast Message from root@bast.int.unixathome.org 
        (no tty) at 14:52 UTC...

Power failure. Running on UPS batteries.

Broadcast Message from root@bast.int.unixathome.org
        (no tty) at 14:52 UTC...

UPS battery power exhausted. Doing shutdown.


Broadcast Message from root@bast.int.unixathome.org
        (no tty) at 14:52 UTC...

Beginning Shutdown Sequence                                                                               
                                                                               
*** FINAL System shutdown message from root@bast.int.unixathome.org ***      

System going down IMMEDIATELY


apcupsd initiated shutdown

Connection to bast.int.unixathome.org closed by remote host.
Connection to bast.int.unixathome.org closed.
[dan@pro02:~] $ 

Oh….

I took my time upstairs before heading down. The servers were all halted, not powered off as I expected. Just now, I changed that in the UPS configuration from Halt (the default) to Power Off. There is no value in leaving them powered on and further draining the batteries.

Downstairs, this was about 25 minutes later, I powered up the first server (mainly my development box), then the second (includes the database server).

I see now that those two servers take up less power combined than the other server (about 90TB of storage on spinning drives).

Now, as I write this, here is the battery charge graph for the past 6 hours.

Server availability showing it going away at about 1455 UTC
Server availability showing it going away at about 1455 UTC

Ooops. I even remember seeing the 100% charge on the front panel of the UPS after replacing the batteries. Hmm.

I think what happened. With the test, battery level hit 50% and things started to shutdown. Things were proceeded as expected.

At present, here is the UPS configuration.

APC      : 001,043,1035
DATE     : 2020-08-15 17:15:42 +0000  
HOSTNAME : bast.example.org
VERSION  : 3.14.14 (31 May 2016) freebsd
UPSNAME  : bast-apcupsd
CABLE    : USB Cable
DRIVER   : USB UPS Driver
UPSMODE  : Stand Alone
STARTTIME: 2020-08-15 17:15:40 +0000  
MODEL    : Smart-UPS 2200 RM 
STATUS   : ONLINE 
LINEV    : 120.2 Volts
LOADPCT  : 33.1 Percent
BCHARGE  : 100.0 Percent
TIMELEFT : 10.0 Minutes
MBATTCHG : 4 Percent
MINTIMEL : 3 Minutes
MAXTIME  : 0 Seconds
OUTPUTV  : 120.2 Volts
SENSE    : High
DWAKE    : -1 Seconds
DSHUTD   : 90 Seconds
LOTRANS  : 106.0 Volts
HITRANS  : 127.0 Volts
RETPCT   : 0.0 Percent
ITEMP    : 31.0 C
ALARMDEL : No alarm
BATTV    : 54.8 Volts
LINEFREQ : 60.0 Hz
LASTXFER : Automatic or explicit self test
NUMXFERS : 0
TONBATT  : 0 Seconds
CUMONBATT: 0 Seconds
XOFFBATT : N/A
SELFTEST : NO
STESTI   : 14 days
STATFLAG : 0x05000008
MANDATE  : 2004-04-06
SERIALNO : YS0415221267
BATTDATE : 2020-08-15
NOMOUTV  : 120 Volts
NOMBATTV : 48.0 Volts
FIRMWARE : 665.4.D USB FW:2.3
END APC  : 2020-08-15 17:15:44 +0000  

… just for future reference.

If you have your own ideas about what happened, please let me know.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top