Today was battery replacement day at my place. The UPS batteries were replaced without downtime.
That doesn’t mean there wasn’t downtime.
I am not documenting this in real-time. This is post-event analysis.
First, I wanted to change the battery replacement date, so I ran acptest on the pfSense server, to which the UPS is connected to.
[2.4.5-RELEASE][admin@bast.int.unixathome.org]/root: apctest 2020-08-15 14:51:30 apctest 3.14.14 (31 May 2016) freebsd Checking configuration ... sharenet.type = Network & ShareUPS Disabled cable.type = USB Cable mode.type = USB UPS Driver Setting up the port ... Doing prep_device() ... You are using a USB cable type, so I'm entering USB test mode Hello, this is the apcupsd Cable Test program. This part of apctest is for testing USB UPSes. Getting UPS capabilities...SUCCESS Please select the function you want to perform. 1) Test kill UPS power 2) Perform self-test 3) Read last self-test result 4) View/Change battery date 5) View manufacturing date 6) View/Change alarm behavior 7) View/Change sensitivity 8) View/Change low transfer voltage 9) View/Change high transfer voltage 10) Perform battery calibration 11) Test alarm 12) View/Change self-test interval Q) Quit Select function number: 4 Current battery date: 01/01/2017 Enter new battery date (MM/DD/YYYY), blank to quit: 08/15/2020 Writing new date...SUCCESS Waiting for change to take effect...SUCCESS Current battery date: 08/15/2020
So far so good. I checked the battery charge level via LibreNMS. It looked good.
Now, in hindsight, let’s go look at the graphs. I started with the graphs for another server. It looks like the power shutdown occurred sometime after 1455 and before 1456 UTC (just before 11 AM local time).
Looking in the logs, I see:
Aug 15 14:51:30 slocum apcupsd[2665]: Communications with UPS lost. Aug 15 14:51:37 slocum apcupsd[2665]: Communications with UPS restored. Aug 15 14:52:18 slocum apcupsd[2665]: UPS Self Test switch to battery. Aug 15 14:52:22 slocum apcupsd[2665]: User logins prohibited Aug 15 14:52:25 slocum apcupsd[2665]: Running on UPS batteries. Aug 15 14:52:31 slocum apcupsd[2665]: Communications with UPS lost. Aug 15 14:52:31 slocum apcupsd[2665]: Battery power exhausted. Aug 15 14:52:31 slocum apcupsd[2665]: Initiating system shutdown! Aug 15 14:52:35 slocum shutdown[59710]: halt by root: apcupsd initiated shutdown Aug 15 14:52:35 slocum root[59770]: jail Aug 15 14:52:35 slocum bacula-fd[61967]: Shutting down Bacula service: slocum-fd ... Aug 15 14:52:35 slocum kernel: . Aug 15 14:52:53 slocum apcupsd[2665]: Shutdown because loss of comm with NIS master while on batteries. Aug 15 14:52:53 slocum apcupsd[2665]: Remote shutdown requested. Aug 15 14:52:53 slocum apcupsd[2665]: Initiating system shutdown! Aug 15 14:54:05 slocum rc.shutdown[63254]: 90 second watchdog timeout expired. Shutdown terminated. Aug 15 14:54:05 slocum init[1]: /etc/rc.shutdown terminated abnormally, going to single user mode Aug 15 14:54:05 slocum syslogd: exiting on signal 15
Based on that, let’s look at the UPS graphs for around that time.
Yep, right up until 1450, it showed 100% battery.
Based on that, I proceeded with the test.
1) Test kill UPS power 2) Perform self-test 3) Read last self-test result 4) View/Change battery date 5) View manufacturing date 6) View/Change alarm behavior 7) View/Change sensitivity 8) View/Change low transfer voltage 9) View/Change high transfer voltage 10) Perform battery calibration 11) Test alarm 12) View/Change self-test interval Q) Quit Select function number: 10 This test instructs the UPS to perform a battery calibration operation and reports the result when the process completes. The battery level must be at 100% and the load must be at least 10% to begin this test. Battery level is 100% -- OK Load level is 50% -- OK Clearing previous self test result...CLEARED The battery calibration should automatically end when the battery level drops below about 25%. This process can take minutes or hours, depending on the size of your UPS and the load attached. Initiating battery calibration...INITIATED Waiting for calibration to complete... To abort the calibration, press ENTER. Broadcast Message from root@bast.int.unixathome.org (no tty) at 14:52 UTC... Power failure. Running on UPS batteries. Broadcast Message from root@bast.int.unixathome.org (no tty) at 14:52 UTC... UPS battery power exhausted. Doing shutdown. Broadcast Message from root@bast.int.unixathome.org (no tty) at 14:52 UTC... Beginning Shutdown Sequence *** FINAL System shutdown message from root@bast.int.unixathome.org *** System going down IMMEDIATELY apcupsd initiated shutdown Connection to bast.int.unixathome.org closed by remote host. Connection to bast.int.unixathome.org closed. [dan@pro02:~] $
Oh….
I took my time upstairs before heading down. The servers were all halted, not powered off as I expected. Just now, I changed that in the UPS configuration from Halt (the default) to Power Off. There is no value in leaving them powered on and further draining the batteries.
Downstairs, this was about 25 minutes later, I powered up the first server (mainly my development box), then the second (includes the database server).
I see now that those two servers take up less power combined than the other server (about 90TB of storage on spinning drives).
Now, as I write this, here is the battery charge graph for the past 6 hours.
Ooops. I even remember seeing the 100% charge on the front panel of the UPS after replacing the batteries. Hmm.
I think what happened. With the test, battery level hit 50% and things started to shutdown. Things were proceeded as expected.
At present, here is the UPS configuration.
APC : 001,043,1035 DATE : 2020-08-15 17:15:42 +0000 HOSTNAME : bast.example.org VERSION : 3.14.14 (31 May 2016) freebsd UPSNAME : bast-apcupsd CABLE : USB Cable DRIVER : USB UPS Driver UPSMODE : Stand Alone STARTTIME: 2020-08-15 17:15:40 +0000 MODEL : Smart-UPS 2200 RM STATUS : ONLINE LINEV : 120.2 Volts LOADPCT : 33.1 Percent BCHARGE : 100.0 Percent TIMELEFT : 10.0 Minutes MBATTCHG : 4 Percent MINTIMEL : 3 Minutes MAXTIME : 0 Seconds OUTPUTV : 120.2 Volts SENSE : High DWAKE : -1 Seconds DSHUTD : 90 Seconds LOTRANS : 106.0 Volts HITRANS : 127.0 Volts RETPCT : 0.0 Percent ITEMP : 31.0 C ALARMDEL : No alarm BATTV : 54.8 Volts LINEFREQ : 60.0 Hz LASTXFER : Automatic or explicit self test NUMXFERS : 0 TONBATT : 0 Seconds CUMONBATT: 0 Seconds XOFFBATT : N/A SELFTEST : NO STESTI : 14 days STATFLAG : 0x05000008 MANDATE : 2004-04-06 SERIALNO : YS0415221267 BATTDATE : 2020-08-15 NOMOUTV : 120 Volts NOMBATTV : 48.0 Volts FIRMWARE : 665.4.D USB FW:2.3 END APC : 2020-08-15 17:15:44 +0000
… just for future reference.
If you have your own ideas about what happened, please let me know.