Today I tested the shutdown but was mostly concerned by the startup. I wanted the hosts to start in a specific order. I lived tweeted it and that thread is the basis for this post.
The UPS has three groups of outlets: primary group, group 1, group 2. Each group has a On Delay setting. You could choose which gear to power on first. I have all mine set to zero, so all outlets power on when the UPS comes online.
I will be using my APC PDUs to control the power-on order.
PDU configuration
These PDUs are old. Both the ssh and the https interfaces use ciphers which modern clients refuse to use.
$ ssh -oKexAlgorithms=+diffie-hellman-group-exchange-sha1 -oCiphers=+aes256-cbc pdu1 no matching key exchange method found. Their offer: diffie-hellman-group1-sha1 $ ssh -oKexAlgorithms=+ diffie-hellman-group1-sha1 -oCiphers=+aes256-cbc pdu1 next:no matching cipher found. Their offer: 3des-cbc,blowfish-cbc $ ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 -oCiphers=+3des-cbc pdu1 ssh_ssh_dispatch_run_fatal: Connection to 10.52.0.2 port 22: Invalid key length
This idea came from debdrup:
setup a jail with misc/compatNx for whichever major release has the older SSH client you need.
Longer term, I like that.
For today, it’s a serial cable.
My servers each have 2 power supplies. I’ll plug each one into a different PDU.
I’ll configuring the power-on delay of each outlet so that r710-01 powers on first, then knew, followed by slocum.
The firewall and switches are all connected to the same non-managed PDU at the top of the rack. It powers up first. The firewall will be operational before any other host.
BIOS – power up or last state
Most system BIOS allow you to configure what happens when power is supplied to the host. Last state and power on are two common settings. Last state means if it was off, leave it off. If it was on, boot. I prefer power on. It also allows the power cycling of a managed outlet and have the host power up.
upsmon.conf – FINALDELAY
During the test of the shutdown process, I determined that FINALDELAY needs to be 210 seconds to give the hosts enough time to shutdown before powering off the UPS. That settings goes into upsmon.conf. NOTE: I later remove this during my testing.
upsmon.conf – SHUDOWNCMD
I have changed this setting in /usr/local/etc/nut/upsmon.conf on each host:
old: SHUTDOWNCMD “/sbin/shutdown -h +0”
new: SHUTDOWNCMD “/sbin/shutdown -p now”
I notice now that on my pfSense box, it still uses the old command. It works fine. Given that this is a power saving issue, pfSense is not a power issue; it is a APU2 (photos).
I want the hosts to power off, not sit idle until the UPS shuts down. Depending upon the BIOS reconfiguration, I may have to revert. NOTE: revert was not done.
Test 1 power down
The command, as used in https://dan.langille.org/2020/09/10/nut-testing-the-shutdown-mechanism/…, to simulate battery is low, shutdown now is:
/usr/local/sbin/upsmon -c fsd
I issued that command on the pfSense box because that is the nut primary.
I timed the things I could see happening and this is the summary:
+ 8 s notices on hosts + 50 r720 connection drops + 25 knew connection drops + 18 slocum connection drops + 21 knew powers off + 55 slocum powers off + 8 r720 powers off + 35 bast starts shutdown + 20 UPS stops replying to ping +120? lights power off on rack as UPS shuts down
From the UPS logs:
2020/09/12 15:02:02 Master system shutdown is activated
From pfSense (the nut primary):
Sep 12 14:58:06 bast upsmon[50050]: Signal 30: User requested FSD Sep 12 14:58:06 bast upsd[50219]: Client local-monitor@::1 set FSD on UPS [ups02] Sep 12 14:58:06 bast upsmon[50050]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED Sep 12 14:58:14 bast sshd[60370]: Connection closed by 2001:470:8abf:7055:b6f9:d572:6622:ea2d port 15668 [preauth] Sep 12 14:58:16 bast upsmon[50050]: Executing automatic power-fail shutdown Sep 12 14:58:16 bast upsmon[50050]: Auto logout and shutdown proceeding Sep 12 14:59:26 bast upsd[50219]: write() failed for 2001:470:8abf:7055:21b:21ff:fe39:a9a1: Connection reset by peer Sep 12 14:59:26 bast upsd[50219]: write() failed for 2001:470:8abf:7055:21b:21ff:fe39:a9a1: Broken pipe Sep 12 15:01:46 bast shutdown: halt by root: Sep 12 15:01:55 bast syslogd: exiting on signal 15
on knew:
Sep 12 14:58:10 knew upsmon[93843]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress Sep 12 14:58:10 knew upsmon[93843]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED Sep 12 14:58:10 knew upsmon[93843]: Executing automatic power-fail shutdown Sep 12 14:58:10 knew upsmon[93843]: Auto logout and shutdown proceeding Sep 12 14:58:15 knew shutdown[97618]: power-down by dan: Sep 12 14:58:15 knew bacula-fd[34545]: Shutting down Bacula service: knew-fd ... Sep 12 14:58:43 knew kernel: pid 16101 (bacula-sd), jid 8, uid 910: exited on signal 10 Sep 12 14:59:22 knew kernel: vnet0.1: link state changed to DOWN Sep 12 14:59:22 knew kernel: epair0b: link state changed to DOWN Sep 12 14:59:23 knew kernel: ix0: link state changed to DOWN Sep 12 14:59:23 knew kernel: ix0: link state changed to UP Sep 12 14:59:24 knew nrpe[33429]: Caught SIGTERM - shutting down... Sep 12 14:59:24 knew kernel: , 2285. Sep 12 14:59:24 knew nrpe[33429]: Daemon shutdown Sep 12 14:59:25 knew kernel: , 33429. Sep 12 14:59:25 knew kernel: . Sep 12 14:59:25 knew upsd[3403]: mainloop: Interrupted system call Sep 12 14:59:25 knew ntpd[2149]: ntpd exiting on signal 15 (Terminated) Sep 12 14:59:26 knew kernel: , 2149. Sep 12 14:59:27 knew syslogd: exiting on signal 15
On slocum:
Sep 12 14:49:21 slocum upsmon[78899]: upsmon parent: read Sep 12 14:49:21 slocum upsmon[99064]: Login on UPS [heartbeat] failed - got [ERR ACCESS-DENIED] Sep 12 14:58:10 slocum upsmon[99064]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress Sep 12 14:58:10 slocum upsmon[99064]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED Sep 12 14:58:10 slocum upsmon[99064]: Executing automatic power-fail shutdown Sep 12 14:58:10 slocum upsmon[99064]: Auto logout and shutdown proceeding Sep 12 14:58:16 slocum shutdown[17983]: power-down by dan: Sep 12 14:58:16 slocum root[18041]: jail Sep 12 14:58:16 slocum bacula-fd[62870]: Shutting down Bacula service: slocum-fd ... Sep 12 14:58:16 slocum kernel: . Sep 12 14:59:46 slocum rc.shutdown[22960]: 90 second watchdog timeout expired. Shutdown terminated. Sep 12 14:59:46 slocum init[1]: /etc/rc.shutdown terminated abnormally, going to single user mode Sep 12 14:59:46 slocum syslogd: exiting on signal 15 Sep 12 15:13:23 slocum syslogd: kernel boot file is /boot/kernel/kernel
On the r720:
Sep 12 14:49:21 r720-01 upsmon[69359]: upsmon parent: read Sep 12 14:49:21 r720-01 upsmon[3348]: Login on UPS [heartbeat] failed - got [ERR ACCESS-DENIED] Sep 12 14:58:10 r720-01 upsmon[3348]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress Sep 12 14:58:10 r720-01 upsmon[3348]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED Sep 12 14:58:10 r720-01 upsmon[3348]: Executing automatic power-fail shutdown Sep 12 14:58:10 r720-01 upsmon[3348]: Auto logout and shutdown proceeding Sep 12 14:58:15 r720-01 shutdown[8190]: power-down by dan: Sep 12 14:58:16 r720-01 bacula-fd[12842]: Shutting down Bacula service: r720-01-fd ... Sep 12 14:58:23 r720-01 kernel: vnet0.3: link state changed to DOWN Sep 12 14:58:23 r720-01 kernel: epair0b: link state changed to DOWN Sep 12 14:58:24 r720-01 kernel: in6_purgeaddr: err=65, destination address delete failed Sep 12 14:58:27 r720-01 kernel: vnet0.4: link state changed to DOWN Sep 12 14:58:27 r720-01 kernel: epair0b: link state changed to DOWN Sep 12 14:58:28 r720-01 kernel: in6_purgeaddr: err=65, destination address delete failed Sep 12 14:58:39 r720-01 kernel: vnet0.5: link state changed to DOWN Sep 12 14:58:39 r720-01 kernel: epair0b: link state changed to DOWN Sep 12 14:58:40 r720-01 kernel: ix0: link state changed to DOWN Sep 12 14:58:40 r720-01 kernel: WARNING: Intel (R) Network Connections are quality tested using Intel (R) Ethernet Optics. Using untested modules is not supported and may cause unstable operation or damage to the module or the adapter. Intel Corporation is not responsible for any harm caused by using untested modules. Sep 12 14:58:40 r720-01 kernel: in6_purgeaddr: err=65, destination address delete failed Sep 12 14:58:40 r720-01 kernel: WARNING: Intel (R) Network Connections are quality tested using Intel (R) Ethernet Optics. Using untested modules is not supported and may cause unstable operation or damage to the module or the adapter. Intel Corporation is not responsible for any harm caused by using untested modules. Sep 12 14:58:40 r720-01 syslogd: last message repeated 1 times Sep 12 14:58:41 r720-01 kernel: ix0: link state changed to UP Sep 12 14:58:53 r720-01 kernel: , 2433. Sep 12 14:58:53 r720-01 nrpe[2417]: Caught SIGTERM - shutting down... Sep 12 14:58:53 r720-01 nrpe[2417]: Daemon shutdown Sep 12 14:58:54 r720-01 kernel: , 2417. Sep 12 14:58:54 r720-01 kernel: . Sep 12 14:58:54 r720-01 upsd[69306]: mainloop: Interrupted system call Sep 12 14:58:54 r720-01 ntpd[2276]: ntpd exiting on signal 15 (Terminated) Sep 12 14:58:56 r720-01 kernel: , 2276. Sep 12 14:58:57 r720-01 kernel: , 2221. Sep 12 14:58:58 r720-01 kernel: , 2218. Sep 12 14:58:59 r720-01 kernel: , 2215. Sep 12 14:59:00 r720-01 kernel: , 2208. Sep 12 14:59:00 r720-01 syslogd: exiting on signal 15 Sep 12 15:45:53 r720-01 syslogd: kernel boot file is /boot/kernel/kernel
Yeah, I’m using FiberStore optics in there. They work fine, apart from those message. My post to the FreeBSD network mailing list confirms this is expected.
Test 1 power up
I powered up the UPS, the rack came back to life, but in the wrong order.
I missed a step. I needed to alter what was plugged into what outlet on the PDU. I had servers screaming because they’re on only one PSU. ;)
Once they stopped screaming, I moved the cables.
Both knew and r720-01 needed power-on set in BIOS. They were both last-state.
Configuration 1 changes
After what seemed like a 2 minute delay on the UPS powering off, I found a setting in its GUI which showed me why.
Changed:
Remaining time 120->300 Capacity under: 0 -> 20 Shutdown Duration: 120->240 Restart if capacity exceed: 0->30%
I removed the FINALDELAY from upsmon.conf on the nut primary.
This is moving the UPS power off logic into the UPS and away from nut. Based on logs, nut wasn’t doing what I expected. FINALDELAY didn’t seem to be observed. I am sure FINALDEALY is being observed properly, but I did not see it doing what I expected. Let’s try this other approach first. NOTE: it worked just fine.
Test 2 power down
It went as expected. Nothing much different.
Test 2 power up
That worked well. One of the PDU outlets is misconfigured.
Turns out I did not save my changes, as evidenced by PENDING in this screen shot I took:
Configuration 2 changes
I’ve tweaked the UPS shutdown time from 240 to 210 seconds.
Test 3 power down
This is the time to power off:
- 1:23 knew
- 2:18 slocum
- 2:42 r720-01
- 3:57 UPS
I have about 1:15 extra there, but I am hesitant to trim the UPS shutdown any further than that.
It takes 3 minutes to shutdown all servers staring at 5 minutes runtime left on UPS
Test 3 power up
For this power up test, once the UPS was shutdown, I disconnected it from mains, then reconnected it.
The front panel came out, but I still had to press POWER ON for the power to be delivered and the rack begin power on.
I wonder if I have a setting wrong somewhere. NOTE: the setting I want is Auto restart as found on page 11 of the 5PX_UPS_Manual_North_America.pdf manual.
knew powered up about 2 minutes after r720-01, as designed.
slocum powered up about 30 seconds later.
I was watching the r720-01 screen. The following is some of the timings I recorded:
- 3:35 #FreeBSD splash screen appears
- 5:00 jails are started
- 5:35 login prompt
- 5:42 knew responds to pings
- 5:55 slocum responds to pings
- 6:12 ssh into slocum & knew
- 13:30 – All jails are up and running
slocum takes the longest to start up. 34 jails running on 2x E5-2620v2 with 128GB RAM – hardware details.
The results
Everything powered on as expected. Good results.
My only remaining issue: will this UPS start up unattended when power returns?
Much later, I realized I missed the opportunity to find the answer from my previous tweet.
- do another power down test
- let the UPS power itself off
- turn off the breaker
- wait
- turn breaker back on
I find it amusing that it took me so long to come up with that. I also want to review the Auto restart setting via the front panel.
Sounds like a job for another day.
Another day
It is Sunday afternoon. I’m in the hallway outside the server room. The view looks like this.
Today I am going to trip the circuit breaker to the UPS and issue an FSD.
I have identified the breaker, tripped it, and verified the UPS goes on battery.
Test 4 power down
I turned off the breaker.
On the nut primary (my pfSense firewall), I issued this command:
/usr/local/sbin/upsmon -c fsd
These are the times I saw:
- 0:11 Notice of FSD
- 0:16 Kill message
- 1:58 knew powers off
- 2:38 slocum powers off
- 4:00 r720-01 powers off
- 4:00 ups offline
All good. I suspect r720-01 powered off earlier but I thought it was still powered up because the front display was active.
Test 4 power up
I turned on the breaker. The UPS started up right away.
These are the timings.
- 0:20 switches on
- 0:40 r720-01 power on
- 1:28 10G switch online
- 1:42 1G switch online
- 2:45 knew powers on
- 3:15 slocum powers on
- 4:24 r720-01 IPMI answers
- 6:25 knew answers pings
- 6:35 r720-01 finishes startup
- 6:45 slocum answers pings
- 7:05 sshd’d into knew & slouch
- 9:40 all knew jails started
- 14:30 all slocum jails started
- 22:45 UPS OK – Status=Online, Charging Utility=114.9V Batt=86.0% Load=49.0% Left=28.3min
- 24:40 All green on Nagios
Test ends.
Good. All good
Based on this, I’m confident the rack will properly shutdown when the power fails. Just as important, it will come back when the power returns.
Thank you for coming to my TED talk.
Now that I am not using a jail manager, I think my shutdown times will be shorter.
Dan, I think the article you created is very good.
A question.
I put you in situation:
If you order the servers to be turned off, 5 minutes before the UPS turns off so that everything is turned off, leaving only 1 minute for the UPS to turn off.
How do I turn on the servers if at that minute the UPS recovers power and the UPS never turns off but the servers are off?
The servers have never lost power and will not turn on.
Nice to comment with you
I don’t know. I suspect that now the UPS is committed to shutdown, it will. Then, it will see power back on and start up again. After getting to a predetermined battery charge level, it will supply power to the servers.
Much of my concept assumes how the UPS behaves.
For a more information answer, please ask here: https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/nut-upsuser
Did you find out the answer?
Hi, Dan!
First, I would like to thank you for your website. It helped me a lot to understand better how to set and test my NUT environment. I have some proxmox servers on my UPS, so it will be interesting to see how the servers will work with upsmon -c fsd haha
Congrats for this great job!
I’m glad it helped you.
My next goal: modify the shutdown so the two main servers shutdown first, then leave the network running so I have wifi.