Following on from my recent nut setup, this is the second in a series of three posts.
The next post will deal with adjusting startup and shutdown times to be sure everything proceeds as required.
I want to test the host shutdown mechanism without:
- unplugging the UPS from the mains
- powering off the UPS
- powering off the servers
I have just updated all the hosts and the jails on those hosts. This includes both patching the OS for recent vulnerabilities (freebsd-update fetch install) and updating all installed packages (pkg upgrade). I almost always follow that practice when rebooting the servers.
From man upsdrvctl:
-t Enable testing mode. This also enables debug mode. Testing mode makes upsdrvctl display the actions it would execute without actually doing them. Use this to test out your configuration without actually doing anything to your UPS drivers. This may be helpful when defining the sdorder directive in your ups.conf(5).
Let’s try that out. Sure, I’m not at home, Indian food is about to be delivered, and what could possibly go wrong?
This is on one of the secondaries:
[dan@slocum:/usr/local/etc/nut] $ sudo /usr/local/sbin/upsdrvctl -t shutdown
Network UPS Tools - UPS driver controller 2.7.4
*** Testing mode: not calling exec/kill
0.000000
If you're not a NUT core developer, chances are that you're told to enable debugging to see why a driver isn't working for you. We're sorry for the confusion, but this is the 'upsdrvctl' wrapper, not the driver you're interested in.
Below you'll find one or more lines starting with 'exec:' followed by an absolute path to the driver binary and some command line option. This is what the driver starts and you need to copy and paste that line and append the debug flags to that line (less the 'exec:' prefix).
0.000155 Shutdown UPS: ups02
0.000166 exec: /usr/local/libexec/nut/dummy-ups -a ups02 -k
0.000171 Shutdown UPS: heartbeat
0.000178 exec: /usr/local/libexec/nut/dummy-ups -a heartbeat -k
[dan@slocum:/usr/local/etc/nut] $
Seems OK. Let’s try the nut primary:
[2.4.5-RELEASE][email@example.com]/root: sudo /usr/local/sbin/upsdrvctl -t shutdown
Network UPS Tools - UPS driver controller 2.7.4
*** Testing mode: not calling exec/kill
0.000000
If you're not a NUT core developer, chances are that you're told to enable debugging to see why a driver isn't working for you. We're sorry for the confusion, but this is the 'upsdrvctl' wrapper, not the driver you're interested in.
Below you'll find one or more lines starting with 'exec:' followed by an absolute path to the driver binary and some command line option. This is what the driver starts and you need to copy and paste that line and append the debug flags to that line (less the 'exec:' prefix).
0.000352 Shutdown UPS: ups02
0.000380 exec: /usr/local/libexec/nut/usbhid-ups -a ups02 -k
0.000395 Shutdown UPS: heartbeat
0.000414 exec: /usr/local/libexec/nut/dummy-ups -a heartbeat -k
[2.4.5-RELEASE][firstname.lastname@example.org]/root:
Hmm, that seems OK too.
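The upsdrvctl message above suggests copying the 'exec:' lines (less the prefix) when you want to run a driver by hand. Here is a minimal sketch of pulling those lines out of the test output. The function name extract_exec_lines is my own invention, and I am assuming the output format shown above (timestamp, then 'exec:', then the command).

```shell
#!/bin/sh
# Print the driver command lines that "upsdrvctl -t" says it would run.
# Reads upsdrvctl output on stdin and strips the leading timestamp and
# the "exec:" prefix. extract_exec_lines is a hypothetical helper name,
# not part of NUT.
extract_exec_lines() {
    sed -n 's/^[[:space:]]*[0-9.]*[[:space:]]*exec:[[:space:]]*//p'
}

# Usage on a NUT host (2>&1 in case the debug lines go to stderr):
#   sudo /usr/local/sbin/upsdrvctl -t shutdown 2>&1 | extract_exec_lines
```

Lines without 'exec:' are dropped, so only the commands themselves are printed.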
Testing with actual results
For this, I’m changing the power off process so the servers reboot, because I’m not at home and I do want a reboot here, after those upgrades.
[dan@slocum:/usr/local/etc/nut] $ sudo grep shutdown *.conf
upsmon.conf:SHUTDOWNCMD "/sbin/shutdown -p +0"
For testing, from here, I’m going to let that be -r (reboot), not -p (power off), because that is exactly what I want to do after all those recent upgrades.
[dan@slocum:/usr/local/etc/nut] $ sudo grep shutdown *.conf
upsmon.conf:SHUTDOWNCMD "/sbin/shutdown -r +0"
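I made that edit by hand, but with several hosts to change it could be scripted. This is only a sketch: set_shutdown_reboot is a made-up helper name, and it assumes the SHUTDOWNCMD line looks exactly like the one shown above. It writes through a temporary file rather than using sed -i, because the -i syntax differs between BSD and GNU sed.

```shell
#!/bin/sh
# Rewrite the SHUTDOWNCMD line of an upsmon.conf-style file so that
# "shutdown -p" (power off) becomes "shutdown -r" (reboot).
# set_shutdown_reboot is a hypothetical helper, not part of NUT.
set_shutdown_reboot() {
    conf=$1
    tmp=$(mktemp "${TMPDIR:-/tmp}/upsmon.conf.XXXXXX") || return 1
    # Only touch lines that start with SHUTDOWNCMD.
    sed '/^SHUTDOWNCMD/s|/sbin/shutdown -p |/sbin/shutdown -r |' "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"   # cat keeps the original file's owner and mode
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage (as root, on each host):
#   set_shutdown_reboot /usr/local/etc/nut/upsmon.conf
```

Remember to put -p back once the testing is done.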
After making that change, I wanted to be sure it was picked up and used:
sudo service nut stop
sudo service nut_upsmon stop
sudo service nut start
sudo service nut_upsmon start
On the nut primary, I will use this command: /usr/local/sbin/upsmon -c fsd
What will that do? From man upsmon:
-c command
Send the command command to the existing upsmon process. Valid commands are:
fsd - shutdown all master UPSes (use with caution)
All without actually touching the UPS. Let’s try this before dinner arrives:
[2.4.5-RELEASE][email@example.com]/usr/local/etc/nut: /usr/local/sbin/upsmon -c fsd
Network UPS Tools upsmon 2.7.4
Broadcast Message from firstname.lastname@example.org (no tty) at 23:38 UTC...
Executing automatic power-fail shutdown
Broadcast Message from email@example.com (no tty) at 23:38 UTC...
Auto logout and shutdown proceeding
*** FINAL System shutdown message from firstname.lastname@example.org ***
System going down IMMEDIATELY
Broadcast Message from email@example.com (/dev/console) at 23:38 UTC...
NUT killing power...
On one of the hosts, I saw:
Sep 5 23:38:31 slocum upsmon: UPS firstname.lastname@example.org: forced shutdown in progress
Sep 5 23:38:31 slocum upsmon: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Sep 5 23:38:31 slocum upsmon: Executing automatic power-fail shutdown
Sep 5 23:38:31 slocum upsmon: Auto logout and shutdown proceeding
Sep 5 23:38:36 slocum shutdown: reboot by dan:
Sep 5 23:38:36 slocum root: jail
Sep 5 23:38:36 slocum bacula-fd: Shutting down Bacula service: slocum-fd ...
Sep 5 23:38:36 slocum kernel: .
Oh, there’s the text message saying the food is on the way!
About 5 minutes later, I could log back into my VPN:
[2.4.5-RELEASE][email@example.com]/root: uptime
11:42PM up 2 mins, 3 users, load averages: 0.53, 0.19, 0.07
Now I’m waiting for the other servers to come back.
knew is back up. slocum is back.
It takes longer for the R720 to reboot, because it’s complicated…
Oh, there’s the food delivery arriving!
And there’s the pings, and I can log in. Success!
I’ll check Nagios later.
Lovely dinner. All green on Nagios.
UPS CRITICAL - Status=Off Utility=118.1V Batt=100.0% Load=0.0% Left=157.5min
I found an interesting situation. The UPS is off, but answering.
$ upsc ups02 ups.status
OL OFF
I can run a command and power it on. Let’s see:
[2.4.5-RELEASE][firstname.lastname@example.org]/usr/local/etc/nut: upscmd -l ups02
Instant commands supported on UPS [ups02]:
beeper.disable - Disable the UPS beeper
beeper.enable - Enable the UPS beeper
beeper.mute - Temporarily mute the UPS beeper
beeper.off - Obsolete (use beeper.disable or beeper.mute)
beeper.on - Obsolete (use beeper.enable)
load.off - Turn off the load immediately
load.off.delay - Turn off the load with a delay (seconds)
load.on - Turn on the load immediately
load.on.delay - Turn on the load with a delay (seconds)
outlet.1.load.off - Turn off the load on outlet 1 immediately
outlet.1.load.on - Turn on the load on outlet 1 immediately
outlet.2.load.off - Turn off the load on outlet 2 immediately
outlet.2.load.on - Turn on the load on outlet 2 immediately
shutdown.return - Turn off the load and return when power is back
shutdown.stayoff - Turn off the load and remain off
shutdown.stop - Stop a shutdown in progress
test.battery.start.deep - Start a deep battery test
test.battery.start.quick - Start a quick battery test
test.battery.stop - Stop the battery test
[2.4.5-RELEASE][email@example.com]/usr/local/etc/nut: upscmd ups02 load.on
Username (root): dvl
Password:
OK
[2.4.5-RELEASE][firstname.lastname@example.org]/usr/local/etc/nut: upsc ups02 ups.status
OL
See my previous post for where the dvl credentials are used.
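If this happens again, the check for the OFF flag could be scripted. A rough sketch follows; has_status_flag is a name I made up, and it only assumes that ups.status is a space-separated list of flags, as in the OL OFF output above.

```shell
#!/bin/sh
# Succeed if a NUT ups.status string (e.g. "OL OFF") contains the flag.
# has_status_flag is a hypothetical helper name, not part of NUT.
has_status_flag() {
    status=$1
    flag=$2
    # Pad with spaces so "OFF" cannot match part of a longer flag.
    case " $status " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a NUT host (upscmd will prompt for the credentials):
#   if has_status_flag "$(upsc ups02 ups.status)" OFF; then
#       upscmd ups02 load.on
#   fi
```

Matching on whole flags keeps the test from being fooled by a status that merely contains the same letters.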
This is from the Shutdown design section of the nut Configuration notes.
I am reproducing that content here and adding my notes in italics.
The original content (perhaps slightly modified) is in bold.
When your UPS batteries get low, the operating system needs to be brought down cleanly. Also, the UPS load should be turned off so that all devices that are attached to it are forcibly rebooted.
Here are the steps that occur when a critical power event happens:
- The UPS goes on battery
- The UPS reaches low battery (a “critical” UPS), that is to say upsc displays:
ups.status: OB LB
The exact behavior depends on the specific device, and is related to:
- battery.charge and battery.charge.low
- battery.runtime and battery.runtime.low
My data points are:
# this is %
[dan@knew:~] $ upsc ups02 battery.charge
97
[dan@knew:~] $ upsc ups02 battery.charge.low
20
# this is seconds
[dan@knew:~] $ upsc ups02 battery.runtime
1816
[dan@knew:~] $ upsc ups02 battery.runtime.low
Error: Variable not supported by UPS
[dan@knew:~] $
- The upsmon master notices and sets “FSD” – the “forced shutdown” flag to tell all secondary systems that it will soon power down the load.
(If you have no secondaries, skip to step 6)
- upsmon secondary systems see “FSD” and:
- generate a NOTIFY_SHUTDOWN event
- wait FINALDELAY seconds – typically 5
- call their SHUTDOWNCMD
- disconnect from upsd
This is good enough for all my secondary hosts.
- The upsmon master system waits up to HOSTSYNC seconds (typically 15) for the secondaries to disconnect from upsd. If any are connected after this time, upsmon stops waiting and proceeds with the shutdown process.
- The upsmon master:
- generates a NOTIFY_SHUTDOWN event
- waits FINALDELAY seconds – typically 5
- creates the POWERDOWNFLAG file – usually /etc/killpower
- calls the SHUTDOWNCMD
I think FINALDELAY needs to be 210 seconds for my upsmon master.
These are my server power-off times (from shutdown -p now to power off):
- knew – 1:53
- slocum – 2:25
- r720-01 – 3:11
The upsmon master must wait at least 3:45 before telling the UPS it can power off.
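To compare those times against a value given in seconds, it helps to convert them. This is a small sketch of that arithmetic; to_seconds and longest_seconds are hypothetical helper names. The longest time, 3:11, works out to 191 seconds, and 3:45 is 225 seconds.

```shell
#!/bin/sh
# Convert an m:ss duration (as listed above) to seconds.
# to_seconds and longest_seconds are hypothetical helper names.
to_seconds() {
    min=${1%%:*}
    sec=${1##*:}
    # Strip one leading zero so a value like "08" is not read as octal.
    sec=${sec#0}
    echo $(( min * 60 + ${sec:-0} ))
}

# Print the largest of the given m:ss durations, in seconds.
longest_seconds() {
    max=0
    for t in "$@"; do
        s=$(to_seconds "$t")
        if [ "$s" -gt "$max" ]; then
            max=$s
        fi
    done
    echo "$max"
}

longest_seconds 1:53 2:25 3:11   # prints 191
```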
If your FreeBSD rc.d system does not complete rc.shutdown within 90 seconds (the default), the shutdown gets interrupted.
From /etc/defaults/rc.conf: rcshutdown_timeout="90"
This is what to look for in /var/log/messages:
init: /etc/rc.shutdown terminated abnormally, going to single user mode
Only one of my hosts (r720-01) has been found to exceed the rcshutdown_timeout limit. For that host, I set:
[dan@r720-01:~] $ grep shut /etc/rc.conf
rcshutdown_timeout="120"
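Checking each host for that log message could be scripted along these lines. This is a sketch only: rc_shutdown_interrupted is a made-up name, and the log file path is passed in rather than assumed.

```shell
#!/bin/sh
# Succeed if the given log file records rc.shutdown being cut short.
# rc_shutdown_interrupted is a hypothetical helper name.
rc_shutdown_interrupted() {
    grep -q 'rc.shutdown terminated abnormally' "$1"
}

# Usage on a FreeBSD host:
#   if rc_shutdown_interrupted /var/log/messages; then
#       echo "consider raising rcshutdown_timeout in /etc/rc.conf"
#   fi
```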
- On most systems, init takes over, kills your processes, syncs and unmounts some filesystems, and remounts some read-only.
- init then runs your shutdown script. This checks for the POWERDOWNFLAG, finds it, and tells the UPS driver(s) to power off the load.
For other systems, you could add similar code to rc.shutdown. Look for the comment "Insert other shutdown procedures here".
- The system loses power.
- Time passes. The power returns, and the UPS switches back on.
- All systems reboot and go back to work.
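The POWERDOWNFLAG check in the steps above can be sketched as follows. upsmon -K tests for the flag and upsdrvctl shutdown tells the driver(s) to power off the load. Everything else here is my assumption: the FreeBSD paths, the function name kill_power_if_flagged, and the two override variables, which exist only so the logic can be tried without touching a UPS.

```shell
#!/bin/sh
# Late-shutdown check: if upsmon left the POWERDOWNFLAG behind, ask the
# driver(s) to cut the load. Both commands can be overridden for a dry
# run; the default paths are assumptions for a FreeBSD package install.
: "${UPSMON_CHECK:=/usr/local/sbin/upsmon -K}"
: "${UPS_KILL:=/usr/local/sbin/upsdrvctl shutdown}"

# kill_power_if_flagged is a hypothetical helper name.
kill_power_if_flagged() {
    # upsmon -K exits 0 only when the flag file exists and is valid.
    if $UPSMON_CHECK; then
        echo "Killing the power"
        $UPS_KILL
    fi
}

# Dry run that never touches the UPS:
#   UPSMON_CHECK=true UPS_KILL="echo would run upsdrvctl shutdown"
#   kill_power_if_flagged
```

The function is only defined here, not called, so sourcing the file does nothing to the UPS.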
With battery.runtime at 1816 seconds, that is a run time of about 30 minutes.
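The conversion is plain integer division of the battery.runtime value, which upsc reports in seconds. A tiny sketch, with runtime_minutes being a name I made up:

```shell
#!/bin/sh
# Convert a battery.runtime value (seconds) to whole minutes.
# runtime_minutes is a hypothetical helper name.
runtime_minutes() {
    echo $(( $1 / 60 ))
}

runtime_minutes 1816   # prints 30

# On a NUT host, the live value could be fed in instead:
#   runtime_minutes "$(upsc ups02 battery.runtime)"
```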
But wait, there’s more
My next project, this coming weekend: make sure the shutdown and power up processes both go smoothly.