Sep 10 2020
 

Following on from my recent nut setup, this is the second in a series of three posts.

The next post will deal with adjusting startup and shutdown times to be sure everything proceeds as required.

I want to test the host shutdown mechanism without:

  • unplugging the UPS from the mains
  • powering off the UPS
  • powering off the servers

I have just updated all the hosts and the jails on those hosts. This includes both patching the OS for recent vulnerabilities (freebsd-update fetch install) and updating all installed packages (pkg upgrade). I almost always do both before rebooting the servers.

To simulate a power outage, I’m using the Testing shutdowns section of the documentation. According to man upsdrvctl:

-t Enable testing mode. This also enables debug mode. Testing mode makes upsdrvctl display the actions it would execute without actually doing them. Use this to test out your configuration without actually doing anything to your UPS drivers. This may be helpful when defining the sdorder directive in your ups.conf(5).

Let’s try that out. Sure, I’m not at home, Indian food is about to be delivered, and what could possibly go wrong?

This is on one of the secondaries:

[dan@slocum:/usr/local/etc/nut] $ sudo /usr/local/sbin/upsdrvctl -t shutdown
Network UPS Tools - UPS driver controller 2.7.4
*** Testing mode: not calling exec/kill
   0.000000	
If you're not a NUT core developer, chances are that you're told to enable debugging
to see why a driver isn't working for you. We're sorry for the confusion, but this is
the 'upsdrvctl' wrapper, not the driver you're interested in.

Below you'll find one or more lines starting with 'exec:' followed by an absolute
path to the driver binary and some command line option. This is what the driver
starts and you need to copy and paste that line and append the debug flags to that
line (less the 'exec:' prefix).

   0.000155	Shutdown UPS: ups02
   0.000166	exec:  /usr/local/libexec/nut/dummy-ups -a ups02 -k
   0.000171	Shutdown UPS: heartbeat
   0.000178	exec:  /usr/local/libexec/nut/dummy-ups -a heartbeat -k
[dan@slocum:/usr/local/etc/nut] $ 

Seems OK. Let’s try the nut primary:

[2.4.5-RELEASE][admin@bast.int.unixathome.org]/root: sudo /usr/local/sbin/upsdrvctl -t shutdown
Network UPS Tools - UPS driver controller 2.7.4
*** Testing mode: not calling exec/kill
   0.000000	
If you're not a NUT core developer, chances are that you're told to enable debugging
to see why a driver isn't working for you. We're sorry for the confusion, but this is
the 'upsdrvctl' wrapper, not the driver you're interested in.

Below you'll find one or more lines starting with 'exec:' followed by an absolute
path to the driver binary and some command line option. This is what the driver
starts and you need to copy and paste that line and append the debug flags to that
line (less the 'exec:' prefix).

   0.000352	Shutdown UPS: ups02
   0.000380	exec:  /usr/local/libexec/nut/usbhid-ups -a ups02 -k
   0.000395	Shutdown UPS: heartbeat
   0.000414	exec:  /usr/local/libexec/nut/dummy-ups -a heartbeat -k
[2.4.5-RELEASE][admin@bast.int.unixathome.org]/root: 

Hmm, that seems OK too.
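
The banner in that output says to copy the exec: line, drop the prefix, and append debug flags. As a rough sketch of pulling out just those driver commands: the function name driver_commands is made up for illustration, and the 2>&1 in the usage comment assumes the debug lines are written to stderr.

```shell
#!/bin/sh
# Print only the driver command lines from `upsdrvctl -t shutdown` output.
# driver_commands is a hypothetical helper name, not part of nut.
driver_commands() {
    # Match lines of the form "<timestamp><whitespace>exec:  <command>"
    # and strip everything up to and including the "exec:" prefix.
    # The banner text also mentions 'exec:' mid-sentence; anchoring the
    # pattern at the start of the line keeps those lines out.
    sed -n 's/^[[:space:]]*[0-9.]*[[:space:]]*exec:[[:space:]]*//p'
}

# Usage (assumption: the debug output goes to stderr):
#   sudo /usr/local/sbin/upsdrvctl -t shutdown 2>&1 | driver_commands
```

Fed the output from the primary above, this would leave only the usbhid-ups and dummy-ups command lines, ready to have debug flags appended.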

Testing with actual results

For this, I’m changing the power off process so the servers reboot, because I’m not at home and I do want a reboot here, after those upgrades.

Before:

[dan@slocum:/usr/local/etc/nut] $ sudo grep shutdown *.conf
upsmon.conf:SHUTDOWNCMD "/sbin/shutdown -p +0"

For testing, I’m changing that to -r (reboot) instead of -p (power off), which is exactly what I want after all those recent upgrades.

After:

[dan@slocum:/usr/local/etc/nut] $ sudo grep shutdown *.conf
upsmon.conf:SHUTDOWNCMD "/sbin/shutdown -r +0"

After making that change, I restarted the services to be sure it was picked up and used:

sudo service nut stop
sudo service nut_upsmon stop
sudo service nut start
sudo service nut_upsmon start
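
With several hosts to edit, it is easy to leave one on -p. A minimal sketch of a check, assuming SHUTDOWNCMD is written on one line with a double-quoted value as in the grep output above; shutdown_mode is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether SHUTDOWNCMD in an upsmon.conf reboots or powers off.
# shutdown_mode is a hypothetical helper name, not part of nut.
shutdown_mode() {
    # $1: path to upsmon.conf
    # Take the quoted value of the last uncommented SHUTDOWNCMD line.
    cmd=$(sed -n 's/^[[:space:]]*SHUTDOWNCMD[[:space:]]*"\(.*\)".*$/\1/p' "$1" | tail -n 1)
    # Pad with spaces so the flags match as whole words.
    case " $cmd " in
        *" -r "*) echo "reboot" ;;
        *" -p "*) echo "poweroff" ;;
        *)        echo "unknown" ;;
    esac
}

# Usage (may need sudo if the file is not world-readable):
#   shutdown_mode /usr/local/etc/nut/upsmon.conf
```

This only inspects the file on disk. It does not confirm that the running upsmon has re-read it, which is what the restart above is for.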

On the nut primary, I will use this command: /usr/local/sbin/upsmon -c fsd

What will that do? From man upsmon:

-c command
Send the command command to the existing upsmon process. Valid commands are:

fsd
shutdown all master UPSes (use with caution)

All without actually touching the UPS. Let’s try this before dinner arrives:

[2.4.5-RELEASE][admin@bast.int.unixathome.org]/usr/local/etc/nut: /usr/local/sbin/upsmon -c fsd 
Network UPS Tools upsmon 2.7.4
                                                                               
Broadcast Message from root@bast.int.unixathome.org                            
        (no tty) at 23:38 UTC...                                               
                                                                               
Executing automatic power-fail shutdown                                        

                                                                               
                                                                               
Broadcast Message from root@bast.int.unixathome.org                            
        (no tty) at 23:38 UTC...                                               
                                                                               
Auto logout and shutdown proceeding                                            
                                                                               
                                                                               
*** FINAL System shutdown message from root@bast.int.unixathome.org ***      

System going down IMMEDIATELY                                                  

                                                                               
                                                                               
Broadcast Message from root@bast.int.unixathome.org                            
        (/dev/console) at 23:38 UTC...                                         
                                                                               
NUT killing power...                                                           

On one of the hosts, I saw:

Sep  5 23:38:31 slocum upsmon[19470]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress
Sep  5 23:38:31 slocum upsmon[19470]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Sep  5 23:38:31 slocum upsmon[19470]: Executing automatic power-fail shutdown
Sep  5 23:38:31 slocum upsmon[19470]: Auto logout and shutdown proceeding
Sep  5 23:38:36 slocum shutdown[61799]: reboot by dan: 
Sep  5 23:38:36 slocum root[61858]: jail
Sep  5 23:38:36 slocum bacula-fd[61571]: Shutting down Bacula service: slocum-fd ...
Sep  5 23:38:36 slocum kernel: .

Oh, there’s the text message saying the food is on the way!

About 5 minutes later, I could log back into my VPN:

[2.4.5-RELEASE][admin@bast.int.unixathome.org]/root: uptime
11:42PM  up 2 mins, 3 users, load averages: 0.53, 0.19, 0.07

Now I’m waiting for the other servers to come back.

knew is back up. slocum is back.

It takes longer for the R720 to reboot, because it’s complicated…

Oh, there’s the food delivery arriving!

And there’s the pings, and I can log in. Success!

I’ll check Nagios later.

Lovely dinner. All green on Nagios.

Oh wait:

UPS CRITICAL - Status=Off Utility=118.1V Batt=100.0% Load=0.0% Left=157.5min 

I found an interesting situation. The UPS is off, but answering.

$ upsc ups02 ups.status
OL OFF

I can run a command and power it on. Let’s see:

[2.4.5-RELEASE][admin@bast.int.unixathome.org]/usr/local/etc/nut: upscmd -l ups02
Instant commands supported on UPS [ups02]:

beeper.disable - Disable the UPS beeper
beeper.enable - Enable the UPS beeper
beeper.mute - Temporarily mute the UPS beeper
beeper.off - Obsolete (use beeper.disable or beeper.mute)
beeper.on - Obsolete (use beeper.enable)
load.off - Turn off the load immediately
load.off.delay - Turn off the load with a delay (seconds)
load.on - Turn on the load immediately
load.on.delay - Turn on the load with a delay (seconds)
outlet.1.load.off - Turn off the load on outlet 1 immediately
outlet.1.load.on - Turn on the load on outlet 1 immediately
outlet.2.load.off - Turn off the load on outlet 2 immediately
outlet.2.load.on - Turn on the load on outlet 2 immediately
shutdown.return - Turn off the load and return when power is back
shutdown.stayoff - Turn off the load and remain off
shutdown.stop - Stop a shutdown in progress
test.battery.start.deep - Start a deep battery test
test.battery.start.quick - Start a quick battery test
test.battery.stop - Stop the battery test

[2.4.5-RELEASE][admin@bast.int.unixathome.org]/usr/local/etc/nut: upscmd ups02 load.on
Username (root): dvl
Password: 
OK
[2.4.5-RELEASE][admin@bast.int.unixathome.org]/usr/local/etc/nut: upsc ups02 ups.status
OL

See my previous post for where the dvl credentials are used.
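
The “OL OFF” status is easy to miss because the UPS still answers queries. A small sketch of a test for it, assuming ups.status is a space-separated list of tokens as shown above; ups_is_off is a made-up helper name.

```shell
#!/bin/sh
# Succeed (exit 0) when a ups.status string contains the OFF token.
# ups_is_off is a hypothetical helper name, not part of nut.
ups_is_off() {
    # $1: the value printed by `upsc <ups> ups.status`, e.g. "OL OFF"
    # Pad with spaces so OFF only matches as a whole token.
    case " $1 " in
        *" OFF "*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Usage on a host that can reach upsd:
#   if ups_is_off "$(upsc ups02 ups.status)"; then
#       echo "ups02 output is off; consider: upscmd ups02 load.on"
#   fi
```

The sketch only reports the condition. Whether to send load.on automatically is a separate decision, since that command needs the upsd credentials.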

Shutdown overview

This is from the Shutdown design section of the nut Configuration notes.

I am reproducing that content here and adding my notes in italics.

The original content (perhaps slightly modified) is in bold.

Shutdown design

When your UPS batteries get low, the operating system needs to be brought down cleanly. Also, the UPS load should be turned off so that all devices that are attached to it are forcibly rebooted.

Here are the steps that occur when a critical power event happens:

  1. The UPS goes on battery
  2. The UPS reaches low battery (a “critical” UPS), that is to say upsc displays:
    ups.status: OB LB

    The exact behavior depends on the specific device, and is related to:

    • battery.charge and battery.charge.low
    • battery.runtime and battery.runtime.low

    My data points are:

    # this is %
    [dan@knew:~] $ upsc ups02 battery.charge
    97
    [dan@knew:~] $ upsc ups02 battery.charge.low
    20
    
    # this is seconds 
    [dan@knew:~] $ upsc ups02 battery.runtime
    1816
    [dan@knew:~] $ upsc ups02 battery.runtime.low
    Error: Variable not supported by UPS
    [dan@knew:~] $

    That is a run time of about 30 minutes.

  3. The upsmon master notices and sets “FSD” – the “forced shutdown” flag to tell all secondary systems that it will soon power down the load.

    (If you have no secondaries, skip to step 6)

  4. upsmon secondary systems see “FSD” and:
    • generate a NOTIFY_SHUTDOWN event
    • wait FINALDELAY seconds – typically 5
    • call their SHUTDOWNCMD
    • disconnect from upsd

    This is good enough for all my secondary hosts

  5. The upsmon master system waits up to HOSTSYNC seconds (typically 15) for the secondaries to disconnect from upsd. If any are connected after this time, upsmon stops waiting and proceeds with the shutdown process.

    This is good enough for all my secondary hosts

  6. The upsmon master:
    • generates a NOTIFY_SHUTDOWN event
    • waits FINALDELAY seconds – typically 5
    • creates the POWERDOWNFLAG file – usually /etc/killpower
    • calls the SHUTDOWNCMD

    I think FINALDELAY needs to be 210 seconds for my upsmon master

    These are my server power-off times (from shutdown -p now to power off):

    • knew – 1:53
    • slouch – 2:25
    • r720-01 – 3:11

    The upsmon master must wait at least 3:45 before telling the UPS it can power off.

    If your FreeBSD rc.d system does not complete rc.shutdown in 90 seconds (default value), the shutdown gets interrupted.

    From /etc/defaults/rc.conf: rcshutdown_timeout="90"

    This is what to look for in /var/log/messages:

    init[1]: /etc/rc.shutdown terminated abnormally, going to single user mode

    Only one of my hosts (r720-01) has been found to exceed the rcshutdown_timeout limit. For that host, I set:

    [dan@r720-01:~] $ grep shut /etc/rc.conf
    rcshutdown_timeout="120"

  7. On most systems, init takes over, kills your processes, syncs and unmounts some filesystems, and remounts some read-only.
  8. init then runs your shutdown script. This checks for the POWERDOWNFLAG, finds it, and tells the UPS driver(s) to power off the load.

    This is built into pfSense, which runs my upsmon master – see my tweets where I discovered pfSense shutdown.

    For other systems, you could add similar code to rc.shutdown. Look for “Insert other shutdown procedures here”.

  9. The system loses power.
  10. Time passes. The power returns, and the UPS switches back on.
  11. All systems reboot and go back to work.
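
The FINALDELAY reasoning in the list above comes down to comparing the slowest host's power-off time with the delay on the upsmon master. A minimal sketch of that arithmetic, using the three times quoted in the list; to_seconds and longest_seconds are hypothetical helper names, and the 210 is the tentative FINALDELAY mentioned there, not a recommended value.

```shell
#!/bin/sh
# Convert M:SS to seconds. to_seconds is a hypothetical helper name.
to_seconds() {
    min=${1%%:*}
    sec=${1##*:}
    # Strip one leading zero so values such as "08" are not read as octal.
    sec=${sec#0}
    echo $(( min * 60 + ${sec:-0} ))
}

# Print the longest of the given M:SS durations, in seconds.
longest_seconds() {
    max=0
    for t in "$@"; do
        s=$(to_seconds "$t")
        if [ "$s" -gt "$max" ]; then
            max=$s
        fi
    done
    echo "$max"
}

# Power-off times for the three hosts, as measured in the post.
longest=$(longest_seconds 1:53 2:25 3:11)
echo "longest shutdown: ${longest}s"
echo "margin with FINALDELAY=210: $(( 210 - longest ))s"
echo "3:45 in seconds: $(to_seconds 3:45)"
```

The slowest host takes 191 seconds, so a FINALDELAY of 210 leaves 19 seconds of slack, while waiting the full 3:45 would mean 225 seconds. The same conversion in reverse explains the run time note: 1816 seconds of battery.runtime is just over 30 minutes.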

But wait, there’s more

My next project, this coming weekend: make sure the shutdown and power up processes both go smoothly.
