Sep 132020
 

Today I tested the shutdown but was mostly concerned by the startup. I wanted the hosts to start in a specific order. I lived tweeted it and that thread is the basis for this post.

The UPS has three groups of outlets: primary group, group 1, group 2. Each group has a On Delay setting. You could choose which gear to power on first. I have all mine set to zero, so all outlets power on when the UPS comes online.

I will be using my APC PDUs to control the power-on order.

PDU configuration

These PDUs are old. Both the ssh and the https interfaces use ciphers which modern clients refuse to use.

$ ssh -oKexAlgorithms=+diffie-hellman-group-exchange-sha1 -oCiphers=+aes256-cbc pdu1
no matching key exchange method found. Their offer: diffie-hellman-group1-sha1

$ ssh -oKexAlgorithms=+ diffie-hellman-group1-sha1 -oCiphers=+aes256-cbc pdu1
next:no matching cipher found. Their offer: 3des-cbc,blowfish-cbc

$ ssh  -oKexAlgorithms=+diffie-hellman-group1-sha1 -oCiphers=+3des-cbc pdu1
ssh_ssh_dispatch_run_fatal: Connection to 10.52.0.2 port 22: Invalid key length

This idea came from debdrup:

setup a jail with misc/compatNx for whichever major release has the older SSH client you need.

Longer term, I like that.

For today, it’s a serial cable.

My servers each have 2 power supplies. I’ll plug each one into a different PDU.

I’ll configuring the power-on delay of each outlet so that r710-01 powers on first, then knew, followed by slocum.

The firewall and switches are all connected to the same non-managed PDU at the top of the rack. It powers up first. The firewall will be operational before any other host.

BIOS – power up or last state

Most system BIOS allow you to configure what happens when power is supplied to the host. Last state and power on are two common settings. Last state means if it was off, leave it off. If it was on, boot. I prefer power on. It also allows the power cycling of a managed outlet and have the host power up.

upsmon.conf – FINALDELAY

During the test of the shutdown process, I determined that FINALDELAY needs to be 210 seconds to give the hosts enough time to shutdown before powering off the UPS. That settings goes into upsmon.conf. NOTE: I later remove this during my testing.

upsmon.conf – SHUDOWNCMD

I have changed this setting in /usr/local/etc/nut/upsmon.conf on each host:

old: SHUTDOWNCMD “/sbin/shutdown -h +0”

new: SHUTDOWNCMD “/sbin/shutdown -p now”

I notice now that on my pfSense box, it still uses the old command. It works fine. Given that this is a power saving issue, pfSense is not a power issue; it is a APU2 (photos).

I want the hosts to power off, not sit idle until the UPS shuts down. Depending upon the BIOS reconfiguration, I may have to revert. NOTE: revert was not done.

Test 1 power down

The command, as used in https://dan.langille.org/2020/09/10/nut-testing-the-shutdown-mechanism/…, to simulate battery is low, shutdown now is:

/usr/local/sbin/upsmon -c fsd

I issued that command on the pfSense box because that is the nut primary.

I timed the things I could see happening and this is the summary:

+ 8 s notices on hosts
+ 50 r720 connection drops
+ 25 knew connection drops
+ 18 slocum connection drops
+ 21 knew powers off
+ 55 slocum powers off
+ 8 r720 powers off
+ 35 bast starts shutdown
+ 20 UPS stops replying to ping

+120? lights power off on rack as UPS shuts down

From the UPS logs:

2020/09/12	15:02:02	Master system shutdown is activated

From pfSense (the nut primary):

Sep 12 14:58:06 bast upsmon[50050]: Signal 30: User requested FSD
Sep 12 14:58:06 bast upsd[50219]: Client local-monitor@::1 set FSD on UPS [ups02]
Sep 12 14:58:06 bast upsmon[50050]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Sep 12 14:58:14 bast sshd[60370]: Connection closed by 2001:470:8abf:7055:b6f9:d572:6622:ea2d port 15668 [preauth]
Sep 12 14:58:16 bast upsmon[50050]: Executing automatic power-fail shutdown
Sep 12 14:58:16 bast upsmon[50050]: Auto logout and shutdown proceeding
Sep 12 14:59:26 bast upsd[50219]: write() failed for 2001:470:8abf:7055:21b:21ff:fe39:a9a1: Connection reset by peer
Sep 12 14:59:26 bast upsd[50219]: write() failed for 2001:470:8abf:7055:21b:21ff:fe39:a9a1: Broken pipe
Sep 12 15:01:46 bast shutdown: halt by root: 
Sep 12 15:01:55 bast syslogd: exiting on signal 15

on knew:

Sep 12 14:58:10 knew upsmon[93843]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress
Sep 12 14:58:10 knew upsmon[93843]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Sep 12 14:58:10 knew upsmon[93843]: Executing automatic power-fail shutdown
Sep 12 14:58:10 knew upsmon[93843]: Auto logout and shutdown proceeding
Sep 12 14:58:15 knew shutdown[97618]: power-down by dan: 
Sep 12 14:58:15 knew bacula-fd[34545]: Shutting down Bacula service: knew-fd ...
Sep 12 14:58:43 knew kernel: pid 16101 (bacula-sd), jid 8, uid 910: exited on signal 10
Sep 12 14:59:22 knew kernel: vnet0.1: link state changed to DOWN
Sep 12 14:59:22 knew kernel: epair0b: link state changed to DOWN
Sep 12 14:59:23 knew kernel: ix0: link state changed to DOWN
Sep 12 14:59:23 knew kernel: ix0: link state changed to UP
Sep 12 14:59:24 knew nrpe[33429]: Caught SIGTERM - shutting down...
Sep 12 14:59:24 knew kernel: , 2285.
Sep 12 14:59:24 knew nrpe[33429]: Daemon shutdown
Sep 12 14:59:25 knew kernel: , 33429.
Sep 12 14:59:25 knew kernel: .
Sep 12 14:59:25 knew upsd[3403]: mainloop: Interrupted system call
Sep 12 14:59:25 knew ntpd[2149]: ntpd exiting on signal 15 (Terminated)
Sep 12 14:59:26 knew kernel: , 2149.
Sep 12 14:59:27 knew syslogd: exiting on signal 15

On slocum:

Sep 12 14:49:21 slocum upsmon[78899]: upsmon parent: read
Sep 12 14:49:21 slocum upsmon[99064]: Login on UPS [heartbeat] failed - got [ERR ACCESS-DENIED]
Sep 12 14:58:10 slocum upsmon[99064]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress
Sep 12 14:58:10 slocum upsmon[99064]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Sep 12 14:58:10 slocum upsmon[99064]: Executing automatic power-fail shutdown
Sep 12 14:58:10 slocum upsmon[99064]: Auto logout and shutdown proceeding
Sep 12 14:58:16 slocum shutdown[17983]: power-down by dan: 
Sep 12 14:58:16 slocum root[18041]: jail
Sep 12 14:58:16 slocum bacula-fd[62870]: Shutting down Bacula service: slocum-fd ...
Sep 12 14:58:16 slocum kernel: .
Sep 12 14:59:46 slocum rc.shutdown[22960]: 90 second watchdog timeout expired. Shutdown terminated.
Sep 12 14:59:46 slocum init[1]: /etc/rc.shutdown terminated abnormally, going to single user mode
Sep 12 14:59:46 slocum syslogd: exiting on signal 15
Sep 12 15:13:23 slocum syslogd: kernel boot file is /boot/kernel/kernel

On the r720:

Sep 12 14:49:21 r720-01 upsmon[69359]: upsmon parent: read
Sep 12 14:49:21 r720-01 upsmon[3348]: Login on UPS [heartbeat] failed - got [ERR ACCESS-DENIED]
Sep 12 14:58:10 r720-01 upsmon[3348]: UPS ups02@bast.int.unixathome.org: forced shutdown in progress
Sep 12 14:58:10 r720-01 upsmon[3348]: FSD set on UPS heartbeat failed: ERR ACCESS-DENIED
Sep 12 14:58:10 r720-01 upsmon[3348]: Executing automatic power-fail shutdown
Sep 12 14:58:10 r720-01 upsmon[3348]: Auto logout and shutdown proceeding
Sep 12 14:58:15 r720-01 shutdown[8190]: power-down by dan: 
Sep 12 14:58:16 r720-01 bacula-fd[12842]: Shutting down Bacula service: r720-01-fd ...
Sep 12 14:58:23 r720-01 kernel: vnet0.3: link state changed to DOWN
Sep 12 14:58:23 r720-01 kernel: epair0b: link state changed to DOWN
Sep 12 14:58:24 r720-01 kernel: in6_purgeaddr: err=65, destination address delete failed
Sep 12 14:58:27 r720-01 kernel: vnet0.4: link state changed to DOWN
Sep 12 14:58:27 r720-01 kernel: epair0b: link state changed to DOWN
Sep 12 14:58:28 r720-01 kernel: in6_purgeaddr: err=65, destination address delete failed
Sep 12 14:58:39 r720-01 kernel: vnet0.5: link state changed to DOWN
Sep 12 14:58:39 r720-01 kernel: epair0b: link state changed to DOWN
Sep 12 14:58:40 r720-01 kernel: ix0: link state changed to DOWN
Sep 12 14:58:40 r720-01 kernel: WARNING: Intel (R) Network Connections are quality tested using Intel (R) Ethernet Optics. Using untested modules is not supported and may cause unstable operation or damage to the module or the adapter. Intel Corporation is not responsible for any harm caused by using untested modules.
Sep 12 14:58:40 r720-01 kernel: in6_purgeaddr: err=65, destination address delete failed
Sep 12 14:58:40 r720-01 kernel: WARNING: Intel (R) Network Connections are quality tested using Intel (R) Ethernet Optics. Using untested modules is not supported and may cause unstable operation or damage to the module or the adapter. Intel Corporation is not responsible for any harm caused by using untested modules.
Sep 12 14:58:40 r720-01 syslogd: last message repeated 1 times
Sep 12 14:58:41 r720-01 kernel: ix0: link state changed to UP
Sep 12 14:58:53 r720-01 kernel: , 2433.
Sep 12 14:58:53 r720-01 nrpe[2417]: Caught SIGTERM - shutting down...
Sep 12 14:58:53 r720-01 nrpe[2417]: Daemon shutdown
Sep 12 14:58:54 r720-01 kernel: , 2417.
Sep 12 14:58:54 r720-01 kernel: .
Sep 12 14:58:54 r720-01 upsd[69306]: mainloop: Interrupted system call
Sep 12 14:58:54 r720-01 ntpd[2276]: ntpd exiting on signal 15 (Terminated)
Sep 12 14:58:56 r720-01 kernel: , 2276.
Sep 12 14:58:57 r720-01 kernel: , 2221.
Sep 12 14:58:58 r720-01 kernel: , 2218.
Sep 12 14:58:59 r720-01 kernel: , 2215.
Sep 12 14:59:00 r720-01 kernel: , 2208.
Sep 12 14:59:00 r720-01 syslogd: exiting on signal 15
Sep 12 15:45:53 r720-01 syslogd: kernel boot file is /boot/kernel/kernel

Yeah, I’m using FiberStore optics in there. They work fine, apart from those message. My post to the FreeBSD network mailing list confirms this is expected.

Test 1 power up

I powered up the UPS, the rack came back to life, but in the wrong order.

I missed a step. I needed to alter what was plugged into what outlet on the PDU. I had servers screaming because they’re on only one PSU. ;)

Once they stopped screaming, I moved the cables.

Both knew and r720-01 needed power-on set in BIOS. They were both last-state.

Configuration 1 changes

After what seemed like a 2 minute delay on the UPS powering off, I found a setting in its GUI which showed me why.

Changed:

Remaining time 120->300
Capacity under: 0 -> 20
Shutdown Duration: 120->240
Restart if capacity exceed: 0->30%

I removed the FINALDELAY from upsmon.conf on the nut primary.

This is moving the UPS power off logic into the UPS and away from nut. Based on logs, nut wasn’t doing what I expected. FINALDELAY didn’t seem to be observed. I am sure FINALDEALY is being observed properly, but I did not see it doing what I expected. Let’s try this other approach first. NOTE: it worked just fine.

Test 2 power down

It went as expected. Nothing much different.

Test 2 power up

That worked well. One of the PDU outlets is misconfigured.

Turns out I did not save my changes, as evidenced by PENDING in this screen shot I took:

Configuration 2 changes

I’ve tweaked the UPS shutdown time from 240 to 210 seconds.

Test 3 power down

This is the time to power off:

  • 1:23 knew
  • 2:18 slocum
  • 2:42 r720-01
  • 3:57 UPS

I have about 1:15 extra there, but I am hesitant to trim the UPS shutdown any further than that.

It takes 3 minutes to shutdown all servers staring at 5 minutes runtime left on UPS

Test 3 power up

For this power up test, once the UPS was shutdown, I disconnected it from mains, then reconnected it.

The front panel came out, but I still had to press POWER ON for the power to be delivered and the rack begin power on.

I wonder if I have a setting wrong somewhere. NOTE: the setting I want is Auto restart as found on page 11 of the 5PX_UPS_Manual_North_America.pdf manual.

knew powered up about 2 minutes after r720-01, as designed.

slocum powered up about 30 seconds later.

I was watching the r720-01 screen. The following is some of the timings I recorded:

  • 3:35 #FreeBSD splash screen appears
  • 5:00 jails are started
  • 5:35 login prompt
  • 5:42 knew responds to pings
  • 5:55 slocum responds to pings
  • 6:12 ssh into slocum & knew
  • 13:30 – All jails are up and running

slocum takes the longest to start up. 34 jails running on 2x E5-2620v2 with 128GB RAM – hardware details.

The results

Everything powered on as expected. Good results.

My only remaining issue: will this UPS start up unattended when power returns?

Much later, I realized I missed the opportunity to find the answer from my previous tweet.

  • do another power down test
  • let the UPS power itself off
  • turn off the breaker
  • wait
  • turn breaker back on

I find it amusing that it took me so long to come up with that. I also want to review the Auto restart setting via the front panel.

Sounds like a job for another day.

Another day

It is Sunday afternoon. I’m in the hallway outside the server room. The view looks like this.

Server room – as seen from the hallway

Today I am going to trip the circuit breaker to the UPS and issue an FSD.

I have identified the breaker, tripped it, and verified the UPS goes on battery.

Test 4 power down

I turned off the breaker.

On the nut primary (my pfSense firewall), I issued this command:

/usr/local/sbin/upsmon -c fsd

These are the times I saw:

  • 0:11 Notice of FSD
  • 0:16 Kill message
  • 1:58 knew powers off
  • 2:38 slocum powers off
  • 4:00 r720-01 powers off
  • 4:00 ups offline

All good. I suspect r720-01 powered off earlier but I thought it was still powered up because the front display was active.

Test 4 power up

I turned on the breaker. The UPS started up right away.

These are the timings.

  • 0:20 switches on
  • 0:40 r720-01 power on
  • 1:28 10G switch online
  • 1:42 1G switch online
  • 2:45 knew powers on
  • 3:15 slocum powers on
  • 4:24 r720-01 IPMI answers
  • 6:25 knew answers pings
  • 6:35 r720-01 finishes startup
  • 6:45 slocum answers pings
  • 7:05 sshd’d into knew & slouch
  • 9:40 all knew jails started
  • 14:30 all slocum jails started
  • 22:45 UPS OK – Status=Online, Charging Utility=114.9V Batt=86.0% Load=49.0% Left=28.3min
  • 24:40 All green on Nagios

Test ends.

Good. All good

Based on this, I’m confident the rack will properly shutdown when the power fails. Just as important, it will come back when the power returns.

Thank you for coming to my TED talk.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive