Yesterday a big storm came through. Many downed trees (photos here). The power went off at about 0500 and came back again at about 1009, off again at 1309, back at 1319. It was then stable for the rest of the day.
Overnight, I saw monitoring alerts which initially made me think there was another power cut. uptime showed that was not the case. I also didn’t see any timeout messages meaning the UPS was powering down.
This post is for me to write down what I’m seeing in the logs and monitoring. Perhaps I’ll figure out the cause.
All times are UTC.
First hint – camera
It was my camera notification saying it was offline at 0334 and then again at 0313 which first brought this to my attention. I must note that these alerts arrive after the camera has not been streaming for a given time. The events which caused the camera to go offline must have been earlier.
03:20 – r730-01 & r730-03 lose heartbeat connect to gw01
Looking in other logs, it’s saying it could not talk to gw01.
Apr 5 03:20:49 r730-01 upsmon[1818]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected Apr 5 03:20:49 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org lost Apr 5 03:20:59 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 03:22:14 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 03:22:19 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out Apr 5 03:22:19 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected Apr 5 03:22:19 r730-01 upsmon[1818]: Communications with UPS heartbeat lost Apr 5 03:22:19 r730-01 upsmon[1818]: UPS [heartbeat] was last known to be not fully online and currently is not communicating, assuming dead Apr 5 03:22:30 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 03:23:40 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 03:23:49 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out Apr 5 03:23:49 r730-01 upsmon[1818]: Communications with UPS heartbeat established Apr 5 03:23:56 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 03:25:06 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 03:25:19 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out Apr 5 03:25:19 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected Apr 5 03:25:19 r730-01 upsmon[1818]: Communications with UPS heartbeat lost Apr 5 03:25:22 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 03:26:32 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 03:26:38 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org established Apr 5 03:26:38 r730-01 upsmon[1818]: Communications with UPS heartbeat established
And the other main host:
Apr 5 03:20:51 r730-03 upsmon[1897]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected Apr 5 03:20:51 r730-03 upsmon[1897]: Communications with UPS ups04@gw01.int.unixathome.org lost Apr 5 03:20:58 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver Apr 5 03:22:14 r730-03 upsd[1876]: UPS [ups04] data is no longer stale Apr 5 03:22:21 r730-03 upsmon[1897]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out Apr 5 03:22:21 r730-03 upsmon[1897]: Poll UPS [heartbeat] failed - Server disconnected Apr 5 03:22:21 r730-03 upsmon[1897]: Communications with UPS heartbeat lost Apr 5 03:22:21 r730-03 upsmon[1897]: UPS [heartbeat] was last known to be not fully online and currently is not communicating, assuming dead Apr 5 03:22:30 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver Apr 5 03:23:32 r730-03 smartd[15629]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors Apr 5 03:23:32 r730-03 smartd[15629]: Device: /dev/da1 [SAT], 2 Offline uncorrectable sectors Apr 5 03:23:41 r730-03 upsd[1876]: UPS [ups04] data is no longer stale Apr 5 03:23:51 r730-03 upsmon[1897]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out Apr 5 03:23:51 r730-03 upsmon[1897]: Communications with UPS heartbeat established Apr 5 03:23:57 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver Apr 5 03:25:07 r730-03 upsd[1876]: UPS [ups04] data is no longer stale Apr 5 03:25:21 r730-03 upsmon[1897]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out Apr 5 03:25:21 r730-03 upsmon[1897]: Poll UPS [heartbeat] failed - Server disconnected Apr 5 03:25:21 r730-03 upsmon[1897]: Communications with UPS heartbeat lost Apr 5 03:25:21 r730-03 upsmon[1897]: UPS [heartbeat] was last known to be not fully online and currently is not communicating, assuming dead Apr 5 03:25:23 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver Apr 5 03:26:33 r730-03 upsd[1876]: UPS [ups04] data is no longer stale Apr 5 03:26:41 r730-03 upsmon[1897]: Communications with UPS ups04@gw01.int.unixathome.org established Apr 5 03:26:41 r730-03 upsmon[1897]: Communications with UPS heartbeat established
Based on those messages, the event started at about 03:20:49 and ended at about 03:26:33.
03:26 – gw01
On the gateway (gw01, I found these log messages (I have filtered out the openvpn messages).
Apr 5 03:19:00 gw01 upsmon[2554]: UPS heartbeat on battery Apr 5 03:20:50 gw01 root[56295]: dhclient-exit-hooks but for something we ignore (reason: RENEW, iface: igc0) Apr 5 03:24:02 gw01 upsmon[2554]: UPS heartbeat on line power Apr 5 03:26:17 gw01 kernel: igc3: link state changed to DOWN Apr 5 03:26:17 gw01 kernel: vlan2: link state changed to DOWN Apr 5 03:26:17 gw01 kernel: vlan3: link state changed to DOWN Apr 5 03:26:17 gw01 kernel: vlan4: link state changed to DOWN Apr 5 03:26:17 gw01 kernel: vlan219: link state changed to DOWN Apr 5 03:26:17 gw01 kernel: vlan7: link state changed to DOWN Apr 5 03:26:17 gw01 kernel: vlan8: link state changed to DOWN Apr 5 03:26:18 gw01 radvd[2378]: sendmsg: Network is down Apr 5 03:26:19 gw01 kernel: igc3: link state changed to UP Apr 5 03:26:19 gw01 kernel: vlan2: link state changed to UP Apr 5 03:26:19 gw01 kernel: vlan3: link state changed to UP Apr 5 03:26:19 gw01 kernel: vlan4: link state changed to UP Apr 5 03:26:19 gw01 kernel: vlan219: link state changed to UP Apr 5 03:26:19 gw01 kernel: vlan7: link state changed to UP Apr 5 03:26:19 gw01 kernel: vlan8: link state changed to UP
But that does not account for the Poll UPS disconnections at about 03:20 on r730-01 and r730-03.
Nearly an hour later:
Apr 5 04:15:16 gw01 kernel: igc3: link state changed to DOWN Apr 5 04:15:16 gw01 kernel: vlan2: link state changed to DOWN Apr 5 04:15:16 gw01 kernel: vlan3: link state changed to DOWN Apr 5 04:15:16 gw01 kernel: vlan4: link state changed to DOWN Apr 5 04:15:16 gw01 kernel: vlan219: link state changed to DOWN Apr 5 04:15:16 gw01 kernel: vlan7: link state changed to DOWN Apr 5 04:15:16 gw01 kernel: vlan8: link state changed to DOWN Apr 5 04:15:22 gw01 kernel: igc3: link state changed to UP Apr 5 04:15:22 gw01 kernel: vlan2: link state changed to UP Apr 5 04:15:22 gw01 kernel: vlan3: link state changed to UP Apr 5 04:15:22 gw01 kernel: vlan4: link state changed to UP Apr 5 04:15:22 gw01 kernel: vlan219: link state changed to UP Apr 5 04:15:22 gw01 kernel: vlan7: link state changed to UP Apr 5 04:15:22 gw01 kernel: vlan8: link state changed to UP
Followed by:
Apr 5 04:18:41 gw01 kernel: igc3: link state changed to DOWN Apr 5 04:18:41 gw01 kernel: vlan2: link state changed to DOWN Apr 5 04:18:41 gw01 kernel: vlan3: link state changed to DOWN Apr 5 04:18:41 gw01 kernel: vlan4: link state changed to DOWN Apr 5 04:18:41 gw01 kernel: vlan219: link state changed to DOWN Apr 5 04:18:41 gw01 kernel: vlan7: link state changed to DOWN Apr 5 04:18:41 gw01 kernel: vlan8: link state changed to DOWN Apr 5 04:18:41 gw01 radvd[2378]: sendmsg: Network is down Apr 5 04:18:44 gw01 kernel: igc3: link state changed to UP Apr 5 04:18:44 gw01 kernel: vlan2: link state changed to UP Apr 5 04:18:44 gw01 kernel: vlan3: link state changed to UP Apr 5 04:18:44 gw01 kernel: vlan4: link state changed to UP Apr 5 04:18:44 gw01 kernel: vlan219: link state changed to UP Apr 5 04:18:44 gw01 kernel: vlan7: link state changed to UP Apr 5 04:18:44 gw01 kernel: vlan8: link state changed to UP Apr 5 04:18:47 gw01 upsd[2545]: write() failed for 10.55.0.143: Connection reset by peer Apr 5 04:18:47 gw01 upsd[2545]: write() failed for 10.55.0.143: Broken pipe
10.55.0.143 is r730-03.
And later:
Apr 5 04:55:10 gw01 kernel: igc3: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan2: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan3: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan4: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan219: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan7: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan8: link state changed to DOWN Apr 5 04:55:13 gw01 kernel: igc3: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan2: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan3: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan4: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan219: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan7: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan8: link state changed to UP
Looking closer, there’s a whole lot of this going on (in the following, I have cut out the logs already shown above):
[15:20 gw01 dvl ~] % grep 'Apr 5' /var/log/messages | grep kernel Apr 5 04:55:10 gw01 kernel: igc3: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan2: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan3: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan4: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan219: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan7: link state changed to DOWN Apr 5 04:55:10 gw01 kernel: vlan8: link state changed to DOWN Apr 5 04:55:13 gw01 kernel: igc3: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan2: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan3: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan4: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan219: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan7: link state changed to UP Apr 5 04:55:13 gw01 kernel: vlan8: link state changed to UP Apr 5 06:10:35 gw01 kernel: igc3: link state changed to DOWN Apr 5 06:10:35 gw01 kernel: vlan2: link state changed to DOWN Apr 5 06:10:35 gw01 kernel: vlan3: link state changed to DOWN Apr 5 06:10:35 gw01 kernel: vlan4: link state changed to DOWN Apr 5 06:10:35 gw01 kernel: vlan219: link state changed to DOWN Apr 5 06:10:35 gw01 kernel: vlan7: link state changed to DOWN Apr 5 06:10:35 gw01 kernel: vlan8: link state changed to DOWN Apr 5 06:10:37 gw01 kernel: igc3: link state changed to UP Apr 5 06:10:37 gw01 kernel: vlan2: link state changed to UP Apr 5 06:10:37 gw01 kernel: vlan3: link state changed to UP Apr 5 06:10:37 gw01 kernel: vlan4: link state changed to UP Apr 5 06:10:37 gw01 kernel: vlan219: link state changed to UP Apr 5 06:10:37 gw01 kernel: vlan7: link state changed to UP Apr 5 06:10:37 gw01 kernel: vlan8: link state changed to UP Apr 5 09:24:56 gw01 kernel: igc3: link state changed to DOWN Apr 5 09:24:56 gw01 kernel: vlan2: link state changed to DOWN Apr 5 09:24:56 gw01 kernel: vlan3: link state changed to DOWN Apr 5 09:24:56 gw01 kernel: vlan4: link state changed to DOWN Apr 5 09:24:56 gw01 kernel: vlan219: link state changed to DOWN Apr 5 09:24:56 gw01 kernel: vlan7: link state changed to DOWN Apr 5 09:24:56 gw01 kernel: vlan8: link state changed to DOWN Apr 5 09:24:59 gw01 kernel: igc3: link state changed to UP Apr 5 09:24:59 gw01 kernel: vlan2: link state changed to UP Apr 5 09:24:59 gw01 kernel: vlan3: link state changed to UP Apr 5 09:24:59 gw01 kernel: vlan4: link state changed to UP Apr 5 09:24:59 gw01 kernel: vlan219: link state changed to UP Apr 5 09:24:59 gw01 kernel: vlan7: link state changed to UP Apr 5 09:24:59 gw01 kernel: vlan8: link state changed to UP Apr 5 09:38:26 gw01 kernel: igc3: link state changed to DOWN Apr 5 09:38:26 gw01 kernel: vlan2: link state changed to DOWN Apr 5 09:38:26 gw01 kernel: vlan3: link state changed to DOWN Apr 5 09:38:26 gw01 kernel: vlan4: link state changed to DOWN Apr 5 09:38:26 gw01 kernel: vlan219: link state changed to DOWN Apr 5 09:38:26 gw01 kernel: vlan7: link state changed to DOWN Apr 5 09:38:26 gw01 kernel: vlan8: link state changed to DOWN Apr 5 09:38:29 gw01 kernel: igc3: link state changed to UP Apr 5 09:38:29 gw01 kernel: vlan2: link state changed to UP Apr 5 09:38:29 gw01 kernel: vlan3: link state changed to UP Apr 5 09:38:29 gw01 kernel: vlan4: link state changed to UP Apr 5 09:38:29 gw01 kernel: vlan219: link state changed to UP Apr 5 09:38:29 gw01 kernel: vlan7: link state changed to UP Apr 5 09:38:29 gw01 kernel: vlan8: link state changed to UP Apr 5 09:51:53 gw01 kernel: igc3: link state changed to DOWN Apr 5 09:51:53 gw01 kernel: vlan2: link state changed to DOWN Apr 5 09:51:53 gw01 kernel: vlan3: link state changed to DOWN Apr 5 09:51:53 gw01 kernel: vlan4: link state changed to DOWN Apr 5 09:51:53 gw01 kernel: vlan219: link state changed to DOWN Apr 5 09:51:53 gw01 kernel: vlan7: link state changed to DOWN Apr 5 09:51:53 gw01 kernel: vlan8: link state changed to DOWN Apr 5 09:51:56 gw01 kernel: igc3: link state changed to UP Apr 5 09:51:56 gw01 kernel: vlan2: link state changed to UP Apr 5 09:51:56 gw01 kernel: vlan3: link state changed to UP Apr 5 09:51:56 gw01 kernel: vlan4: link state changed to UP Apr 5 09:51:56 gw01 kernel: vlan219: link state changed to UP Apr 5 09:51:56 gw01 kernel: vlan7: link state changed to UP Apr 5 09:51:56 gw01 kernel: vlan8: link state changed to UP
igc3 is the LAN NIC on gw01. All the VLANs are based on that NIC. For example…. I just went to ssh into the gateway to grab a config. ssh failed.
I went down to the basement and logged into the console. ifconfig looked right. I could not ping hosts on the LAN. Definitely a problem with networking.
Then, it started working. I decided to change the network cable between igc3 and the switch.
Back at my laptop and ssh’d into gw01, I found it had happened again:
Apr 5 15:51:17 gw01 kernel: igc3: link state changed to DOWN Apr 5 15:51:17 gw01 kernel: vlan2: link state changed to DOWN Apr 5 15:51:17 gw01 kernel: vlan3: link state changed to DOWN Apr 5 15:51:17 gw01 kernel: vlan4: link state changed to DOWN Apr 5 15:51:17 gw01 kernel: vlan219: link state changed to DOWN Apr 5 15:51:17 gw01 kernel: vlan7: link state changed to DOWN Apr 5 15:51:17 gw01 kernel: vlan8: link state changed to DOWN Apr 5 15:51:25 gw01 kernel: igc3: link state changed to UP Apr 5 15:51:25 gw01 kernel: vlan2: link state changed to UP Apr 5 15:51:25 gw01 kernel: vlan3: link state changed to UP Apr 5 15:51:25 gw01 kernel: vlan4: link state changed to UP Apr 5 15:51:25 gw01 kernel: vlan219: link state changed to UP Apr 5 15:51:25 gw01 kernel: vlan7: link state changed to UP Apr 5 15:51:25 gw01 kernel: vlan8: link state changed to UP Apr 5 15:51:28 gw01 upsd[2545]: write() failed for 10.55.0.143: Connection reset by peer Apr 5 15:51:28 gw01 upsd[2545]: write() failed for 10.55.0.143: Broken pipe
Other Nagios alerts had fired between 15:47 and 15:53 – This ain’t good.
Over on r730-01:
Apr 5 15:41:54 r730-01 upsmon[1818]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected Apr 5 15:41:54 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org lost Apr 5 15:42:05 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 15:43:20 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 15:43:24 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation Apr 5 15:43:24 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected Apr 5 15:43:24 r730-01 upsmon[1818]: Communications with UPS heartbeat lost Apr 5 15:43:24 r730-01 upsmon[1818]: UPS [heartbeat] was last known to be not fully online and currently is not communi Apr 5 15:43:36 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 15:44:46 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 15:44:54 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation Apr 5 15:44:54 r730-01 upsmon[1818]: Communications with UPS heartbeat established Apr 5 15:44:54 r730-01 upsmon[1818]: UPS heartbeat on line power Apr 5 15:45:02 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 15:46:12 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 15:46:24 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation Apr 5 15:46:24 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected Apr 5 15:46:24 r730-01 upsmon[1818]: Communications with UPS heartbeat lost Apr 5 15:46:28 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver Apr 5 15:47:27 r730-01 upsd[1811]: UPS [ups04] data is no longer stale Apr 5 15:47:44 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org established Apr 5 15:47:44 r730-01 upsmon[1818]: Communications with UPS heartbeat established Apr 5 15:49:54 r730-01 upsmon[1818]: UPS heartbeat on battery Apr 5 15:51:25 r730-01 upsmon[1818]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected Apr 5 15:51:25 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org lost Apr 5 15:51:30 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org established
For now, I have an event to attend. I’ll come back to this afterwards.
After the event
It is now 18:35 UTC – the networks issues have not been seen again. I’ll give it another 24-48 hours before declaring victory.
I should also test the cable I removed.
Speaking of cable testing… let’s look at the metrics.

That’s the past 24 hours. Let’s zoom in.

Look at those error spikes. I time them to be at about, from left to right:
- 03:20
- 04:10
- 04:50
- 06:10
- 09:50
- 15:40
Those times, pulled from the graphs based on my eyes, match up closely enough to all the known interfaces problems.
So, what about the switch side of that cable?

Nothing. Nothing at all over the past two days.
I’m thinking it was the cable.