network connectivity issues – in the #homelab

Yesterday a big storm came through. Many downed trees (photos here). The power went off at about 0500 and came back again at about 1009, off again at 1309, back at 1319. It was then stable for the rest of the day.

Overnight, I saw monitoring alerts which initially made me think there was another power cut. uptime showed that was not the case. I also didn’t see any timeout messages meaning the UPS was powering down.

This post is for me to write down what I’m seeing in the logs and monitoring. Perhaps I’ll figure out the cause.

All times are UTC.

First hint – camera

It was my camera notification saying it was offline at 0334 and then again at 0313 which first brought this to my attention. I must note that these alerts arrive after the camera has not been streaming for a given time. The events which caused the camera to go offline must have been earlier.

03:20 – r730-01 & r730-03 lose heartbeat connect to gw01

Looking in other logs, it’s saying it could not talk to gw01.

Apr  5 03:20:49 r730-01 upsmon[1818]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected
Apr  5 03:20:49 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org lost
Apr  5 03:20:59 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 03:22:14 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 03:22:19 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Apr  5 03:22:19 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected
Apr  5 03:22:19 r730-01 upsmon[1818]: Communications with UPS heartbeat lost
Apr  5 03:22:19 r730-01 upsmon[1818]: UPS [heartbeat] was last known to be not fully online and currently is not communicating, assuming dead
Apr  5 03:22:30 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 03:23:40 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 03:23:49 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Apr  5 03:23:49 r730-01 upsmon[1818]: Communications with UPS heartbeat established
Apr  5 03:23:56 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 03:25:06 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 03:25:19 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Apr  5 03:25:19 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected
Apr  5 03:25:19 r730-01 upsmon[1818]: Communications with UPS heartbeat lost
Apr  5 03:25:22 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 03:26:32 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 03:26:38 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org established
Apr  5 03:26:38 r730-01 upsmon[1818]: Communications with UPS heartbeat established

And the other main host:

Apr  5 03:20:51 r730-03 upsmon[1897]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected
Apr  5 03:20:51 r730-03 upsmon[1897]: Communications with UPS ups04@gw01.int.unixathome.org lost
Apr  5 03:20:58 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver
Apr  5 03:22:14 r730-03 upsd[1876]: UPS [ups04] data is no longer stale
Apr  5 03:22:21 r730-03 upsmon[1897]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Apr  5 03:22:21 r730-03 upsmon[1897]: Poll UPS [heartbeat] failed - Server disconnected
Apr  5 03:22:21 r730-03 upsmon[1897]: Communications with UPS heartbeat lost
Apr  5 03:22:21 r730-03 upsmon[1897]: UPS [heartbeat] was last known to be not fully online and currently is not communicating, assuming dead
Apr  5 03:22:30 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver
Apr  5 03:23:32 r730-03 smartd[15629]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr  5 03:23:32 r730-03 smartd[15629]: Device: /dev/da1 [SAT], 2 Offline uncorrectable sectors
Apr  5 03:23:41 r730-03 upsd[1876]: UPS [ups04] data is no longer stale
Apr  5 03:23:51 r730-03 upsmon[1897]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Apr  5 03:23:51 r730-03 upsmon[1897]: Communications with UPS heartbeat established
Apr  5 03:23:57 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver
Apr  5 03:25:07 r730-03 upsd[1876]: UPS [ups04] data is no longer stale
Apr  5 03:25:21 r730-03 upsmon[1897]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Apr  5 03:25:21 r730-03 upsmon[1897]: Poll UPS [heartbeat] failed - Server disconnected
Apr  5 03:25:21 r730-03 upsmon[1897]: Communications with UPS heartbeat lost
Apr  5 03:25:21 r730-03 upsmon[1897]: UPS [heartbeat] was last known to be not fully online and currently is not communicating, assuming dead
Apr  5 03:25:23 r730-03 upsd[1876]: Data for UPS [ups04] is stale - check driver
Apr  5 03:26:33 r730-03 upsd[1876]: UPS [ups04] data is no longer stale
Apr  5 03:26:41 r730-03 upsmon[1897]: Communications with UPS ups04@gw01.int.unixathome.org established
Apr  5 03:26:41 r730-03 upsmon[1897]: Communications with UPS heartbeat established

Based on those messages, the event started at about 03:20:49 and ended at about 03:26:33.

03:26 – gw01

On the gateway (gw01, I found these log messages (I have filtered out the openvpn messages).

Apr  5 03:19:00 gw01 upsmon[2554]: UPS heartbeat on battery
Apr  5 03:20:50 gw01 root[56295]: dhclient-exit-hooks but for something we ignore (reason: RENEW, iface: igc0)
Apr  5 03:24:02 gw01 upsmon[2554]: UPS heartbeat on line power
Apr  5 03:26:17 gw01 kernel: igc3: link state changed to DOWN
Apr  5 03:26:17 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 03:26:17 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 03:26:17 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 03:26:17 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 03:26:17 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 03:26:17 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 03:26:18 gw01 radvd[2378]: sendmsg: Network is down
Apr  5 03:26:19 gw01 kernel: igc3: link state changed to UP
Apr  5 03:26:19 gw01 kernel: vlan2: link state changed to UP
Apr  5 03:26:19 gw01 kernel: vlan3: link state changed to UP
Apr  5 03:26:19 gw01 kernel: vlan4: link state changed to UP
Apr  5 03:26:19 gw01 kernel: vlan219: link state changed to UP
Apr  5 03:26:19 gw01 kernel: vlan7: link state changed to UP
Apr  5 03:26:19 gw01 kernel: vlan8: link state changed to UP

But that does not account for the Poll UPS disconnections at about 03:20 on r730-01 and r730-03.

Nearly an hour later:

Apr  5 04:15:16 gw01 kernel: igc3: link state changed to DOWN
Apr  5 04:15:16 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 04:15:16 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 04:15:16 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 04:15:16 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 04:15:16 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 04:15:16 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 04:15:22 gw01 kernel: igc3: link state changed to UP
Apr  5 04:15:22 gw01 kernel: vlan2: link state changed to UP
Apr  5 04:15:22 gw01 kernel: vlan3: link state changed to UP
Apr  5 04:15:22 gw01 kernel: vlan4: link state changed to UP
Apr  5 04:15:22 gw01 kernel: vlan219: link state changed to UP
Apr  5 04:15:22 gw01 kernel: vlan7: link state changed to UP
Apr  5 04:15:22 gw01 kernel: vlan8: link state changed to UP

Followed by:

Apr  5 04:18:41 gw01 kernel: igc3: link state changed to DOWN
Apr  5 04:18:41 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 04:18:41 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 04:18:41 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 04:18:41 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 04:18:41 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 04:18:41 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 04:18:41 gw01 radvd[2378]: sendmsg: Network is down
Apr  5 04:18:44 gw01 kernel: igc3: link state changed to UP
Apr  5 04:18:44 gw01 kernel: vlan2: link state changed to UP
Apr  5 04:18:44 gw01 kernel: vlan3: link state changed to UP
Apr  5 04:18:44 gw01 kernel: vlan4: link state changed to UP
Apr  5 04:18:44 gw01 kernel: vlan219: link state changed to UP
Apr  5 04:18:44 gw01 kernel: vlan7: link state changed to UP
Apr  5 04:18:44 gw01 kernel: vlan8: link state changed to UP
Apr  5 04:18:47 gw01 upsd[2545]: write() failed for 10.55.0.143: Connection reset by peer
Apr  5 04:18:47 gw01 upsd[2545]: write() failed for 10.55.0.143: Broken pipe

10.55.0.143 is r730-03.

And later:

Apr  5 04:55:10 gw01 kernel: igc3: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 04:55:13 gw01 kernel: igc3: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan2: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan3: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan4: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan219: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan7: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan8: link state changed to UP

Looking closer, there’s a whole lot of this going on (in the following, I have cut out the logs already shown above):

[15:20 gw01 dvl ~] % grep 'Apr  5' /var/log/messages | grep kernel
Apr  5 04:55:10 gw01 kernel: igc3: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 04:55:10 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 04:55:13 gw01 kernel: igc3: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan2: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan3: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan4: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan219: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan7: link state changed to UP
Apr  5 04:55:13 gw01 kernel: vlan8: link state changed to UP
Apr  5 06:10:35 gw01 kernel: igc3: link state changed to DOWN
Apr  5 06:10:35 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 06:10:35 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 06:10:35 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 06:10:35 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 06:10:35 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 06:10:35 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 06:10:37 gw01 kernel: igc3: link state changed to UP
Apr  5 06:10:37 gw01 kernel: vlan2: link state changed to UP
Apr  5 06:10:37 gw01 kernel: vlan3: link state changed to UP
Apr  5 06:10:37 gw01 kernel: vlan4: link state changed to UP
Apr  5 06:10:37 gw01 kernel: vlan219: link state changed to UP
Apr  5 06:10:37 gw01 kernel: vlan7: link state changed to UP
Apr  5 06:10:37 gw01 kernel: vlan8: link state changed to UP
Apr  5 09:24:56 gw01 kernel: igc3: link state changed to DOWN
Apr  5 09:24:56 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 09:24:56 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 09:24:56 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 09:24:56 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 09:24:56 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 09:24:56 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 09:24:59 gw01 kernel: igc3: link state changed to UP
Apr  5 09:24:59 gw01 kernel: vlan2: link state changed to UP
Apr  5 09:24:59 gw01 kernel: vlan3: link state changed to UP
Apr  5 09:24:59 gw01 kernel: vlan4: link state changed to UP
Apr  5 09:24:59 gw01 kernel: vlan219: link state changed to UP
Apr  5 09:24:59 gw01 kernel: vlan7: link state changed to UP
Apr  5 09:24:59 gw01 kernel: vlan8: link state changed to UP
Apr  5 09:38:26 gw01 kernel: igc3: link state changed to DOWN
Apr  5 09:38:26 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 09:38:26 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 09:38:26 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 09:38:26 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 09:38:26 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 09:38:26 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 09:38:29 gw01 kernel: igc3: link state changed to UP
Apr  5 09:38:29 gw01 kernel: vlan2: link state changed to UP
Apr  5 09:38:29 gw01 kernel: vlan3: link state changed to UP
Apr  5 09:38:29 gw01 kernel: vlan4: link state changed to UP
Apr  5 09:38:29 gw01 kernel: vlan219: link state changed to UP
Apr  5 09:38:29 gw01 kernel: vlan7: link state changed to UP
Apr  5 09:38:29 gw01 kernel: vlan8: link state changed to UP
Apr  5 09:51:53 gw01 kernel: igc3: link state changed to DOWN
Apr  5 09:51:53 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 09:51:53 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 09:51:53 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 09:51:53 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 09:51:53 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 09:51:53 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 09:51:56 gw01 kernel: igc3: link state changed to UP
Apr  5 09:51:56 gw01 kernel: vlan2: link state changed to UP
Apr  5 09:51:56 gw01 kernel: vlan3: link state changed to UP
Apr  5 09:51:56 gw01 kernel: vlan4: link state changed to UP
Apr  5 09:51:56 gw01 kernel: vlan219: link state changed to UP
Apr  5 09:51:56 gw01 kernel: vlan7: link state changed to UP
Apr  5 09:51:56 gw01 kernel: vlan8: link state changed to UP

igc3 is the LAN NIC on gw01. All the VLANs are based on that NIC. For example…. I just went to ssh into the gateway to grab a config. ssh failed.

I went down to the basement and logged into the console. ifconfig looked right. I could not ping hosts on the LAN. Definitely a problem with networking.

Then, it started working. I decided to change the network cable between igc3 and the switch.

Back at my laptop and ssh’d into gw01, I found it had happened again:

Apr  5 15:51:17 gw01 kernel: igc3: link state changed to DOWN
Apr  5 15:51:17 gw01 kernel: vlan2: link state changed to DOWN
Apr  5 15:51:17 gw01 kernel: vlan3: link state changed to DOWN
Apr  5 15:51:17 gw01 kernel: vlan4: link state changed to DOWN
Apr  5 15:51:17 gw01 kernel: vlan219: link state changed to DOWN
Apr  5 15:51:17 gw01 kernel: vlan7: link state changed to DOWN
Apr  5 15:51:17 gw01 kernel: vlan8: link state changed to DOWN
Apr  5 15:51:25 gw01 kernel: igc3: link state changed to UP
Apr  5 15:51:25 gw01 kernel: vlan2: link state changed to UP
Apr  5 15:51:25 gw01 kernel: vlan3: link state changed to UP
Apr  5 15:51:25 gw01 kernel: vlan4: link state changed to UP
Apr  5 15:51:25 gw01 kernel: vlan219: link state changed to UP
Apr  5 15:51:25 gw01 kernel: vlan7: link state changed to UP
Apr  5 15:51:25 gw01 kernel: vlan8: link state changed to UP
Apr  5 15:51:28 gw01 upsd[2545]: write() failed for 10.55.0.143: Connection reset by peer
Apr  5 15:51:28 gw01 upsd[2545]: write() failed for 10.55.0.143: Broken pipe

Other Nagios alerts had fired between 15:47 and 15:53 – This ain’t good.

Over on r730-01:

Apr  5 15:41:54 r730-01 upsmon[1818]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected
Apr  5 15:41:54 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org lost
Apr  5 15:42:05 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 15:43:20 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 15:43:24 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation
Apr  5 15:43:24 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected
Apr  5 15:43:24 r730-01 upsmon[1818]: Communications with UPS heartbeat lost
Apr  5 15:43:24 r730-01 upsmon[1818]: UPS [heartbeat] was last known to be not fully online and currently is not communi
Apr  5 15:43:36 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 15:44:46 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 15:44:54 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation
Apr  5 15:44:54 r730-01 upsmon[1818]: Communications with UPS heartbeat established
Apr  5 15:44:54 r730-01 upsmon[1818]: UPS heartbeat on line power
Apr  5 15:45:02 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 15:46:12 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 15:46:24 r730-01 upsmon[1818]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation
Apr  5 15:46:24 r730-01 upsmon[1818]: Poll UPS [heartbeat] failed - Server disconnected
Apr  5 15:46:24 r730-01 upsmon[1818]: Communications with UPS heartbeat lost
Apr  5 15:46:28 r730-01 upsd[1811]: Data for UPS [ups04] is stale - check driver
Apr  5 15:47:27 r730-01 upsd[1811]: UPS [ups04] data is no longer stale
Apr  5 15:47:44 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org established
Apr  5 15:47:44 r730-01 upsmon[1818]: Communications with UPS heartbeat established
Apr  5 15:49:54 r730-01 upsmon[1818]: UPS heartbeat on battery
Apr  5 15:51:25 r730-01 upsmon[1818]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Server disconnected
Apr  5 15:51:25 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org lost
Apr  5 15:51:30 r730-01 upsmon[1818]: Communications with UPS ups04@gw01.int.unixathome.org established

For now, I have an event to attend. I’ll come back to this afterwards.

After the event

It is now 18:35 UTC – the networks issues have not been seen again. I’ll give it another 24-48 hours before declaring victory.

I should also test the cable I removed.

Speaking of cable testing… let’s look at the metrics.

icg3 errors over the past 24 hours
icg3 errors over the past 24 hours

That’s the past 24 hours. Let’s zoom in.

errors starting at 15:40
errors starting at 15:40

errors starting at 03:20
errors starting at 03:20

Look at those error spikes. I time them to be at about, from left to right:

  1. 03:20
  2. 04:10
  3. 04:50
  4. 06:10
  5. 09:50
  6. 15:40

Those times, pulled from the graphs based on my eyes, match up closely enough to all the known interfaces problems.

So, what about the switch side of that cable?

errors on switch port of gateway
errors on switch port of gateway

Nothing. Nothing at all over the past two days.

I’m thinking it was the cable.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top