system freezes up with lots of sonewconn Listen queue overflow

I recently added 10 new HDD to a system which already had 10 HDD (now a total of 20).

The HDD are split into two zpools:

$ zpool list
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
system       27T  22.1T  4.93T         -    19%    81%  1.00x  ONLINE  -
tank_data  45.2T  26.3T  19.0T         -    37%    58%  1.00x  ONLINE  -
zoomzoom    436G  9.67G   426G         -     6%     2%  1.00x  ONLINE  -

The zoomzoom zpool is a pair of SSDs used for special purposes.

The existing zpool, system, was filling up. The new zpool, tank_data was to provide more space.

The backups, already in the system zpool, were to be copied to tank_data.

Some details:

system – 10 x 3TB drive raidz2 pool – 8 drives on mps0, 2 on msp2
tank_data – 10 x 5TB drive raidz3 pool – 8 drives on msp1, 2 on msp2

NOTE: two drives from each pool are on the same controller.

NOTE: the system zpool is at about 81%

Here’s the timeline, all in UTC.

2 Dec

I had created a new Bacula Storage Daemon, running in a jail on this server. Backups were headed to this new zpool, tank_data.

At about 0253 UTC, I kicked off a zfs send | zfs recv between two zpools on a server.

3 Dec

The next morning, I started tweeting about the Nagios alerts that I’d woken up to.

The sonewconn storm started at 03:03:28 UTC on Friday (really, Thu night EST, shortly after I started the send | recv).

The server was not responding to ssh or to console login. I assumed I had locked up the system with the send | recv. My plan was to leave it for the weekend to complete, as I was away.

This storm continued, dumping notices into in /var/log/messages, until the system was rebooted at Dec 6 02:35:37 UTC.

There was a log entry every minute or so mentioning sonewconn.

I conclude that the disk system was OK because:

notices kept appearing in /var/log/messsages
the zfs send | zfs recv completed

I did notice this in the logs but I’m not sure if it’s relevant.

Dec  6 00:26:34 knew kernel: sonewconn: pcb 0xfffff803ab754000: Listen queue overflow: 151 already in queue awaiting acceptance (17 occurrences)
Dec  6 00:26:43 knew kernel: Limiting closed port RST response from 277 to 200 packets/sec
Dec  6 00:26:44 knew kernel: Limiting closed port RST response from 256 to 200 packets/sec
Dec  6 00:26:46 knew kernel: Limiting closed port RST response from 269 to 200 packets/sec
Dec  6 00:26:47 knew kernel: Limiting closed port RST response from 307 to 200 packets/sec
Dec  6 00:27:34 knew kernel: sonewconn: pcb 0xfffff803ab754000: Listen queue overflow: 151 already in queue awaiting acceptance (169 occurrences)

Dec 6

As mentioned in the previous section, I rebooted the server and it came back up at about 02:35:37 UTC.

There was no storm this night, just one entry:

Dec  6 03:01:53 knew kernel: sonewconn: pcb 0xfffff8005eb50188: Listen queue overflow: 8 already in queue awaiting acceptance (1 occurrences)

Dec 7

I was working on the system that night and shut it down at about 0226 UTC. I was working on the server hardware and powered it back up at 0321 UTC.

No storm.

Dec 8

Storm started at 0301 UTC.

sonewconn added to /var/log/messages every minute or so.

Dec  9 13:53:24 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (8 occurrences)
Dec  9 13:54:25 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (8 occurrences)
Dec  9 13:54:42 knew kernel: Limiting closed port RST response from 550 to 200 packets/sec
Dec  9 13:54:43 knew kernel: Limiting closed port RST response from 410 to 200 packets/sec
Dec  9 13:54:44 knew kernel: Limiting closed port RST response from 460 to 200 packets/sec
Dec  9 13:54:45 knew kernel: Limiting closed port RST response from 527 to 200 packets/sec
Dec  9 13:55:24 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (81 occurrences)
Dec  9 13:56:25 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (129 occurrences)

System was rebooted at about 1241 UTC.

Dec 9

Storm started at 0302 UTC.

sonewconn added to /var/log/messages every minute or so.

System was rebooted at about 1438 UTC.

Points

There has been a sonewconn every night since backups started using tank_data with one exception: when the system was rebooted about 25 minutes prior to the time the storms have occurred- Coincidence?
The system zpool is active during this time. We know that because we can see the new sonewconn log entries in /var/log/messages every minute or so.
The tank_data pool is active during this time because Bacula uses that as a backup destination. Backups are succeeding.

Freeze symptoms

When I say freeze, the symptoms are:

New ssh connections fail to connect
Existing ssh connections fail to run commands
All Nagios checks fail with errors such as CHECK_NRPE: Error – Could not complete SSL handshake
Attempts to login at the console: user name and password are accept, last login message is flashed, but never get a login prompt.
existing gstat command in a tmux session stopped responding

Current theories

Here’s what I think may be wrong:

PSU being overloaded. It’s an 850W PSU with 24A on the 5V rail and 70A on the 12V rail. There are 10 HDD at 12 W and 10 HDD at 7 W, for a total of 190W, or about 16A. There is plenty of capacity there on this single rail PSU.
Not enough connections. That’s that the message says. But what connections?
Why has this started only since writing to the new pool?