I recently added 10 new HDDs to a system that already had 10, for a total of 20.
The HDDs are split into two zpools:
$ zpool list
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
system       27T  22.1T  4.93T         -    19%    81%  1.00x  ONLINE  -
tank_data  45.2T  26.3T  19.0T         -    37%    58%  1.00x  ONLINE  -
zoomzoom    436G  9.67G   426G         -     6%     2%  1.00x  ONLINE  -
The zoomzoom zpool is a pair of SSDs used for special purposes.
The existing zpool, system, was filling up. The new zpool, tank_data, was to provide more space.
The backups, already in the system zpool, were to be copied to tank_data.
Some details:
- system – 10 x 3TB drive raidz2 pool – 8 drives on mps0, 2 on mps2
- tank_data – 10 x 5TB drive raidz3 pool – 8 drives on mps1, 2 on mps2
NOTE: two drives from each pool are on the same controller (mps2); a way to confirm this mapping is sketched below.
NOTE: the system zpool is at about 81%
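For the record, here is roughly how that drive-to-controller mapping can be confirmed; the daN device names are just examples and will differ on your system:

# which daN devices are in each pool
$ zpool status system tank_data | grep -E 'da[0-9]+'

# which controller (mps0, mps1, mps2) each daN device is attached to
$ camcontrol devlist -v | grep -E 'scbus|da[0-9]+'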
Here’s the timeline, all in UTC.
Dec 2
I had created a new Bacula Storage Daemon, running in a jail on this server. Backups were headed to this new zpool, tank_data.
At about 0253 UTC, I kicked off a zfs send | zfs recv between the two zpools on this server, from system to tank_data.
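The transfer was of this general form; the dataset and snapshot names below are placeholders, not the exact ones I used:

# snapshot the source dataset, then replicate it to the new pool
$ zfs snapshot system/backups@migrate
$ zfs send system/backups@migrate | zfs recv tank_data/backups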
Dec 3
The next morning, I started tweeting about the Nagios alerts that I’d woken up to.
The sonewconn storm started at 03:03:28 UTC on Friday (really, Thu night EST, shortly after I started the send | recv).
The server was not responding to ssh or to console login. I assumed I had locked up the system with the send | recv. My plan was to leave it alone over the weekend to let it complete, as I was away.
This storm continued, dumping notices into /var/log/messages, until the system was rebooted at Dec 6 02:35:37 UTC.
There was a log entry every minute or so mentioning sonewconn.
I conclude that the disk system was OK because:
- notices kept appearing in /var/log/messages
- the zfs send | zfs recv completed
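Once the box is reachable again, the standard checks are the quickest way to confirm that; nothing here is specific to this setup:

# prints "all pools are healthy" when no pool has a problem
$ zpool status -x

# per-vdev detail, including read/write/checksum error counts
$ zpool status -v system tank_data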
I did notice this in the logs, but I'm not sure if it's relevant:
Dec 6 00:26:34 knew kernel: sonewconn: pcb 0xfffff803ab754000: Listen queue overflow: 151 already in queue awaiting acceptance (17 occurrences)
Dec 6 00:26:43 knew kernel: Limiting closed port RST response from 277 to 200 packets/sec
Dec 6 00:26:44 knew kernel: Limiting closed port RST response from 256 to 200 packets/sec
Dec 6 00:26:46 knew kernel: Limiting closed port RST response from 269 to 200 packets/sec
Dec 6 00:26:47 knew kernel: Limiting closed port RST response from 307 to 200 packets/sec
Dec 6 00:27:34 knew kernel: sonewconn: pcb 0xfffff803ab754000: Listen queue overflow: 151 already in queue awaiting acceptance (169 occurrences)
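The sonewconn message names a pcb, not a service, so it doesn't say whose listen queue is overflowing. Two standard FreeBSD commands narrow that down while a storm is underway:

# listen queue depth per listening socket (qlen/incqlen/maxqlen)
$ netstat -Lan

# map the listening sockets back to the daemons that own them
$ sockstat -l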
Dec 6
As mentioned in the previous section, I rebooted the server and it came back up at about 02:35:37 UTC.
There was no storm this night, just one entry:
Dec 6 03:01:53 knew kernel: sonewconn: pcb 0xfffff8005eb50188: Listen queue overflow: 8 already in queue awaiting acceptance (1 occurrences)
Dec 7
That night I was working on the server hardware: I shut the system down at about 0226 UTC and powered it back up at 0321 UTC.
No storm.
Dec 8
Storm started at 0301 UTC.
sonewconn entries were added to /var/log/messages every minute or so.
Dec 9 13:53:24 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (8 occurrences)
Dec 9 13:54:25 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (8 occurrences)
Dec 9 13:54:42 knew kernel: Limiting closed port RST response from 550 to 200 packets/sec
Dec 9 13:54:43 knew kernel: Limiting closed port RST response from 410 to 200 packets/sec
Dec 9 13:54:44 knew kernel: Limiting closed port RST response from 460 to 200 packets/sec
Dec 9 13:54:45 knew kernel: Limiting closed port RST response from 527 to 200 packets/sec
Dec 9 13:55:24 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (81 occurrences)
Dec 9 13:56:25 knew kernel: sonewconn: pcb 0xfffff803d40cb960: Listen queue overflow: 151 already in queue awaiting acceptance (129 occurrences)
System was rebooted at about 1241 UTC.
Dec 9
Storm started at 0302 UTC.
sonewconn entries were added to /var/log/messages every minute or so.
System was rebooted at about 1438 UTC.
Points
- There has been a sonewconn storm every night since backups started using tank_data, with one exception: the night the system was rebooted about 25 minutes before the time the storms usually start. Coincidence?
- The system zpool is active during this time. We know that because we can see the new sonewconn log entries in /var/log/messages every minute or so.
- The tank_data pool is active during this time because Bacula uses that as a backup destination. Backups are succeeding.
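An aside: when the console is still usable, pool activity can be watched directly rather than inferred from log writes; the 5-second interval below is arbitrary:

# per-pool and per-vdev I/O, refreshed every 5 seconds
$ zpool iostat -v system tank_data 5

# per-disk busy %, latency, and queue length
$ gstat -p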
Freeze symptoms
When I say freeze, the symptoms are:
- New ssh connections fail to connect
- Existing ssh connections fail to run commands
- All Nagios checks fail with errors such as CHECK_NRPE: Error – Could not complete SSL handshake
- Attempts to log in at the console: user name and password are accepted, the last-login message is flashed, but a login prompt never appears
- An existing gstat command in a tmux session stopped responding
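When it's in that state, a few probes from another machine help tell a total lockup apart from a listen-queue problem; knew is this server, and 5666 is the usual NRPE port:

# does the network stack answer at all?
$ ping -c 3 knew

# do sshd and NRPE still accept TCP connections?
$ nc -vz -w 5 knew 22
$ nc -vz -w 5 knew 5666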
Current theories
Here’s what I think may be wrong:
- PSU being overloaded. It's an 850 W PSU with 24 A on the 5 V rail and 70 A on the 12 V rail. There are 10 HDDs at 12 W each and 10 HDDs at 7 W each, for a total of 190 W, or about 16 A at 12 V. There is plenty of capacity on this single-rail PSU.
- Not enough connections. That's what the message says. But what connections? (See the sysctl sketch after this list.)
- Why has this started only since writing to the new pool?
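On that second point: the limit being hit is the per-socket listen queue (backlog), which is capped system-wide by a sysctl. Here's a sketch of how to inspect and raise that cap on FreeBSD; the 1024 is only an example, and raising it treats the symptom rather than whatever is wedging the accepting process:

# current system-wide cap on listen queue length
# (older releases call it kern.ipc.somaxconn)
$ sysctl kern.ipc.soacceptqueue

# raise it until the next reboot
$ sysctl kern.ipc.soacceptqueue=1024

# the "Limiting closed port RST response" lines come from a separate rate limit
$ sysctl net.inet.icmp.icmplim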