I had a Bacula job fail today:
22-Oct 20:08 nyi-fd JobId 147614: Warning: bsock.c:128 Could not connect to Storage daemon on crey.example.org:9103. ERR=Operation timed out
Retrying ...
22-Oct 20:35 nyi-fd JobId 147614: Fatal error: bsock.c:134 Unable to connect to Storage daemon on crey.example.org:9103. ERR=Interrupted system call
22-Oct 20:35 nyi-fd JobId 147614: Fatal error: Failed to connect to Storage daemon: crey.example.org:9103
22-Oct 20:35 bacula-dir JobId 147614: Fatal error: Bad response to Storage command: wanted 2000 OK storage, got 2902 Bad storage
Is bacula-sd running on crey? Yes it is. Can I telnet to port 9103 on crey? Yes, I can:
$ telnet 10.5.0.20 9103
Trying 10.5.0.20...
Connected to crey.example.org.
Escape character is '^]'.
What about from the nyi-fd server? Can I telnet from there?
$ telnet 10.5.0.20 9103
Trying 10.5.0.20...
telnet: connect to address 10.5.0.20: Operation timed out
telnet: Unable to connect to remote host
I started tcpdump on the gateway, and on the 10.5.0.20 host. Things just weren’t getting through. On the host which could not be reached, I saw:
20:53:23.777116 ARP, Request who-has 10.4.2.20 tell 10.5.0.20, length 46
20:53:24.778458 IP 10.4.2.20 > 10.5.0.10: ICMP echo request, id 43677, seq 99, length 64
20:53:24.778582 ARP, Request who-has 10.8.1.20 tell 10.5.0.20, length 46
20:53:25.779665 IP 10.4.2.20 > 10.5.0.10: ICMP echo request, id 43677, seq 100, length 64
20:53:25.779777 ARP, Request who-has 10.8.1.20 tell 10.5.0.20, length 46
I had no idea…
Hmm. Well. It took me a while, but I finally remembered adding an alias to the NIC on the jail host for crey. It looked like this:
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 00:25:90:82:21:5a
        inet 10.5.0.74 netmask 0xffffff00 broadcast 10.5.0.255
        inet6 fe80::225:90ff:fe82:215a%em0 prefixlen 64 scopeid 0x1
        inet 10.5.0.10 netmask 0xffffffff broadcast 10.5.0.10
        inet 10.5.0.102 netmask 0xffffffff broadcast 10.5.0.102
        inet 10.5.0.111 netmask 0xffffffff broadcast 10.5.0.111
        inet 10.5.0.104 netmask 0xffffffff broadcast 10.5.0.104
        inet 10.5.0.105 netmask 0xffffffff broadcast 10.5.0.105
        inet 10.5.0.110 netmask 0xffffffff broadcast 10.5.0.110
        inet 10.5.0.114 netmask 0xffffffff broadcast 10.5.0.114
        inet 10.5.0.20 netmask 0xffffffff broadcast 10.5.0.20
        inet 10.5.0.140 netmask 0xffffffff broadcast 10.5.0.140
        inet 10.5.0.112 netmask 0xffffffff broadcast 10.5.0.112
        inet 10.5.0.13 netmask 0xffffffff broadcast 10.5.0.13
        inet 10.5.0.14 netmask 0xffffffff broadcast 10.5.0.14
        inet 10.5.0.15 netmask 0xffffffff broadcast 10.5.0.15
        inet 10.5.0.106 netmask 0xffffffff broadcast 10.5.0.106
        inet 10.5.0.127 netmask 0xff000000 broadcast 255.255.255.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
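Look at line 20, the entry for 10.5.0.127. See how its netmask (0xff000000) differs from the others, and how its broadcast address is 255.255.255.255? That’s the cause. With a /8 netmask on that alias, the host treats all of 10.0.0.0/8 as directly attached, which is why it was sending ARP requests for 10.4.2.20 and 10.8.1.20 instead of sending the replies via the gateway. I had added the alias with this command: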
ifconfig em0 alias 10.5.0.127 255.255.255.255
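Without the netmask keyword, ifconfig appears to have taken 255.255.255.255 as the broadcast address and fallen back to the classful /8 netmask for a 10.x address. I should have used this command: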
ifconfig em0 alias 10.5.0.127 netmask 255.255.255.255
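To see why the netmask matters so much here, the following is a minimal sketch in Python (not something I ran at the time) that checks whether the hosts from the tcpdump output count as directly attached under each netmask. The function name on_link is my own invention for illustration; the addresses and masks come from the output above.

```python
import ipaddress

# Addresses taken from the ifconfig and tcpdump output above.
ALIAS = "10.5.0.127"
REMOTE_HOSTS = ["10.4.2.20", "10.8.1.20"]


def on_link(alias_ip, netmask, remote_ip):
    """Return True if remote_ip lies inside the network implied by alias_ip/netmask.

    A host only ARPs for addresses it believes are on a directly attached
    network; everything else is handed to a gateway.
    """
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.ip_network(f"{alias_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(remote_ip) in network


if __name__ == "__main__":
    # 255.0.0.0 is the mask the faulty alias ended up with (0xff000000);
    # 255.255.255.255 is the mask that was intended.
    for mask in ("255.0.0.0", "255.255.255.255"):
        for host in REMOTE_HOSTS:
            print(f"mask {mask}: {host} on-link? {on_link(ALIAS, mask, host)}")
```

With the /8 mask both remote hosts look local, which matches the ARP requests seen in tcpdump. With the /32 mask neither does, so traffic to them follows the normal route.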
I removed the faulty alias with this command:
ifconfig em0 delete 10.5.0.127
And then issued the correct command. Everything ran fine then.
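For next time, a rough sketch of a check that might catch this sooner: scan FreeBSD-style ifconfig output and report any alias whose netmask is not 0xffffffff. It assumes the first inet entry is the primary address and that every later one is meant to be a /32 alias, which is true on this host but not everywhere. The function name suspicious_aliases is hypothetical, and the sample text is abbreviated from the ifconfig output above.

```python
import re

# Matches IPv4 entries such as "inet 10.5.0.127 netmask 0xff000000".
# The space after "inet" keeps inet6 lines from matching.
INET_RE = re.compile(r"\binet (\S+) netmask (0x[0-9a-fA-F]{8})")


def suspicious_aliases(ifconfig_text):
    """Return (address, netmask) pairs for aliases that are not /32.

    Assumption: the first inet entry is the primary address and is skipped;
    all later entries are expected to carry netmask 0xffffffff.
    """
    entries = INET_RE.findall(ifconfig_text)
    return [(addr, mask) for addr, mask in entries[1:]
            if int(mask, 16) != 0xFFFFFFFF]


# Abbreviated from the ifconfig output shown earlier.
SAMPLE = """\
inet 10.5.0.74 netmask 0xffffff00 broadcast 10.5.0.255
inet6 fe80::225:90ff:fe82:215a%em0 prefixlen 64 scopeid 0x1
inet 10.5.0.10 netmask 0xffffffff broadcast 10.5.0.10
inet 10.5.0.20 netmask 0xffffffff broadcast 10.5.0.20
inet 10.5.0.127 netmask 0xff000000 broadcast 255.255.255.255
"""

if __name__ == "__main__":
    for addr, mask in suspicious_aliases(SAMPLE):
        print(f"check alias {addr}: netmask {mask}")
```

In practice the text would come from running ifconfig em0 rather than from a hard-coded sample.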
I am positive I encountered this problem before…