I use bind (Berkeley Internet Name Domain) as my DNS server. I am currently running bind 9.11.5P1 on FreeBSD 11.2-RELEASE-p8 in a jail, with iocage as my jail manager. The OS, jail, and jail manager should play no part in how this works.
I have been collecting statistics from bind for some time. I have configured LibreNMS to collect the details via snmpd and they are plotted in a lovely looking graph.
The problem I have been trying to solve: the statistics stop from time to time.
My first tweet about this problem was on 25 Dec 2018. I started to suspect my Ansible configuration, but that was quickly disproved. Much information has been gathered over the past 48 hours, with sufficient complexity to warrant writing about it.
Please note: some of the times shown might not match the timeline of the narrative. That is because I was reproducing the issue today while writing this.
Why do they stop?
The statistics would suddenly stop without any obvious reason. I would fix the permissions, but there would be a gap in the data.
It took me a while to figure out why the statistics stopped, but the answer was right there in the log files:
16-Jan-2019 17:30:12.035 received control channel command 'stats' 16-Jan-2019 17:30:12.044 could not open statistics dump file '/var/run/named/stats': permission denied 16-Jan-2019 17:30:12.044 dumpstats failed: permission denied
During testing today, I could reproduce the issue and looking at the file, I found:
[dan@toiler:~] $ ls -l /var/run/named/ total 11 -rw-r--r-- 1 bind bind 6 Jan 21 15:16 pid -rw------- 1 bind bind 102 Jan 21 15:16 session.key -rw-r--r-- 1 root bind 1 Jan 21 15:20 stats
Why does the file change permissions to chown root:bind?
Let’s fix the permissions
The permissions are fixed easily:
[dan@toiler:~] $ sudo chown bind /var/run/named/stats [dan@toiler:~] $ ls -l /var/run/named/stats -rw-r--r-- 1 bind bind 1 Jan 21 15:20 /var/run/named/stats
When the next 5-minute interval comes along, the stats are updated:
$ tail -F /var/log/named/general.log 21-Jan-2019 15:25:21.923 received control channel command 'stats' 21-Jan-2019 15:25:21.923 dumpstats complete
OK, we’re good.
Until we’re not.
By 6 January, it happened again.
Again on 18 January, and that’s when I really started getting into this issue.
The problem kept recurring and I kept losing graphing data. I went so far as to create a Nagios plugin to check the permissions so I would at least know there was an issue instead of waiting until I checked the graphs.
The hunt
The problem would recur from time to time and the file would again have the wrong permissions. The bind port itself came under what turned out to be a false accusation. This arose because I noticed that the graph stoppage corresponded with a pkg upgrade.
That red herring moved quickly out of focus and I discovered that restart the jail would cause the /var/run/named/stats file to disappear.
I then turned to more experiments and discovered something rather interesting: the file disappears on reboot.
The file disappears on reboot
What?
Yes, /var/run/named/stats disappeared after restarting the jail.
Why?
To test that this file was not being singled out, I did some tests, and found that /var/run/TESTING/file disappears on reboot.
WTF?
I asked on a FreeBSD IRC channel and the first reply was:
[Jan 20 14:57] <@dvl> Does the base system have something which could empty the contents of a directory in /var/run/TESTING/ ? I ask because that is what I'm seeing in a jail under both 11.2 and 12.0. [Jan 20 14:57] <@dvl> Change TESTING to whatever value you want, and it happens [Jan 20 14:58] <cem> dvl: cleanvar? [Jan 20 14:58] <cem> (/etc/rc.d/cleanvar) [Jan 20 14:58] <@dvl> cem: well, bugger me: cleanvar_enable="YES" # Clean the /var directory [Jan 20 14:59] <cem> 'find /var/run ( -type f ... ) -delete'
This sounded familiar but I could not recall anything about it.
Since this was the default action on all my systems, I was not going to change it now.
At least now I knew why the file was disappearing.
This meant that named was creating the file with chown root:bind, not that something was changing the owner.
Why named, why?
My post to the BIND Users Mailing List was written while I was composing this blog post. My current goal is to move the file outside /var/run/ directory and into /var/db/named/ but I do not want to deploy this solution to my other servers until I find out more about this from the BIND project.
A working solution
I have a working solution, but it is not deployed.
First, i create the new directory, owned by bind:
mkdir /var/db/named/ chown bind:bind /var/db/named/
Then I change the logging configuration setting:
$ grep stats /usr/local/etc/namedb/named.conf statistics-file "/var/db/named/stats";
I restarted the jail and looked in the directories:
[dan@toiler:~] $ ls -l /var/run/named/ /var/db/named/ /var/db/named/: total 0 /var/run/named/: total 10 -rw-r--r-- 1 bind bind 6 Jan 21 16:00 pid -rw------- 1 bind bind 102 Jan 21 16:00 session.key [dan@toiler:~] $
Then I monitored the logs:
[dan@toiler:~] $ tail -F /var/log/named/general.log 21-Jan-2019 16:00:52.800 zone 7.e.f.ip6.arpa/IN: loaded serial 42 21-Jan-2019 16:00:52.801 zone 8.e.f.ip6.arpa/IN: loaded serial 42 21-Jan-2019 16:00:52.801 zone d.e.f.ip6.arpa/IN: loaded serial 42 21-Jan-2019 16:00:52.801 zone e.e.f.ip6.arpa/IN: loaded serial 42 21-Jan-2019 16:00:52.801 zone f.e.f.ip6.arpa/IN: loaded serial 42 21-Jan-2019 16:00:52.801 zone localhost/IN: loaded serial 42 21-Jan-2019 16:00:52.801 all zones loaded 21-Jan-2019 16:00:52.801 running 21-Jan-2019 16:05:23.991 received control channel command 'stats' 21-Jan-2019 16:05:23.991 dumpstats complete
Success. Let’s see the files:
[dan@toiler:~] $ ls -l /var/run/named/ /var/db/named/ /var/db/named/: total 10 -rw-r--r-- 1 bind bind 24777 Jan 21 16:15 stats /var/run/named/: total 11 -rw-r--r-- 1 bind bind 6 Jan 21 16:00 pid -rw------- 1 bind bind 102 Jan 21 16:00 session.key -rw-r--r-- 1 root bind 1 Jan 21 16:15 stats [dan@toiler:~] $
Exactly as they should be.
I’ll run with that configuration on this one server for now.
Again, why?
Why does the file get created chown root:bind in /var/run/named/ but chown bind:bind in /var/db/named/?
The two directories have identical permissions:
[dan@toiler:~] $ ls -ld /var/run/named/ /var/db/named/ drwxr-xr-x 2 bind bind 3 Jan 21 16:05 /var/db/named/ drwxr-xr-x 2 bind bind 5 Jan 21 16:05 /var/run/named/
Very odd to me.
Edit 2019-01-22 Found it
One critical piece I had forgotten: who/what creates the stats file?
It is a call to rndc stats.
Who does that?
snmpd via the LibreNMS bind application.
snmpd runs as root on my systems.
I looked at the code and here is my patch. If a bind-user is specified, the commands will be run as that user.
I am sure that the file was created via echo >, which run as root, creates a root-owned file, which named, running as bind, can not write to. This is what caused the error.
I’ve finally put this patch onto my DNS services. I won’t know for sure that they work until after the next reboot.
They seem to be working fine.
I just fixed this on 6 named servers. Seems I never rolled this out.