Postfix suddenly starts rejecting email it had been accepting

Let’s Encrypt is an easy way to get free SSL certificates in an automated manner. You may never have to manually do another cert renewal again.

Last night, I received this email:

From: Cron Daemon 
To: dan@langille.org
Subject: Cron  /usr/local/bin/cert-puller
Date: Fri, 23 Feb 2018 23:57:00 +0000 (UTC)

/etc/rc.conf: 3: not found
/etc/rc.conf: yr: not found
/etc/rc.conf: 3: not found
/etc/rc.conf: yr: not found

Little did I know when I tweeted about it that I would be writing a blog post the next morning.

What is that message? It is output from a cron job running the cert-puller script (see more blog posts on anvil). It was downloading a new certificate and installing it. Or rather, it was trying to. It failed because there was an error in /etc/rc.conf. When I ssh’d to cliff2 and checked the file, yes, right there at the top was:

3
yr

I deleted those extraneous lines and checked Nagios. I was sure some email would have been caught up in this. cliff2 is one of two outgoing email servers in my home lab.

I noticed a bunch of outgoing test emails stuck in the queue, like this one:

$ mailq
-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
22BB058BE       541 Sat Feb 24 04:19:13  no-reply-testing@nagios.int.langille.org
(host testing.langille.org[192.0.2.68] said: 454 4.7.1 : Recipient address rejected: Access denied (in reply to RCPT TO command))
                                         dan@testing.langille.org

That’s odd. Why would it be rejecting that?

NOTE: I have massaged most hostnames and IP addresses in this post. If something in the logs / emails seems inconsistent, that is probably why.

Some background first: that email is part of email testing that Nagios runs every five minutes. It sends a test email to my IMAP server, reads it, then deletes it. I run it to verify the delivery mechanisms are working. For more information on that, see Testing email delivery.

But why reject?

I was unsure why it would accept one email and then start rejecting the next. I looked for a Postfix restart, thinking that the configuration had changed. No, no restart. This is what I found in /var/log/maillog on testing:

accepted

Feb 23 23:53:13 testing private5587/smtpd[84128]: connect from nagios.int.langille.org[192.0.2.34]
Feb 23 23:53:14 testing private5587/smtpd[84128]: 0FFBD3F29: client=nagios.int.langille.org[192.0.2.34]
Feb 23 23:53:14 testing postfix/cleanup[84174]: 0FFBD3F29: message-id=<1519429993.kct4v7t8jc56kwld.checksmtpsend@nagios.int.langille.org>
Feb 23 23:53:14 testing postfix/qmgr[96832]: 0FFBD3F29: from=, size=779, nrcpt=1 (queue active)
Feb 23 23:53:14 testing private5587/smtpd[84128]: disconnect from nagios.int.langille.org[192.0.2.34] ehlo=2 starttls=1 mail=1 rcpt=1 data=1 quit=1 commands=7

rejected

Feb 23 23:58:13 testing private5587/smtpd[84834]: connect from nagios.int.langille.org[192.0.2.34]
Feb 23 23:58:14 testing private5587/smtpd[84834]: NOQUEUE: reject: RCPT from nagios.int.langille.org[192.0.2.34]: 454 4.7.1 : Recipient address rejected: Access denied; from= to= proto=ESMTP helo=
Feb 23 23:58:14 testing private5587/smtpd[84834]: disconnect from nagios.int.langille.org[192.0.2.34] ehlo=2 starttls=1 mail=1 rcpt=0/1 data=0/1 rset=1 quit=1 commands=6/8

Postfix had not been restarted or otherwise touched, so what happened here?

I wasn’t sure *why* the email was being rejected, but I was heading out to dinner and this was not urgent.

The next morning, I had a look at the output of postconf -n on testing. The one directive which stood out was:

relay_clientcerts = hash:/usr/local/etc/relay_clientcerts

OH!

The original email was about a new certificate. The certificate on cliff2 was updated. That means the fingerprint values in relay_clientcerts on all the servers are now outdated.

When was that email sent? 23:57. Right in between the two email tests above.

It wasn’t something on the testing host which had changed. It was something on the sending host.

What’s in relay_clientcerts?

From Postfix Configuration Parameters, an example entry in the relay_clientcerts file is:

D7:04:2F:A7:0B:8C:A5:21:FA:31:77:E1:41:8A:EE:80 lutzpc.at.home

How do you generate one of these fingerprints? See http://www.postfix.org/postconf.5.html#smtpd_tls_fingerprint_digest
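
For illustration, here is a minimal sketch of producing such a fingerprint with openssl. It assumes the digest you pick matches the smtpd_tls_fingerprint_digest setting on the receiving server (md5 in the example entry above). The function name cert_fingerprint and the certificate path in the usage comment are hypothetical.

```shell
#!/bin/sh

# Hypothetical helper: print the fingerprint of a PEM certificate.
#   $1 = path to the certificate
#   $2 = digest name, e.g. md5 or sha256 (defaults to md5 here)
# To see which digest the receiving Postfix expects, run on that server:
#   postconf -h smtpd_tls_fingerprint_digest
cert_fingerprint() {
  cert_file="$1"
  digest="${2:-md5}"

  # openssl prints a line such as "MD5 Fingerprint=D7:04:...".
  # The label before the "=" depends on the digest, so strip
  # everything up to and including the first "=".
  openssl x509 -noout -fingerprint "-${digest}" -inform pem -in "${cert_file}" \
    | sed 's/^[^=]*=//'
}

# Example (the path is an assumption):
#   cert_fingerprint /path/to/cliff.example.org.cer md5
```

Stripping on the first "=" rather than on a fixed label means the same helper keeps working if the digest is later changed to something other than md5.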

Automate all the things

I want to automate this. So let’s. Here is the script I have so far:

$ cat ~/relay-cert-fingerprints.sh 
#!/bin/sh

CERTDIR="/var/db/certs-for-rsync/certs/"
CERTS="cliff.unixathome.org"

OPENSSL="/usr/bin/openssl"
SED="/usr/bin/sed"

for cert in ${CERTS}
do
  # openssl prints "MD5 Fingerprint=<value>"; strip the label and keep the value
  fingerprint=$(${OPENSSL} x509 -noout -fingerprint -md5 -inform pem -in "${CERTDIR}/${cert}/${cert}.cer" | ${SED} 's/MD5 Fingerprint=//')
  echo "${fingerprint} ${cert}"
done

CERTDIR is the directory where acme.sh stores all my certs, one certificate per directory.

When I run the script, I get:

$ ~/relay-cert-fingerprints.sh 
D7:04:2F:A7:0B:8C:A5:21:FA:31:77:E1:41:8A:EE:80 cliff.example.org

Why cliff.example.org and not cliff2.example.org? cliff1 and cliff2 are the hostnames, and cliff.example.org resolves to the IP addresses of both. Both hosts present the same cliff.example.org certificate, which allows outgoing mail to be relayed via either host.

Remaining steps

There are a few remaining steps before distribution of relay_clientcerts is automated:

Create a temporary file with all the newly generated fingerprints
Compare the temporary file with the version which was last distributed
If different, save the temp file away for distribution
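
Those three steps could be sketched roughly as follows. This is only an outline under assumptions: the function name update_if_changed and the paths in the usage comment are hypothetical, and the candidate file is expected to hold the output of the fingerprint script.

```shell
#!/bin/sh

# Hypothetical helper: replace the last-distributed copy with the
# candidate file only when the two differ.
#   $1 = freshly generated candidate file
#   $2 = copy that was last distributed
# Prints "updated" or "unchanged"; returns 1 if the candidate is empty.
update_if_changed() {
  candidate="$1"
  current="$2"

  # An empty candidate most likely means fingerprint generation failed,
  # so refuse to distribute it.
  if [ ! -s "${candidate}" ]; then
    echo "candidate file is empty, keeping the existing copy" >&2
    return 1
  fi

  # cmp -s is silent and returns 0 only when both files are identical.
  # A missing ${current} also counts as "different".
  if [ -f "${current}" ] && cmp -s "${candidate}" "${current}"; then
    rm -f "${candidate}"
    echo "unchanged"
  else
    mv "${candidate}" "${current}"
    echo "updated"
  fi
}

# Example (paths are assumptions):
#   tmpfile=$(mktemp /tmp/relay_clientcerts.XXXXXX)
#   ~/relay-cert-fingerprints.sh > "${tmpfile}"
#   update_if_changed "${tmpfile}" /path/to/distribution/relay_clientcerts
```

Because the saved copy only changes when a fingerprint changes, whatever distributes it can rely on the file's contents or timestamp to decide whether there is anything new to pull.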

The file can be distributed using the same method anvil uses to distribute the certificates.
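
On each receiving mail server, the pulled file also has to be compiled, because relay_clientcerts is declared as a hash: map. A rough sketch of that step, assuming postmap is on the PATH; the function name install_relay_clientcerts and the source path in the usage comment are hypothetical.

```shell
#!/bin/sh

# Hypothetical helper for the receiving mail server: put a newly pulled
# fingerprint file in place and rebuild the indexed map Postfix reads.
#   $1 = newly pulled file
#   $2 = live file, e.g. /usr/local/etc/relay_clientcerts
# POSTMAP can be overridden (for example with "echo") for a dry run.
install_relay_clientcerts() {
  new_file="$1"
  live_file="$2"

  cp "${new_file}" "${live_file}" || return 1

  # With a hash: table, Postfix reads the compiled .db file;
  # postmap regenerates it from the text file.
  ${POSTMAP:-postmap} "hash:${live_file}"
}

# Example (the source path is an assumption):
#   install_relay_clientcerts /path/to/pulled/relay_clientcerts /usr/local/etc/relay_clientcerts
```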

I am not sure if I should amend cert-puller or create a new script just for postfix configuration files. The issue is closely related to certificates, as the file contains certificate fingerprints.

I am leaning towards creating a new script. Oh damn, just now, I thought of calling it cert-fingerprint-puller or perhaps cert-finger-puller….

The script would be nearly identical to cert-puller, but would need to run on a fairly regular basis.

cert-puller can be run daily, because a client is in no rush to update the cert. There are many days of wiggle time, with Nagios checks to alert me to any certs which fail to update in a timely manner.