Zero-Trust Architecture with Caddy
Note: This post is an elaboration on the content of the GitHub repository Zero-Trust, TLS Everywhere Caddy Deployment.
For the sake of fun and research, I decided to attempt building a zero-trust deployment of Caddy on DigitalOcean. It is sensible to assume this is doable given Caddy’s PKI and TLS management capabilities. It also allows us to peek into potential gaps in Caddy if it were to be used to deploy zero-trust infrastructure.
⚠️ The project aims to develop a Terraform deployment where the drafted files are likely to be templates for an HCL configuration file; thus the variables (template placeholders) are HCL variables to be interpolated rather than environment variables to be interpreted by Caddy. In other words, this is the reason for using ${variable_name} rather than {$VARIABLE_NAME}. ⚠️
ℹ️ Throughout this article, you will come across a collection of placeholders interpolated by Terraform that are either defined as variables, come from Terraform resources, or come from Terraform data sources. These are the definitions of the variables defined and used in Terraform:
- base_domain: This is the domain that will be used to deploy the infrastructure, e.g. example.com
- base_subdomain: This is the sub-domain with which each component of the infrastructure is suffixed. It can be empty or must end with a period.
- ca_name: This is the name of the CA defined in the pki app of Caddy; it will be used to sign the certificates for the infrastructure. In this article, it's often referred to as internal_acme.
For the sake of this experiment, the adopted zero-trust definition is the one presented in Smallstep’s Practical Zero Trust saying:
Zero Trust or BeyondProd approaches require authenticated and encrypted communications everywhere. TLS is the cryptographic protocol that powers encryption for all your technologies.
Ah, TLS, so Caddy can do it! To distribute TLS certificates to the various components of our infrastructure, we will need a PKI provider. Fortunately, Caddy comes with a PKI app that allows it to act as a certificate authority. Throughout this tutorial, we will build the configuration files progressively to get a grasp of how Caddy works and how to adapt this experiment to your needs. This article was not written progressively as I was developing the infrastructure; rather, it was written after the fact by retracing my steps and generally avoiding the mistakes I made as I was learning.
The infrastructure developed throughout this article will have:
- PostgreSQL database cluster acting as shared storage
- acme_server: an ACME server acting as the PKI provider for all the nodes in the infrastructure, including itself
- upstream_server: an “application” server, which simply replies with “OK!” for the purpose of this experiment, served by Caddy over HTTPS with a TLS certificate issued by the acme_server; it requires incoming connections to be verified with mTLS, and the client's TLS certificate must have been issued by acme_server
- edge_server: the TLS-termination node facing the Internet; it reverse-proxies requests to the upstream_server over HTTPS, and it procures a certificate from acme_server to use for mTLS with the upstream_server
ACME Server, the PKI Provider
The PKI app of Caddy can be configured in the Caddyfile using the pki global option. We will use internal_acme as the referencing name of the internal PKI authority, and the user-facing name is Corp ACME Root; thus the starting Caddyfile is:
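A minimal sketch of such a Caddyfile, using the pki global option with the CA id and display name named above:

```Caddyfile
{
	pki {
		# internal_acme is the id we will reference elsewhere;
		# name is the user-facing name of the root CA.
		ca internal_acme {
			name "Corp ACME Root"
		}
	}
}
```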
By default, Caddy uses the file system as the storage backend for its data. This is not ideal because it is not distributed, and therefore does not allow the data to be shared amongst our different infrastructure pieces. For this need, we will use the postgres-storage module (GitHub repo) by Cory Cooper to store the data in a PostgreSQL database. This requires compiling a custom build of Caddy with the module, which can be easily done using xcaddy. Building the custom Caddy binary is as simple as running the following command:
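A sketch of that build command; the module path here is an assumption, so verify it against the module's GitHub repository before building:

```shell
# Produces a ./caddy binary with the postgres-storage module compiled in.
# The exact module path is per the module's repository.
xcaddy build --with github.com/yroc92/postgres-storage
```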
The postgres-storage module supports Caddyfile unmarshalling, so we can keep using our existing Caddyfile. Since our goal is to have verified TLS everywhere in our infrastructure, we will set sslmode to verify-full:
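A sketch of the amended global options; the storage sub-directive names (host, port, user, password, dbname, sslmode) are assumptions based on the module's Caddyfile support, so check them against the version you build:

```Caddyfile
{
	# Shared storage so all Caddy instances see the same certificates and CA data.
	storage postgres {
		host     ${db_host}
		port     ${db_port}
		user     ${db_user}
		password ${db_password}
		dbname   ${db_name}
		sslmode  verify-full
	}

	pki {
		ca internal_acme {
			name "Corp ACME Root"
		}
	}
}
```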
We will add to the PKI provider an ACME server capability to issue certificates to infrastructure components. The ACME server will only handle requests from within the internal network, and it will be configured to use the internal PKI authority we created earlier. We also want to use the same certificate authority to provide the TLS certificate of the ACME server itself, so we'll add a tls directive saying exactly that. Also, let's have the admin endpoint listen on the private IP address of the node. Because the admin address is not a wildcard interface, Caddy will check the Host header of API requests against the admin endpoint and expect it to match the list given in the origins sub-option of the admin global option in the Caddyfile. Unfortunately, this nuance is not explicitly stated in the Caddy documentation, but we are remedying this soon. Nonetheless, this instance of Caddy will be reachable by a domain name whose A record is the private IP address of the droplet. We need to tell Caddy to expect the defined domain name in origins, which it will use for checking the Host header.
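A sketch of the resulting acme_server Caddyfile, putting the admin, storage, pki, tls, and acme_server pieces together (the storage field names are assumptions, as before):

```Caddyfile
{
	# Admin endpoint bound to the droplet's private IP; Host headers must
	# match the origin listed here.
	admin ${private_ip}:2019 {
		origins acme.internal.${base_subdomain}${base_domain}
	}

	storage postgres {
		host     ${db_host}
		port     ${db_port}
		user     ${db_user}
		password ${db_password}
		dbname   ${db_name}
		sslmode  verify-full
	}

	pki {
		ca internal_acme {
			name "Corp ACME Root"
		}
	}
}

acme.internal.${base_subdomain}${base_domain} {
	# The ACME server's own certificate is issued by the internal CA.
	tls {
		issuer internal {
			ca internal_acme
		}
	}

	# Only clients inside the internal network may talk to the ACME server.
	@internal remote_ip ${ip_range}
	acme_server @internal {
		ca internal_acme
	}
}
```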
The @internal remote_ip ${ip_range} line is called a matcher. It tells Caddy that the acme_server handler should only handle requests that match this condition, i.e. the remote IP address (that of the connecting client) falls within the defined CIDR range. This sets up the guts of the acme_server node. However, there are two issues here, and both highlight a gap in Caddy.
The first issue is a catch-22. Downstream clients will fail to connect to the ACME server to obtain a certificate because the ACME server's certificate authority is not yet trusted by them (the clients). We need to obtain the certificate authority's certificate and add it to the trusted certificate store of each client. We could tell the other Caddy instances to also be PKI providers, sharing the same storage to give them access to the defined CA, but that does not feel right because it mixes the roles of the nodes. We could also call the admin endpoint of the acme_server node from the other nodes, but that would be over HTTP rather than HTTPS, which leaves a gap of verification at that step. In short, CA certificates cannot be shared across Caddy instances without either breaking the security model or sharing the certificate through a separate channel. I aimed to address this gap in PR #5784.
For now, the workaround is to download the certificate on the acme_server node using curl and the Caddy admin API, then copy the file to the other nodes. Remember that I am developing Terraform templates, so the placeholders are for HCL to interpolate. The --resolve flag is used because Terraform has not created the DNS record yet (the droplet provisioning is not complete), so we cannot rely on DNS to point to our droplet, and Caddy is told to do strict origin checking.
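A sketch of that download step; the /pki/ca/&lt;id&gt;/certificates path is Caddy's admin API endpoint for fetching a CA's certificate chain, while the destination path is this deployment's own convention:

```shell
# Run on the acme_server droplet itself. --resolve pins the admin domain
# to the droplet's private IP (the DNS record does not exist yet), and the
# matching Host header satisfies the admin endpoint's strict origin check.
curl -sS \
  --resolve "acme.internal.${base_subdomain}${base_domain}:2019:${private_ip}" \
  "http://acme.internal.${base_subdomain}${base_domain}:2019/pki/ca/${ca_name}/certificates" \
  -o /etc/caddy/ca-root.pem
```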
The second issue is the implicit trust in anyone within the network perimeter to act as an ACME client and receive certificates from the acme_server for any domain they want, as long as the DNS records for the target domain name are reachable from the network to fulfill the challenge. This is a no-bueno. We only want to issue certificates for a particular subdomain of our domain name that identifies it as internal. The smallstep library Caddy uses for its ACME capabilities offers the notion of Certificate Issuance Policies to control certificate issuance. This was not extended into Caddy's usage of the library, and I tried to address this in PR #5796. Once merged, we will be able to define a policy that only allows issuance of certificates for a specified subdomain, e.g. *.internal.${var.base_subdomain}${var.base_domain}. It is for this reason you will notice that every service is configured with a domain of the form <label>.internal.${var.base_subdomain}${var.base_domain}. Having the ability to configure this policy allows us to piggyback on the ability to update DNS records as the authorization mechanism for receiving certificates. It also allows us to limit the scope of a particular ACME server to a particular subdomain, amongst other benefits.
Anyways, in Terraform, I am using the remote provider to copy the content of the downloaded CA certificate into memory, and subsequently onto the other nodes and into an output.
The (Mock) Application Server
The application server does the hard work of responding to the user request by executing the defined business logic. It may interact with the database to fulfill the incoming request. In our use case, the application server will simply answer with “OK!”. It's called upstream_server because it's upstream of the TLS-termination node.
Like the acme_server, the admin endpoint of the application server will be on the private IP address of the droplet, and it will use PostgreSQL for storage. Hence, the global options section of the Caddyfile will be similar to the acme_server node's, except for the pki configuration:
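A sketch of those global options (the storage field names are assumptions, as before):

```Caddyfile
{
	admin ${private_ip}:2019 {
		origins app-1.internal.${base_subdomain}${base_domain}
	}

	storage postgres {
		host     ${db_host}
		port     ${db_port}
		user     ${db_user}
		password ${db_password}
		dbname   ${db_name}
		sslmode  verify-full
	}
}
```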
The site definition for this server is rather simple. I'll summarize it in bullet points so it is easier to follow:
- It declares the domain name app-1.internal.${base_subdomain}${base_domain}
- It tells Caddy to use our ACME server (i.e. https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory) for the TLS certificate
- It trusts the CA file which we've semi-manually copied onto the node
- It requires mTLS, with successful verification that the client certificate was issued by acme_server
- It responds with “OK!”
This results in a simple Caddyfile:
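A sketch of such a Caddyfile; the tls sub-directives used here (ca, ca_root, and client_auth with require_and_verify) map onto the bullets above, but verify them against your Caddy version:

```Caddyfile
app-1.internal.${base_subdomain}${base_domain} {
	tls {
		# Obtain the site certificate from our internal ACME server.
		ca https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory
		# Trust the CA root we copied from acme_server when talking to it.
		ca_root /etc/caddy/ca-root.pem

		# Require clients to present a certificate issued by acme_server.
		client_auth {
			mode require_and_verify
			trusted_ca_cert_file /etc/caddy/ca-root.pem
		}
	}

	respond "OK!" 200
}
```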
There is not much more to it. Of course, for this setup to function, there must be a DNS record for app-1.internal.${base_subdomain}${base_domain} pointing to the private IP address of the droplet. It allows this Caddy server to interact with the ACME server to solve the challenge and obtain the certificate.
The Edge, TLS-Terminator
The TLS-terminator is the node that terminates public TLS connections. It is the node that is exposed to the outside world: it receives the TLS connection from the user and forwards the request to the application server. To communicate with the application server, AKA upstream_server, it needs a client certificate issued by acme_server to fulfill the mTLS authentication.
Unfortunately, because we have a certificate to procure from a private ACME server and automate without the definition of a website, we have to use JSON for configuration instead of the Caddyfile. This is a very tedious task. Whenever I have to use JSON, I start with a Caddyfile that fulfills the basic need, adapt it, clean up the produced JSON if necessary, then add to it the pieces that are missing. The content of the global options section of the Caddyfile resembles the others we have seen so far, in the sense that it amends the admin endpoint to listen on the private IP address of the droplet, sets an expected origin within the internal subdomain, and configures postgres as the storage backend. For the site definition, we will have to use dummy domain names instead of the placeholders we have been using so far, because the adapter is very particular about some of those values, and they make a difference in the adapter's branching logic. The site body should merely reverse-proxy the requests to the upstream address with the HTTPS scheme, give a domain name identifier for Caddy to automate the client certificate for, and define the trusted certificate authority that issued the certificate of the upstream (application) server, i.e. acme_server. Here is the Caddyfile we are starting with:
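A sketch of that starting Caddyfile with dummy values throughout; tls_client_auth with a single argument names a certificate for Caddy to automate, and tls_trusted_ca_certs points at the copied CA file (verify both against your Caddy version; the storage fields remain assumptions):

```Caddyfile
{
	admin 10.0.0.3:2019 {
		origins edge.internal.example.com
	}

	storage postgres {
		host     db.example.com
		port     5432
		user     caddy
		password hunter2
		dbname   caddy
		sslmode  verify-full
	}
}

www.example.com {
	reverse_proxy https://app-1.internal.example.com {
		transport http {
			# Trust the internal CA that issued the upstream's certificate,
			# and automate a client certificate for mTLS with the upstream.
			tls_trusted_ca_certs /etc/caddy/ca-root.pem
			tls_client_auth edge.internal.example.com
		}
	}
}
```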
Note that to adapt the Caddyfile, your existing Caddy binary must support the postgres-storage module. If it doesn't, you may exclude it from the Caddyfile and manually add the storage configuration to the produced JSON, similar to what we will do for the custom configuration of the tls app. Here's the JSON produced from adapting the earlier Caddyfile, with the variables placed where they vary:
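A hand-tidied sketch of what the adapted JSON looks like; the route structure is simplified compared to the adapter's raw output, and the storage object's field names are assumptions:

```json
{
	"admin": {
		"listen": "${private_ip}:2019",
		"origins": ["edge.internal.${base_subdomain}${base_domain}"]
	},
	"storage": {
		"module": "postgres",
		"host": "${db_host}",
		"port": "${db_port}",
		"user": "${db_user}",
		"password": "${db_password}",
		"dbname": "${db_name}",
		"sslmode": "verify-full"
	},
	"apps": {
		"http": {
			"servers": {
				"srv0": {
					"listen": [":443"],
					"routes": [
						{
							"match": [{ "host": ["www.${base_subdomain}${base_domain}"] }],
							"handle": [
								{
									"handler": "reverse_proxy",
									"transport": {
										"protocol": "http",
										"tls": {
											"root_ca_pem_files": ["/etc/caddy/ca-root.pem"],
											"client_certificate_automate": "edge.internal.${base_subdomain}${base_domain}"
										}
									},
									"upstreams": [
										{ "dial": "app-1.internal.${base_subdomain}${base_domain}:443" }
									]
								}
							],
							"terminal": true
						}
					]
				}
			}
		}
	}
}
```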
To automate the procurement and management of the client certificate from the acme_server, we add configuration for the tls app, inside of which we define an automation object with an array of policies inside it. The array contains one policy, for the subject edge.internal.${base_subdomain}${base_domain}. The issuer is acme, whose ca URL is https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory, as we've been using earlier. Of course, we shouldn't forget to trust /etc/caddy/ca-root.pem, which we copied from acme_server. We arrive at this JSON configuration file:
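A sketch of the additions to the earlier JSON: the tls app with its automation policy, and explicit request headers on the reverse_proxy handler (which headers to set is a choice; the field names here follow Caddy's JSON structure, but treat this as a sketch rather than a verbatim config):

```json
{
	"apps": {
		"http": {
			"servers": {
				"srv0": {
					"routes": [
						{
							"handle": [
								{
									"handler": "reverse_proxy",
									"headers": {
										"request": {
											"set": {
												"Host": ["{http.reverse_proxy.upstream.hostport}"],
												"X-Forwarded-Host": ["{http.request.host}"]
											}
										}
									}
								}
							]
						}
					]
				}
			}
		},
		"tls": {
			"automation": {
				"policies": [
					{
						"subjects": ["edge.internal.${base_subdomain}${base_domain}"],
						"issuers": [
							{
								"module": "acme",
								"ca": "https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory",
								"trusted_roots_pem_files": ["/etc/caddy/ca-root.pem"]
							}
						]
					}
				]
			}
		}
	}
}
```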
I have injected the header manipulation in the reverse-proxy because I have trust issues and like to be explicit in this regard. This configuration will start by obtaining a certificate for the domain www.${base_subdomain}${base_domain} from Let's Encrypt or ZeroSSL, and it will obtain a certificate for the domain edge.internal.${base_subdomain}${base_domain} from the internal ACME CA.
Conclusion
For all intents and purposes, the experiment to develop fully authenticated infrastructure with mTLS, with Caddy being the PKI provider, is successful. Caddy is able to facilitate the need to provide and utilize TLS certificates to deploy zero-trust infrastructure. The experiment is also successful in its purpose of highlighting areas of improvement for Caddy to become a better PKI provider for fully automated mTLS deployments.
I opted to exclude the Terraform definitions of the droplets, firewalls, database cluster, and DNS records because Terraform is not the focus of this walkthrough. The GitHub repository contains a runnable Terraform project of the infrastructure. If you're interested in seeing the full project, which I have deployed numerous times as I developed this post, visit the GitHub repository, play with the code, and let me know what you think.