Zero-Trust Architecture with Caddy

Mohammed Al-Sahaf

Note: This post is an elaboration on the content of the GitHub repository Zero-Trust, TLS Everywhere Caddy Deployment.

For the sake of fun and research, I decided to attempt building a zero-trust deployment of Caddy on DigitalOcean. It is sensible to assume this is doable given Caddy’s PKI and TLS management capabilities. It also allows us to peek into potential gaps in Caddy if it were to be used to deploy zero-trust infrastructure.

⚠️ The project aims to develop a Terraform deployment, so the drafted files are templates rendered into HCL configuration. The variables (template placeholders) are therefore HCL variables interpolated by Terraform, not environment variables interpreted by Caddy. In other words, this is why the examples use ${variable_name} rather than {$VARIABLE_NAME}. ⚠️
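To make the distinction concrete, here is a hypothetical storage snippet showing both placeholder styles side by side (the variable names are illustrative):

```caddyfile
{
    storage postgres {
        # Interpolated by Terraform while rendering the template,
        # before Caddy ever reads the file:
        connection_string "postgres://${db_user}:${db_password}@${db_host}/${db_name}"

        # A Caddy-native placeholder, resolved by Caddy itself from the
        # process environment at load time, would instead look like:
        # connection_string "postgres://{$DB_USER}:{$DB_PASSWORD}@{$DB_HOST}/{$DB_NAME}"
    }
}
```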

ℹ️ Throughout this article, you will come across a collection of placeholders interpolated by Terraform that are either defined as variables, come from Terraform resources, or come from Terraform data sources. These are the definitions of the variables defined and used in Terraform:

base_domain: This is the domain that will be used to deploy the infrastructure, e.g. example.com

base_subdomain: This is the sub-domain by which each component of the infrastructure is prefixed. It must either be empty or end with a period (.).

ca_name: This is the name of the CA defined in the pki app of Caddy and will be used to sign the certificates for the infrastructure. In this article, it’s often referred to as internal_acme.
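As a sketch of how these might be declared (the exact definitions live in the repository; the validation rule on base_subdomain is my own illustration of the "empty or ends with a period" constraint, and endswith requires Terraform 1.3+):

```hcl
variable "base_domain" {
  type        = string
  description = "Domain used to deploy the infrastructure, e.g. example.com"
}

variable "base_subdomain" {
  type        = string
  default     = ""
  description = "Sub-domain prefix; empty or ending with a period"

  validation {
    condition     = var.base_subdomain == "" || endswith(var.base_subdomain, ".")
    error_message = "base_subdomain must be empty or end with a period."
  }
}

variable "ca_name" {
  type        = string
  default     = "internal_acme"
  description = "Name of the CA defined in Caddy's pki app"
}
```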

For the sake of this experiment, the adopted zero-trust definition is the one presented in Smallstep’s Practical Zero Trust saying:

Zero Trust or BeyondProd approaches require authenticated and encrypted communications everywhere. TLS is the cryptographic protocol that powers encryption for all your technologies.

Ah, TLS, so Caddy can do it! To distribute TLS certificates to the various components in our infrastructure, we will need a PKI provider. Fortunately, Caddy comes with a PKI app that allows it to act as a certificate authority. Throughout this tutorial, we will build the configuration files progressively to get a grasp of how Caddy works and how to adapt this experiment to your needs. This article was not written progressively as I was developing the infrastructure; rather, it was developed after the fact by retracing my steps and generally avoiding the mistakes I made while learning.

The infrastructure developed throughout this article will have three components: an ACME server acting as the PKI provider, a (mock) application server, and a TLS-terminating edge node.

ACME Server, the PKI Provider

The PKI app of Caddy can be configured in the Caddyfile using the pki global option. We will use internal_acme as the reference name of the internal PKI authority, with Corp ACME Root as the user-facing name, so the starting Caddyfile is:

{
    # added to avoid installing the CA into the running system
    skip_install_trust
    pki {
        ca ${ca_name} {
            name "Corp ACME Root"
        }
    }
}

By default, Caddy uses the file system as the storage backend for its data. This is not ideal because the file system is not distributed, so it does not scale; a shared storage backend, on the other hand, allows the data to be shared amongst our different infrastructure pieces. For this need, we will use the postgres-storage module (GitHub repo) by Cory Cooper to store the data in a PostgreSQL database. This requires compiling a custom build of Caddy with the module, which can be easily done using xcaddy. Building the custom Caddy binary is as simple as running the following command:

xcaddy build --with github.com/yroc92/postgres-storage
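To confirm the module actually made it into the custom build (a quick sanity check of my own, not part of the repository's scripts), list the compiled-in modules:

```shell
# The postgres storage module should appear in the output.
./caddy list-modules | grep -i postgres
```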

The postgres-storage module supports Caddyfile unmarshalling, so we can keep using our existing Caddyfile. Since our goal is to have verified TLS everywhere in our infrastructure, we will set sslmode to verify-full:

{
    # added to avoid installing the CA into the running system; not needed.
    skip_install_trust
    pki {
        ca ${ca_name} {
            name "Corp ACME Root"
        }
    }
    storage postgres {
        # sslrootcert defaults to "~/.postgresql/root.crt", but can be provided
        # in the connection string.
        # https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
        sslmode "verify-full"
        connection_string "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    }
}

We will add to the PKI provider an ACME server capability to issue certificates to infrastructure components. The ACME server will only handle requests from within the internal network, and will be configured to use the internal PKI authority we created earlier. We also want to use the same certificate authority to provide the TLS certificate to the ACME server itself, so we'll add a tls directive saying exactly that. Also, let's have the admin endpoint listen on the private IP address of the node. Because the admin address is not a wildcard interface, Caddy will check the Host header of API requests against the admin endpoint and expect it to match the list given in the origins sub-option of the admin global option in the Caddyfile. Unfortunately, this nuance is not explicitly stated in the Caddy documentation, but we are remedying this soon. Nonetheless, this instance of Caddy will be reachable by a domain name whose A record is the private IP address of the droplet. We need to tell Caddy to expect the defined domain name in origins, which it will use when checking the Host header.


{
    admin ${private_ip}:2019 {
        origins acme.internal.${base_subdomain}${base_domain}:2019
    }
    # added to avoid installing the CA into the running system; not needed.
    skip_install_trust
    pki {
        ca ${ca_name} {
            name "Corp ACME Root"
        }
    }
    storage postgres {
        # sslrootcert defaults to "~/.postgresql/root.crt", but can be provided
        # in the connection string.
        # https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
        sslmode "verify-full"
        connection_string "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    }
}
acme.internal.${base_subdomain}${base_domain} {
    tls {
        issuer internal {
            ca ${ca_name}
        }
    }
    @internal remote_ip ${ip_range}
    acme_server @internal {
        ca ${ca_name}
    }
}

The @internal remote_ip ${ip_range} line is called a matcher. It tells Caddy that the acme_server handler should only handle requests matching this condition, i.e. the remote IP address (of the connecting client) falls within the defined CIDR range. This sets up the guts of the acme_server node. However, there are two issues here, and both highlight a gap in Caddy.

The first issue is a catch-22. Downstream clients will fail to connect to the ACME server to obtain a certificate because the ACME server's certificate authority is not yet trusted by them. We need to obtain the certificate authority's certificate and add it to the trusted certificate store of each client. We could tell the other Caddy instances to also be PKI providers, sharing the same storage to give them access to the defined CA, but that does not feel right because it mixes the roles of the nodes. We could also call the admin endpoint of the acme_server node from the other nodes, but that would be over HTTP rather than HTTPS, leaving a gap of verification at that step. In short, CA certificates cannot be shared across Caddy instances without either breaking the security model or sharing the certificate through a separate channel. I aimed to address this gap in PR #5784.

For now, the workaround is to download the certificate on the acme_server node using curl and the Caddy admin API, then copy the file to the other nodes. Remember that I am developing Terraform templates, so the placeholders are for HCL to interpolate. The --resolve flag is used because Terraform has not yet created the DNS record (the droplet provisioning is not complete), so we cannot rely on DNS to point to our droplet, and Caddy is told to do strict origin checking.

curl \
    --retry 10 \
    --retry-connrefused \
    --retry-delay 0 \
    --resolve acme.internal.${var.base_subdomain}${var.base_domain}:2019:${self.ipv4_address_private} \
    http://acme.internal.${var.base_subdomain}${var.base_domain}:2019/pki/ca/${var.ca_name} \
     | jq -r .root_certificate > /etc/caddy/ca-root.pem

The second issue is the implicit trust in anyone within the network perimeter to act as an ACME client and receive certificates from the acme_server for any domain they want, as long as the DNS records for the target domain name are resolvable from the network to fulfill the challenge. This is no bueno. We only want to issue certificates for a particular subdomain of our domain name that identifies it as internal. The smallstep library Caddy uses for its ACME capabilities offers the notion of Certificate Issuance Policies to control certificate issuance. This was not extended into Caddy's usage of the library, and I tried to address that in PR #5796. Once merged, we will be able to define a policy that only allows issuance of certificates for a specified subdomain, e.g. *.internal.${var.base_subdomain}${var.base_domain}. It is for this reason that you will notice every service is configured with a domain of the form <label>.internal.${var.base_subdomain}${var.base_domain}. Having the ability to configure this policy lets us piggyback on the ability to update DNS records as the authorization mechanism for receiving certificates. It also allows us to limit the scope of a particular ACME server to a particular subdomain, amongst other benefits.

Anyways, in Terraform, I am using the remote provider to read the content of the downloaded CA certificate into memory, then to copy it onto the other nodes and expose it as an output.
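As a rough sketch of that step (the resource name, connection details, and local value here are illustrative, not the repository's exact definitions):

```hcl
# Illustrative only: push the CA root captured from the acme_server
# node onto another node, and expose it as a Terraform output.
resource "null_resource" "distribute_ca_root" {
  connection {
    type        = "ssh"
    host        = var.node_private_ip # illustrative variable
    user        = "root"
    private_key = var.ssh_private_key # illustrative variable
  }

  provisioner "file" {
    content     = local.ca_root_pem # CA root read off the acme_server node
    destination = "/etc/caddy/ca-root.pem"
  }
}

output "ca_root_certificate" {
  value = local.ca_root_pem
}
```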

The (Mock) Application Server

The application server does the hard work of responding to the user request by executing the defined business logic. It may interact with the database to fulfill the incoming request. In our use case, the application server will simply answer with “OK!”. It’s called upstream_server because it’s upstream of the TLS-termination node.

Like the acme_server, the admin endpoint of the application server will be on the private IP address of the droplet, and it will use PostgreSQL for storage. Hence, the global options section of the Caddyfile will be similar to the acme_server node's, except for the pki configuration:

{
    admin "${private_ip}:2019" {
        origins app-1.internal.${base_subdomain}${base_domain}:2019
    }
    storage postgres {
        # sslrootcert defaults to "~/.postgresql/root.crt", but can be provided
        # in the connection string.
        # https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
        sslmode "verify-full"
        connection_string "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    }
}

The site definition part for this server is rather simple, and we end up with this Caddyfile:

{
    admin "${private_ip}:2019" {
        origins app-1.internal.${base_subdomain}${base_domain}:2019
    }
    storage postgres {
        # sslrootcert defaults to "~/.postgresql/root.crt", but can be provided
        # in the connection string.
        # https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
        sslmode "verify-full"
        connection_string "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    }
}
app-1.internal.${base_subdomain}${base_domain} {
    tls {
        client_auth {
            mode require_and_verify
            trusted_ca_cert_file /etc/caddy/ca-root.pem
        }
        issuer acme {
            dir https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory
            trusted_roots /etc/caddy/ca-root.pem
        }
    }
    respond "OK!"
}

There is not much more to it. Of course, for this setup to function, there must be a DNS record for app-1.internal.${base_subdomain}${base_domain} pointing to the private IP address of the droplet. This allows the Caddy server to interact with the ACME server, solve the challenge, and obtain the certificate.
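To verify the mTLS handshake by hand, one can present a client certificate issued by the internal CA (the file paths here are hypothetical); because of require_and_verify, the same request without --cert and --key should be rejected:

```shell
# Client credentials issued by the internal CA (hypothetical paths).
curl \
    --cacert /etc/caddy/ca-root.pem \
    --cert /etc/caddy/client.pem \
    --key /etc/caddy/client.key \
    https://app-1.internal.${base_subdomain}${base_domain}/
```

A successful request returns the application server's “OK!” body.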

The Edge, TLS-Terminator

The TLS-terminator is the node that terminates public TLS connections. It is the node that is exposed to the outside world: it receives the TLS connection from the user and forwards the request to the application server. To communicate with the application server, AKA upstream_server, it needs a client certificate issued by acme_server to fulfill the mTLS authentication.

Unfortunately, because we have a certificate to procure from a private ACME server, and its automation is not tied to the definition of a website, we have to use JSON for configuration instead of the Caddyfile. This is a very tedious task. Whenever I have to use JSON, I start with a Caddyfile that fulfills the basic need, adapt it, clean up the produced JSON if necessary, then add to it the pieces that are missing. The global options section of this Caddyfile resembles the others we have seen so far, in the sense that it amends the admin endpoint to listen on the private IP address of the droplet, sets an expected origin within the internal subdomain, and configures postgres as the storage backend. For the site definition, we have to use dummy domain names instead of the placeholders we have been using so far, because the adapter is very particular about some of those values: they make a difference in the adapter's branching logic. The site body should merely reverse-proxy requests to the upstream address with the HTTPS scheme, give a domain name identifier for which Caddy should automate the client certificate, and define the trusted certificate authority that issued the certificate for the upstream (application) server, i.e. acme_server. Here is the Caddyfile we are starting with:

{
    admin "${private_ip}:2019" {
        origins edge.internal.${base_subdomain}${base_domain}:2019
    }
    storage postgres {
        # sslrootcert defaults to "~/.postgresql/root.crt", but can be provided
        # in the connection string.
        # https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
        sslmode "verify-full"
        connection_string "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    }
}
example.com {
    reverse_proxy https://app-1.internal {
        transport http {
            tls
            tls_client_auth client-cert.com
            tls_trusted_ca_certs /etc/caddy/ca-cert.pem
        }
    }
}
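The adaptation itself is done with the caddy adapt subcommand, which reads the Caddyfile and prints the equivalent JSON:

```shell
# Emit the JSON equivalent of the Caddyfile; --pretty makes it readable.
caddy adapt --config Caddyfile --pretty > caddy.json
```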

Note that to adapt the Caddyfile, your existing Caddy binary must support the postgres-storage module. If it doesn't, you can exclude the storage configuration from the Caddyfile and manually add it to the produced JSON, similar to what we will do for the custom configuration of the tls app. Here's the JSON produced by adapting the earlier Caddyfile, with the variables placed where the values vary:

{
    "admin": {
        "listen": "${private_ip}:2019",
        "origins": [
            "edge.internal.${base_subdomain}${base_domain}:2019"
        ]
    },
    "storage": {
        "module": "postgres",
        "connection_string": "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    },
    "apps": {
        "http": {
            "servers": {
                "srv0": {
                    "listen": [
                        ":443"
                    ],
                    "routes": [
                        {
                            "match": [
                                {
                                    "host": [
                                        "www.${base_subdomain}${base_domain}"
                                    ]
                                }
                            ],
                            "handle": [
                                {
                                    "handler": "subroute",
                                    "routes": [
                                        {
                                            "handle": [
                                                {
                                                    "handler": "reverse_proxy",
                                                    "transport": {
                                                        "protocol": "http",
                                                        "tls": {
                                                            "client_certificate_automate":  "edge.internal.${base_subdomain}${base_domain}",
                                                            "root_ca_pem_files": [
                                                                "/etc/caddy/ca-cert.pem"
                                                            ]
                                                        }
                                                    },
                                                    "upstreams": [
                                                        {
                                                            "dial": "app-1.internal.${base_subdomain}${base_domain}:443"
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ],
                            "terminal": true
                        }
                    ]
                }
            }
        }
    }
}

To automate the procurement and management of the client certificate from the acme_server, we add configuration for the tls app, inside which we define an automation object containing an array of policies. The array contains one policy, for the subject edge.internal.${base_subdomain}${base_domain}. The issuer is acme, whose ca URL is https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory, as we've been using earlier. Of course, we shouldn't forget to trust /etc/caddy/ca-root.pem, which we copied from the acme_server. We arrive at this JSON configuration file:


{
    "admin": {
        "listen": "${private_ip}:2019",
        "origins": [
            "edge.internal.${base_subdomain}${base_domain}:2019"
        ]
    },
    "storage": {
        "module": "postgres",
        "connection_string": "postgres://${db_user}:${db_password}@${db_host}:${db_port}/${db_name}?sslmode=verify-full&sslrootcert=${root_cert}"
    },
    "apps": {
        "tls": {
            "automation": {
                "policies": [
                    {
                        "subjects": [
                            "edge.internal.${base_subdomain}${base_domain}"
                        ],
                        "issuers": [
                            {
                                "module": "acme",
                                "ca": "https://acme.internal.${base_subdomain}${base_domain}/acme/${ca_name}/directory",
                                "trusted_roots_pem_files": [
                                    "/etc/caddy/ca-root.pem"
                                ]
                            }
                        ]
                    }
                ]
            }
        },
        "http": {
            "servers": {
                "srv0": {
                    "listen": [
                        ":443"
                    ],
                    "routes": [
                        {
                            "match": [
                                {
                                    "host": [
                                        "www.${base_subdomain}${base_domain}"
                                    ]
                                }
                            ],
                            "handle": [
                                {
                                    "handler": "subroute",
                                    "routes": [
                                        {
                                            "handle": [
                                                {
                                                    "handler": "reverse_proxy",
                                                    "headers": {
                                                        "request": {
                                                            "set": {
                                                                "Host": [
                                                                    "{http.reverse_proxy.upstream.host}"
                                                                ]
                                                            }
                                                        }
                                                    },
                                                    "transport": {
                                                        "protocol": "http",
                                                        "tls": {
                                                            "client_certificate_automate":  "edge.internal.${base_subdomain}${base_domain}",
                                                            "root_ca_pem_files": [
                                                                "/etc/caddy/ca-cert.pem"
                                                            ]
                                                        }
                                                    },
                                                    "upstreams": [
                                                        {
                                                            "dial": "app-1.internal.${base_subdomain}${base_domain}:443"
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ],
                            "terminal": true
                        }
                    ]
                }
            }
        }
    }
}

I have injected the header manipulation into the reverse-proxy because I have trust issues and like to be explicit in this regard. This configuration will obtain a certificate for the domain www.${base_subdomain}${base_domain} from Let’s Encrypt or ZeroSSL, and a certificate for the domain edge.internal.${base_subdomain}${base_domain} from the internal ACME CA.
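As a final smoke test from any machine that can reach the edge node, the public endpoint should relay the upstream's response:

```shell
# TLS here is terminated with the publicly trusted certificate,
# while the edge-to-app hop uses the internal mTLS setup.
curl https://www.${base_subdomain}${base_domain}/
```

The expected body is the application server's “OK!”.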

Conclusion

For all intents and purposes, the experiment to develop a fully authenticated infrastructure, with mTLS everywhere and Caddy as the PKI provider, is successful. Caddy is able to provide and consume the TLS certificates needed to deploy zero-trust infrastructure. The experiment also succeeded in its purpose of highlighting areas where Caddy can improve as a PKI provider for fully automated mTLS deployments.

I opted to exclude the Terraform definitions of the droplets, firewalls, database cluster, and DNS records because Terraform is not the focus of this walkthrough. The GitHub repository contains a runnable Terraform project of the full infrastructure. If you’re interested in seeing the full project, which I deployed numerous times as I developed this post, visit the GitHub repository, play with the code, and let me know what you think.