Bad certificates with joining nodes

Hi! I am trying to get our Kubernetes cluster nodes registered with a Teleport proxy/auth service (I’ve tried 4.2.9 and 4.3.6) outside the cluster so we can easily and securely log in to those nodes. Everything is in AWS: the nodes are in a private subnet and the proxy/auth server is in a public subnet of the same VPC. The Teleport proxy is running with Let’s Encrypt certs for the public-facing traffic. Our VPC-internal security groups allow all traffic on all ports. I can “normal” ssh to the nodes from the proxy/auth server, and it can forward other connections into the VPC just fine.

But I just can’t get the nodes registered with the proxy/auth server. Whatever I try, I get “bad certificate”, “signed by an unknown CA”, “no SANs/IPs in cert”, or related errors on both the server and the node. I’ve tried quite a few config combinations, and plenty of deletions of /var/lib/teleport on both the auth server and the nodes. It always fails.

Here’s my current proxy/auth config in our testing zone:

teleport:
  nodename: teleport.playground.bond.tech
  data_dir: /var/lib/teleport
  connection_limits:
    max_connections: 1000
    max_users: 250
  log:
    output: stderr
    severity: ERROR
  storage:
    type: dir
auth_service:
  authentication: 
    type: github
  enabled: yes
  listen_addr: 0.0.0.0:3025
  public_addr: teleport.playground.bond.tech:3025
  tokens: 
  - "node:supersecrettoken-playground"
ssh_service:
  enabled: yes
  listen_addr: 0.0.0.0:3022
  public_addr: teleport.playground.bond.tech:3022 
  labels:
    role: master
    type: postgres
  commands:
  - name: arch
    command: [/usr/bin/uname, -p]
    period: 1h0m0s
proxy_service:
  enabled: yes
  listen_addr: 0.0.0.0:3023
  tunnel_listen_addr: 0.0.0.0:3024
  web_listen_addr: 0.0.0.0:3080
  https_key_file: /etc/letsencrypt/live/teleport.playground.bond.tech/privkey.pem
  https_cert_file: /etc/letsencrypt/live/teleport.playground.bond.tech/fullchain.pem
  kubernetes: 
    enabled: yes
    public_addr: teleport.playground.bond.tech:3026 
    listen_addr: 0.0.0.0:3026
    kubeconfig_file: /home/teleport/.kube/config

And here’s the node config:

teleport:
  auth_token: supersecrettoken-playground
  auth_servers:
    - 10.5.0.59:3025
  data_dir: /var/lib/teleport
  connection_limits:
    max_connections: 1000
    max_users: 250
  log:
    output: stderr
    severity: ERROR
  storage:
    type: dir
ssh_service:
  enabled: yes
  listen_addr: 0.0.0.0:3022
  labels:
    role: node
    cluster: playground1
    type: postgres
  commands: 
  - name: instance-id
    command: ["curl","http://169.254.169.254/latest/meta-data/instance-id"]
    period: 1h0m0s
  - name: instance-type
    command: ["curl","http://169.254.169.254/latest/meta-data/instance-type"]
    period: 1h0m0s
  - name: zone
    command: ["curl","http://169.254.169.254/latest/meta-data/placement/availability-zone"]
    period: 1h0m0s

I’ve also tried running teleport directly with this config, with the same failures.

Errors on the node:

Proxy failed to establish connection to cluster: x509: certificate signed by unknown authority. 
Node failed to establish connection to cluster: Get https://teleport.playground.bond.tech:3025/v1/webapi/find: x509: certificate signed by unknown

The corresponding events on the proxy/auth server:

Sep 28 23:54:38 ip-10-5-0-59.us-east-2.compute.internal teleport[9966]: http: TLS handshake error from 3.22.56.43:63794: remote error: tls: bad certificate

Also seeing this error in 4.3.6:

Sep 28 23:54:59 ip-10-5-0-59.us-east-2.compute.internal teleport[9966]: ERRO [AUTH:1]    "Failed to retrieve client pool. Client cluster ip-10-5-1-154.us-east-2.compute.internal, target cluster teleport.playground.bond.tech, error:  \nERROR REPORT:\nOriginal Error: *trace.NotFoundError key /authorities/host/ip-10-5-1-154.us-east-2.compute.internal is not found\nStack Trace:\n\t/go/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:596 github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).getInTransaction\n\t/go/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:570 github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).Get.func1\n\t/go/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:867 github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).inTransaction\n\t/go/src/github.com/gravitational/teleport/lib/backend/lite/lite.go:569 github.com/gravitational/teleport/lib/backend/lite.(*LiteBackend).Get\n\t/go/src/github.com/gravitational/teleport/lib/backend/sanitize.go:97 github.com/gravitational/teleport/lib/backend.(*Sanitizer).Get\n\t/go/src/github.com/gravitational/teleport/lib/backend/report.go:130 github.com/gravitational/teleport/lib/backend.(*Reporter).Get\n\t/go/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 github.com/gravitational/teleport/lib/services/local.(*CA).GetCertAuthority\n\t/go/src/github.com/gravitational/teleport/lib/cache/cache.go:543 github.com/gravitational/teleport/lib/cache.(*Cache).GetCertAuthority\n\t/go/src/github.com/gravitational/teleport/lib/auth/middleware.go:366 github.com/gravitational/teleport/lib/auth.ClientCertPool\n\t/go/src/github.com/gravitational/teleport/lib/auth/middleware.go:162 github.com/gravitational/teleport/lib/auth.(*TLSServer).GetConfigForClient\n\t/opt/go/src/crypto/tls/handshake_server.go:147 crypto/tls.(*Conn).readClientHello\n\t/opt/go/src/crypto/tls/handshake_server.go:43 crypto/tls.(*Conn).serverHandshake\n\t/opt/go/src/crypto/tls/conn.go:1364 crypto/tls.(*Conn).Handshake\n\t/opt/go/src/net/http/server.go:1783 net/http.(*conn).serve\n\t/opt/go/src/runtime/asm_amd64.s:1358 runtime.goexit\nUser Message: key /authorities/host/ip-10-5-1-154.us-east-2.compute.internal is not found\n." auth/middleware.go:170

I know this topic has come up a few times in other threads here, but it doesn’t seem to have been resolved.

  1. I appreciate that you’ve tried a few different combinations, but I think for this to work you’ll need to either change your node’s /etc/teleport.yaml file to connect to the auth server’s FQDN:
teleport:
  auth_servers:
    - teleport.playground.bond.tech:3025

or you’ll need to set the auth_service public_addr on your Teleport auth/proxy server to the same IP that the node’s config is using (and then restart Teleport there):

auth_service:
  public_addr: 10.5.0.59:3025

Basically, the two need to match exactly. Make sure that you shut down Teleport completely on the node, rm -rf /var/lib/teleport, and then restart it after changing the config file (a rough sequence is sketched below). You might also want to set severity: DEBUG in the log: config on both ends, as this should give a lot more information about exactly what’s failing.
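
For reference, this is roughly what I’d run on the node after fixing its config - assuming the node runs Teleport as a systemd unit named teleport (adjust if you launch it some other way):

# stop Teleport on the node and wipe its cached state/certificates
sudo systemctl stop teleport
sudo rm -rf /var/lib/teleport

# optionally bump log verbosity in /etc/teleport.yaml while debugging:
#   log:
#     output: stderr
#     severity: DEBUG

# start it again and watch the registration attempt in the logs
sudo systemctl start teleport
sudo journalctl -u teleport -f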

  2. As an extra step for security, you should also get your CA pin hash from the Teleport auth server using tctl status and add this into your node config file - it’s used to make sure that the node is definitely connecting to the right machine (and hasn’t been subject to any kind of MITM on the traffic, etc):

Auth/proxy server:

$ sudo tctl status
Cluster  teleport.example.com
Version  4.3.6
User CA  never updated
Host CA  never updated
CA pin   sha256:5abdd3a143a230fd31c9706d668bba3ee25a6e0eec54fcd69680c1ec0530fe9bd

Then add this into the node config:

teleport:
  ca_pin: sha256:5abdd3a143a230fd31c9706d668bba3ee25a6e0eec54fcd69680c1ec0530fe9bd

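If you want to take the config file out of the equation while testing, the same settings can also be passed on the command line. This is just a sketch using the token and pin from above - double-check the flag names against teleport start --help for your version:

# one-off foreground run on the node, useful for confirming the token,
# CA pin and auth server address before going back to the config file
sudo teleport start --debug \
  --roles=node \
  --token=supersecrettoken-playground \
  --ca-pin=sha256:5abdd3a143a230fd31c9706d668bba3ee25a6e0eec54fcd69680c1ec0530fe9bd \
  --auth-server=teleport.playground.bond.tech:3025
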
Good luck. Let me know how you get on and we can try some other things if this doesn’t work.