Failed to add a new node to an existing cluster

Version: Teleport Enterprise v4.2.8 git:v4.2.8-0-ga9015b33 go1.13.2

We already have a Teleport auth server set up on one AWS EC2 instance and it is working fine. I was trying to add another EC2 instance as a node.

On the auth instance, I executed /usr/local/bin/tctl nodes add --ttl=30m --roles=node:

The invite token: xxx
This token will expire in 30 minutes

Run this on the new node to join the cluster:

> teleport start \
   --roles=node \
   --token=xxx \
   --ca-pin=sha256:xxx \
   --auth-server=teleport-auth.xx.xx.xx:3025

Please note:

  - This invitation token will expire in 30 minutes
  - teleport-auth.xx.xx.xx:3025 must be reachable from the new node

Then I executed the command on the other EC2 instance, which I intend to add to the cluster:

sudo teleport start    --roles=node    --token=xxx    --ca-pin=sha256:xxx    --auth-server=teleport-auth.xxx.xxx.xx:3025 --debug

It seemed to connect at first, but then the heartbeat failed:

INFO             Using auth server: tcp://teleport-auth.xx.xx.xx:3025 config/configuration.go:945
DEBU [SQLITE]    Connected to: file:/var/lib/teleport/proc/sqlite.db?_busy_timeout=10000&_sync=OFF, poll stream period: 1s lite/lite.go:173
DEBU [SQLITE]    Synchronous: 0, busy timeout: 10000 lite/lite.go:220
DEBU [KEYGEN]    SSH cert authority started with no keys pre-compute. native/native.go:107
DEBU [PROC]      Adding service to supervisor. service:register.node sync/mutex.go:74
DEBU [PROC]      Adding service to supervisor. service:ssh.node sync/mutex.go:74
DEBU [PROC]      Adding service to supervisor. service:ssh.shutdown sync/mutex.go:74
DEBU [PROC]      Adding service to supervisor. service:common.rotate sync/mutex.go:74
DEBU [PROC:1]    Service has started. service:ssh.node service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:register.node service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:ssh.shutdown service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:common.rotate service/supervisor.go:242
DEBU [PROC:1]    No signal pipe to import, must be first Teleport process. service/service.go:706
DEBU [PROC:1]    Connected state: never updated. service/connect.go:99
INFO [PROC]      Connecting to the cluster prod with TLS client certificate. service/connect.go:127
DEBU [PROC]      Attempting to connect to Auth Server directly. service/connect.go:795
DEBU [PROC]      Connected to Auth Server with direct connection. service/connect.go:813
DEBU [PROC:1]    Connected client: Identity(Node, cert(3526f32c-82ee-40ab-9e3c-be315a19c44b.prod issued by prod:245912924890472801969688744304368047006),trust root(prod:245912924890472801969688744304368047006)) service/connect.go:81
DEBU [PROC:1]    Connected server: Identity(Node, cert(3526f32c-82ee-40ab-9e3c-be315a19c44b.prod issued by prod:245912924890472801969688744304368047006),trust root(prod:245912924890472801969688744304368047006)) service/connect.go:82
DEBU [PROC]      Adding service to supervisor. service:auth.client.node sync/mutex.go:74
DEBU [PROC:1]    Broadcasting event. event:SSHIdentity service/supervisor.go:332
DEBU [PROC:1]    Service is completed and removed. service:register.node service/supervisor.go:219
DEBU [PROC:1]    Service has started. service:auth.client.node service/supervisor.go:242
DEBU [NODE:1]    Received event "SSHIdentity". service/service.go:1404
DEBU [BPF]       Enhanced session recording is not enabled, skipping. bpf/bpf.go:88
DEBU [SSH:NODE]  Supported ciphers: ["aes128-gcm@openssh.com" "chacha20-poly1305@openssh.com" "aes128-ctr" "aes192-ctr" "aes256-ctr"]. sshutils/server.go:202
DEBU [SSH:NODE]  Supported KEX algorithms: ["curve25519-sha256@libssh.org" "ecdh-sha2-nistp256" "ecdh-sha2-nistp384" "ecdh-sha2-nistp521"]. sshutils/server.go:212
DEBU [SSH:NODE]  Supported MAC algorithms: ["hmac-sha2-256-etm@openssh.com" "hmac-sha2-256"]. sshutils/server.go:222
DEBU [NODE:BEAT] Starting Node heartbeat with announce period: 1m0s, keep-alive period 5m40.122098956s, poll period: 5s srv/heartbeat.go:131
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log. service/service.go:1634
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log/upload. service/service.go:1634
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log/upload/sessions. service/service.go:1634
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log/upload/sessions/default. service/service.go:1634
DEBU [PROC]      Adding service to supervisor. service:uploader.service sync/mutex.go:74
DEBU [PROC]      Adding service to supervisor. service:uploader.shutdown sync/mutex.go:74
INFO [PROC:1]    Service node is creating new listener on 0.0.0.0:3022. service/signals.go:213
DEBU [PROC:1]    Service has started. service:uploader.service service/supervisor.go:242
DEBU [PROC:1]    Service has started. service:uploader.shutdown service/supervisor.go:242
INFO [NODE:1]    Service is starting on 0.0.0.0:3022 no cache policy. service/service.go:1523
INFO [NODE]      Service is starting on 0.0.0.0:3022. utils/cli.go:177
DEBU [PROC:1]    Broadcasting event. event:NodeReady service/supervisor.go:332
DEBU [PROC:1]    Broadcasting mapped event. in:NodeReady out:EventMapping(in=[NodeReady], out=TeleportReady) service/supervisor.go:351
INFO [PROC:1]    The new service has started successfully. Starting syncing rotation status with period 10m0s. service/connect.go:433
DEBU [SSH:NODE]  Listening on [::]:3022. sshutils/server.go:350
DEBU [AUTH]      GRPC(): keep alive 1m0s count: 3. auth/clt.go:309
WARN [PROC:1]    Sync rotation state cycle failed: watcher has disconnected, going to retry after 10s. service/connect.go:440
WARN [NODE:BEAT] Keep alive has failed: transport is closing. srv/heartbeat.go:413
WARN [NODE:BEAT] Heartbeat failed keep alive channel closed. srv/heartbeat.go:240

I tried to check the connection to the auth node using telnet:

telnet teleport-auth.xx.xx.xx 3025
Trying 10.2.22.96...
Connected to internal-xxx.us-west-2.elb.amazonaws.com.
Escape character is '^]'.
Connection closed by foreign host.

Please advise how I should diagnose this next.

Could you share the Teleport config files for both the auth server and the node?

If there’s a load balancer between the node and the auth server, does it have an idle connection timeout? Teleport should be sending pings over the connection, but it’s possible that something is closing the connection between the two instances.
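If it’s a classic ELB, you can check its current idle timeout with the AWS CLI - just a sketch, and the load balancer name below is a placeholder:

# Show the idle timeout configured on a classic ELB (name is a placeholder)
aws elb describe-load-balancer-attributes \
    --load-balancer-name teleport-auth-elb \
    --query 'LoadBalancerAttributes.ConnectionSettings.IdleTimeout'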

Also, if you’re able, please upgrade to the latest version of Teleport (4.3.5) and try again.
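After upgrading, you can confirm what each instance is actually running with:

# Prints the Teleport version, git ref and Go runtime it was built with
teleport version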

Auth node configuration:

auth_service:
  authentication:
    type: saml
  cluster_name: prod
  enabled: 'yes'
  listen_addr: 0.0.0.0:3025
  session_recording: "proxy"
  proxy_checks_host_keys: no
proxy_service:
  enabled: 'no'
ssh_service:
  enabled: 'no'
teleport:
  advertise_ip: xxx.xxx.xxx.ai
  data_dir: /var/lib/teleport
  log:
    output: stderr
    severity: info
  nodename: teleport-auth
  storage:
    audit_events_uri:
    - file:///var/lib/teleport/log
    - dynamodb://prod-teleport-events
    audit_sessions_uri: s3://xxx-xxx-xxx/records
    region: us-west-2
    table_name: prod-teleport-state
    type: dynamodb

I did not use any special configuration for the to-be-added node. I just ran the CLI command as described above.

The auth server is indeed behind a classic load balancer, and the default idle timeout (60 seconds) is applied.

It’s difficult to say 100% what’s going on because there are no timestamps in the logs. I can see a couple of potential options:

  1. Can you adjust the timeout on the load balancer to something higher? The default keepalive interval in Teleport is 5 minutes (chosen because the AWS NLB default idle timeout is 350 seconds), so with a 60-second idle timeout the connection is closed as idle before the first keepalive is ever sent. If you can change the timeout to 350 seconds then Teleport’s default settings will work (a rough AWS CLI sketch follows after option 2).

  2. The alternative (without modifying the load balancer config) is to add something like this to your /etc/teleport.yaml on the auth server and restart:

auth_service:
  keep_alive_interval: 30s
  keep_alive_count_max: 3

This will send keepalives every 30 seconds, which should keep the connection through the load balancer from being closed as idle.
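For option 1, raising the idle timeout on a classic ELB could look like this (again, the load balancer name is a placeholder):

# Raise the classic ELB idle timeout to 350 seconds (LB name is a placeholder)
aws elb modify-load-balancer-attributes \
    --load-balancer-name teleport-auth-elb \
    --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":350}}'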

Let us know how you get on - thanks.

Thanks for your reply.
I updated the idle timeout to 360 seconds and that resolved the heartbeat failure.

So I tried to wrap things up for easier deployment, but ran into a new issue.
Here is what I did:

  1. a start_teleport.sh script:
#!/bin/bash
set -ex
set -o pipefail

# Get the node join token
aws ssm get-parameter --with-decryption --name xxx --query Parameter.Value --output text > /var/lib/teleport/node_join_token

# Get the CA cert
aws ssm get-parameter --name xxx --query=Parameter.Value --output text > /var/lib/teleport/ca.cert

sudo teleport start --config=./teleport.yaml \
        --pid-file=/var/lib/teleport/teleport-node.pid \
        --diag-addr=0.0.0.0:3000
  2. node config file
auth_service:
  enabled: 'no'
proxy_service:
  enabled: 'no'
ssh_service:
  enabled: 'yes'
  labels:
    role: crashcart
teleport:
  auth_servers:
  - xxx.xxx.xxx:3025
  auth_token: /var/lib/teleport/node_join_token
  data_dir: /var/lib/teleport
  log:
    output: stderr
    severity: info
  nodename: ArtifactoryEc2
  3. sudo bash start_teleport.sh
INFO [PROC:1]    Service diag is creating new listener on 0.0.0.0:3000. service/signals.go:213
INFO [DIAG:1]    Starting diagnostic service on 0.0.0.0:3000. service/service.go:1754
INFO [PROC]      Connecting to the cluster prod with TLS client certificate. service/connect.go:127
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log. service/service.go:1634
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log/upload. service/service.go:1634
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log/upload/sessions. service/service.go:1634
INFO [AUDIT:1]   Creating directory /var/lib/teleport/log/upload/sessions/default. service/service.go:1634
INFO [PROC:1]    Service node is creating new listener on 0.0.0.0:3022. service/signals.go:213
INFO [NODE:1]    Service is starting on 0.0.0.0:3022 sqlite cache that will expire after connection to database is lost after 20h0m0s, will cache frequently accessed items for 2s. service/service.go:1523
INFO [NODE]      Service is starting on 0.0.0.0:3022. utils/cli.go:177
[NODE]    Service is starting on 0.0.0.0:3022.
INFO [PROC:1]    The new service has started successfully. Starting syncing rotation status with period 10m0s. service/connect.go:433
INFO [PROC:1]    Detected that service started and joined the cluster successfully. service/state.go:56
INFO [PROC:1]    Service Node has updated principals to ["ArtifactoryEc2" "3526f32c-82ee-40ab-9e3c-be315a19c44b" "0.0.0.0"], DNS Names to [], going to request new principals and update. service/connect.go:680
INFO [PROC:1]    Exiting signal handler: process has started internal reload. service/signals.go:126
INFO [PROC]      Started in-process service reload. service/service.go:434
INFO [PROC:2]    Using file descriptor diag 0.0.0.0:3000 passed by the parent process. service/signals.go:207
INFO [DIAG:2]    Starting diagnostic service on 0.0.0.0:3000. service/service.go:1754
INFO [PROC]      Created new process. service/service.go:447
INFO [PROC]      Connecting to the cluster prod with TLS client certificate. service/connect.go:127
INFO [AUDIT:2]   Creating directory /var/lib/teleport/log. service/service.go:1634
INFO [AUDIT:2]   Creating directory /var/lib/teleport/log/upload. service/service.go:1634
INFO [AUDIT:2]   Creating directory /var/lib/teleport/log/upload/sessions. service/service.go:1634
INFO [AUDIT:2]   Creating directory /var/lib/teleport/log/upload/sessions/default. service/service.go:1634
INFO [PROC:2]    Using file descriptor node 0.0.0.0:3022 passed by the parent process. service/signals.go:207
INFO [NODE:2]    Service is starting on 0.0.0.0:3022 sqlite cache that will expire after connection to database is lost after 20h0m0s, will cache frequently accessed items for 2s. service/service.go:1523
INFO [NODE]      Service is starting on 0.0.0.0:3022. utils/cli.go:177
[NODE]    Service is starting on 0.0.0.0:3022.
INFO [PROC:2]    Detected that service started and joined the cluster successfully. service/state.go:56
INFO [PROC]      New service has started successfully. service/service.go:465
INFO [PROC]      Shutting down the old service with timeout 5m0s. service/service.go:475
INFO [PROC:2]    The new service has started successfully. Starting syncing rotation status with period 10m0s. service/connect.go:433
WARN [NODE:1:CA] Re-init the cache on error: context canceled. cache/cache.go:333
INFO [AUDIT:1]   Shutting down. events/uploader.go:316
INFO [AUDIT:1]   Exited. service/service.go:1672
INFO [NODE:1]    Shutting down gracefully. service/service.go:1580
INFO [DIAG:1]    Shutting down gracefully. service/service.go:1768
INFO [NODE:1]    Exited. service/service.go:1568
INFO [NODE:1]    Exited. service/service.go:1592
INFO [DIAG:1]    Exited. service/service.go:1772
INFO [PROC]      The old service was successfully shut down gracefully. service/service.go:498

After the node daemon started, I could see from the auth node that the node was indeed added:

/usr/local/bin/tctl nodes ls
ArtifactoryEc2                xxx-xxx-xxx xxx.xxx.xxx:3022                role=crashcart

And the node also shows up in the proxy website:
[screenshot of the web UI showing the node]

But logging into the node failed:
failed connecting to node ArtifactoryEc2. error: dial tcp xxx.xxx.xxx.xxx:3022: i/o timeout

BTW, what flag can I use to enable timestamp logging?

Thanks for your reply.
I updated the idle timeout to 360 seconds and that resolved the heartbeat failure.

That’s great news! Thanks for letting us know.

BTW, what flag can I use to enable timestamp logging?

Unfortunately, Teleport doesn’t currently have any native way to add timestamps to logs. We have an open issue tracking the feature request here. Please add a comment and a +1 if it’s something you’d like to see implemented - we use comments and +1s to track the importance of issues to customers.

The current recommendation is to run Teleport via a service manager like systemd or supervisord - either of those can add its own timestamps to the logs to make them more useful.
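As a rough sketch (the binary and config paths are assumptions - adjust them to your install), a minimal systemd unit for the node could look like this:

# /etc/systemd/system/teleport.service - minimal sketch, paths are assumptions
[Unit]
Description=Teleport SSH node
After=network.target

[Service]
Type=simple
Restart=on-failure
# Point --config at the node's teleport.yaml
ExecStart=/usr/local/bin/teleport start --config=/etc/teleport.yaml --pid-file=/run/teleport.pid
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/run/teleport.pid

[Install]
WantedBy=multi-user.target

Then sudo systemctl daemon-reload && sudo systemctl enable --now teleport, and read the timestamped logs with journalctl -u teleport.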

failed connecting to node ArtifactoryEc2. error: dial tcp xxx.xxx.xxx.xxx:3022: i/o timeout

This error message means pretty much exactly what it says - when you connect to a Teleport node, your client (either the machine where you’re running the tsh client, or the Teleport proxy server if you’re using the web UI) needs to be able to reach the node’s IP address on port 3022, much like ssh requires you to be able to connect to port 22.

If you can open up port 3022 on the node, you should be able to connect fine.
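A quick way to test raw reachability from the machine that is failing to connect (substitute the node’s IP from the error message):

# Check that TCP port 3022 on the node is reachable (telnet or nc, whichever is installed)
telnet <node-ip> 3022
nc -vz <node-ip> 3022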

This error message means pretty much exactly what it says - when you connect to a Teleport node, your client (either the machine where you’re running the tsh client, or the Teleport proxy server if you’re using the web UI) needs to be able to reach the node’s IP address on port 3022, much like ssh requires you to be able to connect to port 22.

I have already opened port 3022 in the inbound rules of the security group:
[screenshot of the security group inbound rules]
That should be sufficient to allow connections from the LB, the proxy node, or the auth node.
I still get this error in the web terminal:

failed connecting to node ArtifactoryEc2. dial tcp 10.2.22.96:3022: i/o timeout

It’s possible that there’s another security group blocking the traffic - is the node in the same VPC, or a different VPC?

If you open a shell on the proxy machine, can you connect to 10.2.22.96:3022 from there using telnet or nc?

On the proxy node:

telnet  10.2.22.96 3022
Trying 10.2.22.96...
telnet: connect to address 10.2.22.96: Connection timed out

I noticed that the IP of the node is actually 10.2.12.56, and that is the instance whose SG I configured. How come the IP showing up in the cluster is 10.2.22.96?

It looks like the outbound traffic is being mutated somehow. Do you have any elastic IPs or NAT gateways associated with that instance?

What’s the output of curl ifconfig.co from that instance?
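It might also be worth finding out what actually owns 10.2.22.96 - a rough AWS CLI sketch (region taken from your storage config):

# Look up the network interface that holds the private IP the cluster is reporting
aws ec2 describe-network-interfaces \
    --region us-west-2 \
    --filters Name=addresses.private-ip-address,Values=10.2.22.96 \
    --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Description:Description,Instance:Attachment.InstanceId}'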

You can probably work around this by setting advertise_ip in the Teleport config on the node and restarting Teleport:

teleport:
  advertise_ip: 10.2.12.56
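After restarting Teleport on the node, the auth server should report the new address:

# On the auth server: confirm the node now advertises 10.2.12.56
/usr/local/bin/tctl nodes ls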
ubuntu@ip-10-2-12-56:~$ curl ifconfig.co
52.88.116.186

Adding advertise_ip: 10.2.12.56 finally made it work. Thanks!
