I’m running a trusted cluster setup. After upgrading all nodes to 4.0.4, I have run into an issue where every open SSH session gets disconnected every 10 minutes.
I see this message on the teleport-proxy node in my trusted cluster (say “production”):
Unable to continue processesing requests: heartbeat: connection closed. target:<main_cluster_teleport_proxy_dns>:3024 reversetunnel/agent.go:447
Aug 20 03:43:27 teleport-proxy001 teleport: WARN [PROXY:AGE] Proxy transport failed: read tcp 172.31.38.185:43428->172.31.49.165:3025: use of closed network connection *net.OpError. target:<main_cluster_teleport_proxy_dns>:3024 reversetunnel/transport.go:318
and I see this log on the Teleport proxy of my “main” cluster:
Aug 20 03:39:03 teleport-proxy001 teleport: WARN [REVERSE:B] Re-init the cache on error: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing ballpit is offline". logrus/entry.go:188
Note that the log above appears far more frequently (almost once a second) than the first log (once every 10 minutes).
My teleport proxies in my “main” cluster are running behind an AWS NLB.
Any ideas? Here is my Teleport version:
Teleport v4.0.4 git:v4.0.4-0-g1a2ed507 go1.12.1