Trusted Cluster SSH exits on interval

I’m running a trusted cluster setup, and after upgrading all nodes to 4.0.4 I’ve run into an issue where any open SSH session gets kicked out on a 10-minute interval.

I see these messages on the teleport-proxy node in my trusted cluster (say “production”):

Unable to continue processesing requests: heartbeat: connection closed. target:<main_cluster_teleport_proxy_dns>:3024 reversetunnel/agent.go:447
Aug 20 03:43:27 teleport-proxy001 teleport[1233]: WARN [PROXY:AGE] Proxy transport failed: read tcp 172.31.38.185:43428->172.31.49.165:3025: use of closed network connection *net.OpError. target:<main_cluster_teleport_proxy_dns>:3024 reversetunnel/transport.go:318

and I see this log on the teleport proxy of my “main” cluster:

Aug 20 03:39:03 teleport-proxy001 teleport[7931]: WARN [REVERSE:B] Re-init the cache on error: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing ballpit is offline". logrus/entry.go:188

Note that the log above appears much more frequently (almost once per second) than the first log (which appears once every 10 minutes).

My teleport proxies in my “main” cluster are running behind an AWS NLB.

Any ideas? Here is the output of teleport version:
Teleport v4.0.4 git:v4.0.4-0-g1a2ed507 go1.12.1

Thanks!

Do you have cross-zone load balancing enabled on your NLBs?

@hmadison No, I do not. Should I enable it?

I should also note that this is not a problem when my non-main clusters are running 3.2.6; it only happens on 4.0.4. Our main cluster is always running 4.0.4, though.

If you have two proxies behind an NLB, cross-zone load balancing should be turned on for trusted clusters to work properly. Can you try turning this on and see if it helps prevent the trusted cluster from going offline?
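If it helps, here’s a minimal sketch of flipping that attribute with boto3 (the load balancer name “my-teleport-nlb” is just a placeholder; the NLB attribute key is load_balancing.cross_zone.enabled):

```python
# Minimal sketch: enable cross-zone load balancing on an existing NLB via boto3.
# "my-teleport-nlb" is a placeholder name; substitute your own load balancer.
import boto3

elbv2 = boto3.client("elbv2")

# Look up the NLB's ARN by name.
nlb = elbv2.describe_load_balancers(Names=["my-teleport-nlb"])["LoadBalancers"][0]

# Turn on cross-zone load balancing.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=nlb["LoadBalancerArn"],
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "true"}],
)
```

The same setting can also be changed from the AWS console under the load balancer’s attributes.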

@sasha we did that and I’m not sure that it worked completely. What did work, though, was completely deleting the trusted cluster (including the DynamoDB table associated with it) and bringing it back online from scratch. Not sure why that worked versus just upgrading, but we haven’t had problems since.
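In case it’s useful to anyone else, the DynamoDB part of that cleanup looked roughly like the sketch below. The table name here is hypothetical (use whatever table your cluster’s backend points at), and deleting and re-creating the trusted cluster itself still happens separately on the Teleport side.

```python
# Rough sketch of the DynamoDB cleanup step only. The table name is
# hypothetical; the trusted cluster resource itself is removed and
# re-created separately on the Teleport side.
import boto3

dynamodb = boto3.client("dynamodb")

# Drop the backend table tied to the cluster so its state is rebuilt
# from scratch when the cluster is brought back online.
dynamodb.delete_table(TableName="teleport-trusted-cluster-state")

# Wait until the table is actually gone before starting Teleport again.
dynamodb.get_waiter("table_not_exists").wait(TableName="teleport-trusted-cluster-state")
```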