Install failed in the /election (cluster leader election) phase

2019-05-28T15:58:08Z DEBU [SUBSYSTEM] Starting subsystem trace.fields:map[src:10.162.168.18:45778 dst:10.162.168.18:61023] install/hook.go:56
2019-05-28T15:58:08Z INFO [SUBSYSTEM] Connected to auth server: 127.0.0.1:61025 trace.fields:map[src:10.162.168.18:45778 dst:10.162.168.18:61023] install/hook.go:56
2019-05-28T15:58:09Z INFO             Connected to installer. phase:/connect-installer install/hook.go:56
2019-05-28T15:58:09Z DEBU [FSM:INSTA] Applied StateChange(Phase=/connect-installer, State=completed). opid:e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0 install/hook.go:56
2019-05-28T15:58:09Z DEBU [FSM:INSTA] Executing phase "/election". opid:e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0 install/hook.go:56
2019-05-28T15:58:10Z DEBU [OPS]       Created: ops.ProgressEntry{ID:"", SiteDomain:"lirany", OperationID:"e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0", Created:time.Time{wall:0x2dc95bb, ext:63694655890, loc:(*time.Location)(nil)}, Completion:54, Step:9, State:"in_progress", Message:"Enable cluster leader elections"}. install/hook.go:56
2019-05-28T15:58:10Z DEBU [FSM:INSTA] Applied StateChange(Phase=/election, State=in_progress). opid:e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0 install/hook.go:56
2019-05-28T15:58:10Z INFO             Executing phase: /election. advertise-ip:10.162.168.18 hostname:localhost.localdomain phase:/election install/hook.go:56
2019-05-28T15:58:10Z INFO             Enable leader elections advertise-ip:10.162.168.18 hostname:localhost.localdomain phase:/election install/hook.go:56
2019-05-28T15:58:12Z ERRO             "All attempts failed: \nERROR REPORT:\nOriginal Error: *exec.ExitError exit status 255\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/exec.go:128 github.com/gravitational/gravity/lib/utils.RunStream\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/exec.go:91 github.com/gravitational/gravity/lib/utils.RunCommand\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/exec.go:80 github.com/gravitational/gravity/lib/utils.RunPlanetCommand\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/election.go:102 github.com/gravitational/gravity/lib/install/phases.(*enableElectionExecutor).resumeLeader\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/election.go:74 github.com/gravitational/gravity/lib/install/phases.(*enableElectionExecutor).Execute.func1\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:209 github.com/gravitational/gravity/lib/utils.RetryTransient.func1\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:235 github.com/gravitational/gravity/lib/utils.RetryWithInterval.func1\n\t/gopath/src/github.com/gravitational/gravity/vendor/github.com/cenkalti/backoff/retry.go:37 github.com/gravitational/gravity/vendor/github.com/cenkalti/backoff.RetryNotify\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:234 github.com/gravitational/gravity/lib/utils.RetryWithInterval\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:208 github.com/gravitational/gravity/lib/utils.RetryTransient\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/election.go:73 github.com/gravitational/gravity/lib/install/phases.(*enableElectionExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 
github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to enable election for master 10.162.168.18. reason: [ERROR]: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout\n\n\n." install/hook.go:56
2019-05-28T15:58:12Z ERRO             "Phase execution failed: failed to enable election for master 10.162.168.18. reason: [ERROR]: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout\n\n." advertise-ip:10.162.168.18 hostname:localhost.localdomain phase:/election install/hook.go:56
2019-05-28T15:58:12Z DEBU [SSH:PROXY] incoming connection 10.162.168.18:49704 -> 10.162.168.18:61024 vesion: SSH-2.0-Go install/hook.go:56
2019-05-28T15:58:12Z DEBU [PROXY:SER] new tunnel from 10.162.168.18:49704 install/hook.go:56
2019-05-28T15:58:12Z DEBU [PROXY:SER] Using TLS client to remote cluster. cluster:lirany install/hook.go:56
2019-05-28T15:58:12Z DEBU [BACKEND:D] Found new flat keyspace, skipping migration. dir:/tmp/gravity-wizard298886830/cache/reverse/lirany install/hook.go:56
2019-05-28T15:58:12Z DEBU [PROXY:SER] Requesting teleport-transport-dial connection to remote site with payload: @remote-auth-server cluster:lirany install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Connection problem (Get https://teleport.cluster.local/v2/domain: lirany is offline) failed connect to the auth servers, using local cache. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z WARN [CLIENT:CA] Backoff: using cached value due to recent errors. install/hook.go:56
2019-05-28T15:58:12Z DEBU [PROXY:SER] Requesting teleport-transport-dial connection to remote site with payload: @remote-auth-server cluster:lirany install/hook.go:56
2019-05-28T15:58:12Z DEBU [CLIENT:CA] Auth cache: open /tmp/gravity-wizard298886830/cache/reverse/lirany/localCluster: no such file or directory, cluster configuration not found, Get https://teleport.cluster.local/v2/tunnelconnections: lirany is offline. install/hook.go:56
2019-05-28T15:58:12Z INFO [PROXY:SER] Connection <- 10.162.168.18:49704, clusters: 1. cluster:lirany install/hook.go:56
2019-05-28T15:58:12Z DEBU [PROXY:SER] Ticking with period 10s cluster:lirany install/hook.go:56
2019-05-28T15:58:12Z DEBU [PROXY:SER] ping <- 10.162.168.18:49704 cluster:lirany install/hook.go:56
2019-05-28T15:58:14Z DEBU [FSM:INSTA] "Applied StateChange(Phase=/election, State=failed, Error=failed to enable election for master 10.162.168.18. reason: [ERROR]: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout\n\n)." opid:e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0 install/hook.go:56
2019-05-28T15:58:14Z ERRO [INSTALLER] "Failed to execute plan: \nERROR REPORT:\nOriginal Error: *exec.ExitError exit status 255\nStack Trace:\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/exec.go:128 github.com/gravitational/gravity/lib/utils.RunStream\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/exec.go:91 github.com/gravitational/gravity/lib/utils.RunCommand\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/exec.go:80 github.com/gravitational/gravity/lib/utils.RunPlanetCommand\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/election.go:102 github.com/gravitational/gravity/lib/install/phases.(*enableElectionExecutor).resumeLeader\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/election.go:74 github.com/gravitational/gravity/lib/install/phases.(*enableElectionExecutor).Execute.func1\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:209 github.com/gravitational/gravity/lib/utils.RetryTransient.func1\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:235 github.com/gravitational/gravity/lib/utils.RetryWithInterval.func1\n\t/gopath/src/github.com/gravitational/gravity/vendor/github.com/cenkalti/backoff/retry.go:37 github.com/gravitational/gravity/vendor/github.com/cenkalti/backoff.RetryNotify\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:234 github.com/gravitational/gravity/lib/utils.RetryWithInterval\n\t/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:208 github.com/gravitational/gravity/lib/utils.RetryTransient\n\t/gopath/src/github.com/gravitational/gravity/lib/install/phases/election.go:73 github.com/gravitational/gravity/lib/install/phases.(*enableElectionExecutor).Execute\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:421 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:355 
github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:315 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:192 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase\n\t/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:150 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan\n\t/gopath/src/github.com/gravitational/gravity/lib/install/flow.go:335 github.com/gravitational/gravity/lib/install.(*Installer).startFSM\n\t/go/src/runtime/asm_amd64.s:1334 runtime.goexit\nUser Message: failed to enable election for master 10.162.168.18. reason: [ERROR]: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout\n\n, failed to execute phase \"/election\"\n." install/hook.go:56
2019-05-28T15:58:15Z INFO [OPS]       ops.SetOperationStateRequest{State:"failed", Progress:(*ops.ProgressEntry)(0xc005b74500)} install/hook.go:56
2019-05-28T15:58:17Z DEBU [PROXY:SER] ping <- 10.162.168.18:49704 latency:281.257µs cluster:lirany install/hook.go:56
2019-05-28T15:58:17Z DEBU [OPS]       Created: ops.ProgressEntry{ID:"", SiteDomain:"lirany", OperationID:"e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0", Created:time.Time{wall:0xc905493, ext:63694655895, loc:(*time.Location)(nil)}, Completion:100, Step:9, State:"failed", Message:"Operation failure: exit status 255"}. install/hook.go:56
2019-05-28T15:58:17Z DEBU [FSM:INSTA] Marked operation complete. opid:e2bb2a7v-f044-4c18-9d7b-6f06665ccfc0 install/hook.go:56
2019-05-28T15:58:18Z INFO             Operation failed. install/hook.go:56

So the key part of this error is the etcd message: 2019-05-28T15:58:14Z DEBU [FSM:INSTA] "Applied StateChange(Phase=/election, State=failed, Error=failed to enable election for master 10.162.168.18. reason: [ERROR]: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout

The most common cause of errors like this is etcd not having enough IOPS available to keep up. Where available, we generally recommend running etcd on its own disk, with at least 1500 provisioned IOPS: https://gravitational.com/gravity/docs/requirements/#etcd-disk
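To get a rough sense of whether the disk can keep up, one quick check is a synchronous-write test against the filesystem etcd writes to. This is only a sketch: the path below is a placeholder (point it at your actual etcd data disk), and a tool like fio with --fdatasync=1 gives a much more accurate picture when it is available.

```shell
#!/bin/sh
# Rough synchronous-write latency check. NOTE: /tmp/etcd-disk-test is a
# placeholder path; substitute a file on the disk etcd actually uses.
# Each 2300-byte block (roughly the size of an etcd WAL entry) is flushed
# with dsync, so the reported speed reflects fsync latency, not raw
# sequential bandwidth.
dd if=/dev/zero of=/tmp/etcd-disk-test bs=2300 count=500 oflag=dsync
rm -f /tmp/etcd-disk-test
```

If dd reports throughput of only a few hundred KB/s here, the disk is likely too slow for etcd's write pattern.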

To confirm whether this is the case, the next troubleshooting step would be to use gravity shell to get a shell in the planet container we use to run kubernetes / etcd / docker / etc and look at the etcd logs (journalctl -u etcd). If etcd is throwing errors about operations taking too long, that generally indicates the disk is too slow.
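Concretely, the checks might look like the sketch below. The gravity and etcdctl invocations are shown as comments because they only make sense on the affected node, and the etcdctl endpoint/flags are assumptions that vary by environment.

```shell
#!/bin/sh
# On the affected node (shown as comments, since these need the cluster):
#
#   sudo gravity shell                                   # enter planet
#   etcdctl --endpoints=https://127.0.0.1:2379 cluster-health
#
# Inside planet, scan the etcd journal for slow-operation warnings.
# Guarded so it is a harmless no-op on hosts without an etcd unit.
journalctl -u etcd --no-pager 2>/dev/null | grep -i "took too long" | tail -n 20 || true
```

A steady stream of "took too long" warnings, rather than the occasional one, is what points at disk latency.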

As you mentioned, it does look like the logs refer to operations "taking too long":

Jun 16 10:53:27 localhost.localdomain etcd[314]: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:462" took too long (289.343879ms) to execute
Jun 16 10:53:27 localhost.localdomain etcd[314]: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:292" took too long (118.399698ms) to execute

One more issue: when I ran gravity plan from the gravity shell, it showed that the election phase completed:
✓ election Enable cluster leader elections Completed - /app -

But when I run gravity plan from the host, it shows that it failed:
× election Enable cluster leader elections Failed - /app Sun Jun 16 10:40 UTC
The /election phase ("Enable cluster leader elections") has failed
exit status 255

Can you advise?

thanks

Hmmm, I’m not sure those queries taking too long confirm that etcd / slow disks are the culprit. I do see a couple of queries taking long from time to time, but unless that leads to fail-overs or more severe errors from etcd, these logs alone don’t confirm the disk is slow.

The discrepancy between running gravity plan from inside versus outside of planet could be due to the way we track state during install. It’s possible that, although etcd appeared to time out when the install saved its state (so the install gave up, which is what the plan on the host shows), etcd did actually receive the state update, which is what you see when running gravity plan from within planet.

Were there any etcd logs around 2019-05-28T15:58:12Z when the etcd error occurred? Does this error come up consistently? Also, can you provide any insight into the environment: is it a VM, a single node or multiple nodes, running on AWS, etc.?
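For the environment questions, a quick snapshot like the one below (all standard Linux utilities) is usually enough to attach to the reply. systemd-detect-virt is not shipped on every distro, hence the guard; that guard is the only assumption here.

```shell
#!/bin/sh
# Gather a quick environment snapshot to include with the report.
echo "CPUs:";   nproc
echo "Memory:"; free -h
echo "Disks:";  df -h
# Prints the hypervisor (kvm, vmware, ...) or "none" on bare metal;
# guarded because not every distro ships systemd-detect-virt.
command -v systemd-detect-virt >/dev/null 2>&1 && systemd-detect-virt || true
```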