Postgres pod in crash-loop

Problem:

The Postgres pod was in a crash loop, with the following errors in its log:

LOG: database system was shut down at 2019-03-20 14:46:22 UTC
LOG: record with incorrect prev-link B17/E7000003 at 2/8B4B0580
LOG: invalid primary checkpoint record
LOG: record with incorrect prev-link 5C92/52270000 at 2/8B4B0510
LOG: invalid secondary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 24) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
LOG: database system is shut down
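
To confirm that a crash-looping pod is hitting this particular failure, the saved log can be searched for the PANIC line. The following is a minimal sketch, not part of the original incident response: the helper name has_checkpoint_panic and the log path are hypothetical, and it assumes kubectl is available to fetch the log of the previous (crashed) container.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given log file
# contains the "no valid checkpoint record" PANIC shown above.
has_checkpoint_panic() {
    grep -q 'PANIC:.*could not locate a valid checkpoint record' "$1"
}

# Example usage (the pod name is a placeholder, the log path is arbitrary):
#   kubectl logs <postgres-pod> --previous > /tmp/postgres.log
#   has_checkpoint_panic /tmp/postgres.log && echo "checkpoint record corruption suspected"
```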

As a result, other Customer platform pods dependent on postgres were stuck in “init”.

Analysis:

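The errors indicated that the Postgres data directory (in particular, the transaction log) was corrupted: neither the primary nor the secondary checkpoint record could be read, so the startup process aborted.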

The following Stack Overflow question provided a recommended solution:

https://stackoverflow.com/questions/8799474/postgresql-error-panic-could-not-locate-a-valid-checkpoint-record

Solution:

To repair the transaction log, we launched a postgres container with the data directory mounted and ran the pg_resetxlog command:

$ sudo gravity enter

$ docker run -v /opt/customer/storage/pgdata:/opt/customer/storage/pgdata -ti \
    leader.telekube.local:5000/postgres:9.6 /bin/sh

# inside the postgres container

$ su postgres

$ /usr/lib/postgresql/9.6/bin/pg_resetxlog /opt/customer/storage/pgdata
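
pg_resetxlog rewrites control information and discards the existing transaction log, so recent transactions can be lost. A cautious variant of the steps above, sketched here under assumptions, is to copy the data directory first and to preview the reset with the -n (no operation) flag. The helper name backup_pgdata and the backup naming scheme are hypothetical; the copy is meant to be made on the node, before launching the container and while postgres is not running.

```shell
# Hypothetical helper: copy a data directory to a timestamped sibling
# directory and print the path of the copy.
backup_pgdata() {
    src="${1%/}"                                  # strip a trailing slash, if any
    dest="${src}.bak-$(date +%Y%m%d%H%M%S)"
    cp -a "$src" "$dest" && printf '%s\n' "$dest"  # -a preserves ownership and modes
}

# Example usage:
#   on the node, before starting the container:
#     backup_pgdata /opt/customer/storage/pgdata
#   inside the container, as the postgres user, before the real reset:
#     /usr/lib/postgresql/9.6/bin/pg_resetxlog -n /opt/customer/storage/pgdata
#   With -n, pg_resetxlog only prints the values it would use and modifies nothing.
```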

After that, we restarted the postgres pod, which came back up, and all Customer pods launched as well.

Prepared by: @r0mant