Received Immediate Shutdown Request in TimescaleDB

aviral.singh21 · November 28, 2023, 7:55pm

Hello,

In brief:
In Production and Test Setup I am getting immediate shutdown request due to which pgpool also shuts down.

I understand the pgpool behaviour as it looks for primary node and when database is shut down and by the time it restarts during that time pgpool is unable to find primary node hence pgpool is shut down. To solve pgpool issue, I changed restart policy for pgpool to any instead of on-failure.

But database issue is still intact.

TimescaleDB version: 2.7.1
PostgreSQL version: 13.8
Other software: Patroni, ETCD, PgPool
OS: Ubuntu 22.04.1 (inside docker container)
Docker Image Used: timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest
Install method: Docker Swarm
Environment: Production and Test

Detailed Scenario for Production:

This issue first occurred on 3rd November 23:30 UTC, at that time Leader was on Node-2.
This issue was faced till 07th November constantly & randomly, last shutdown request occurred at 08:42 UTC until 10th November 10:40 UTC when shutdown request was received again.
On 10th November 10:40 UTC, again this occurred and this time the Leader was switched from Node-2 to Node-1 and after that Leader was switched among Node-1 & Node-2 two-three times more.
On 16th November around 11:33 UTC, Leader was switched from Node-1 to Node-3 and since then Database seemed stable for few days till 24th November where it received immediate shutdown request twice at 12:04 UTC and 13:04 UTC. After that again it is stable now.

Hardware Specs:
CPU: 8 cores on all 3 nodes.
Memory(free command): 31 GB in Node-1 & Node-2 and 39 GB in Node-3

I checked CPU & Memory Usage at the Node Level and as well at Database Level. But there was no intense usage observed, graph metrics were same as it was used to be on normal days.
I think there might be a possibility of having a shared memory (shared_buffers) issue in PostgreSQL/TimescaleDB. But I am not able to find any rock solid proof for it.

Scenario for Test:

First time error occurred on 14th November and issue is same as explained above for Production.
For analysing why it is happening, I installed pg_buffercache, pg_stat_statements and pg_profile(for PWR) in Test Server Setup.
After 14th Nov, same issue occurred again on 26th-27th November. Received immediate shutdown request thrice at 2023-11-26 23:54:56 UTC, 2023-11-27 00:15:49 UTC & 2023-11-27 01:02:32 UTC.
Since already pg_profile and other 2 extensions were setup but no helpful information was gathered through these extensions. In fact at that time there was no heavy load was put on database in Test Server.

I have more insight on CPU & Memory Statistics in different methods. Like in grafana, Docker Swarm Service and Docker and System Monitoring dashboards are being used for monitoring docker service & docker containers. But when I monitor the Memory Usage using different method, I get different values in everywhere.

Docker Swarm Service: 11:22:30 UTC: 655 MB
Docker and System Dashboard: 11:22:30 UTC: 135 MB
Docker Stats: 11:22:33 UTC: 520.1 MB

Logs for Production:

2023-11-14 08:05:34 UTC [26]: [6552b670.1a-7] @,app= [00000] LOG:  received immediate shutdown request
2023-11-14 08:05:34 UTC [10748]: [65532a47.29fc-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING:  terminating connection because of crash of another server process
2023-11-14 08:05:34 UTC [10748]: [65532a47.29fc-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-14 08:05:34 UTC [10748]: [65532a47.29fc-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2023-11-14 08:05:34 UTC [10743]: [65532a3c.29f7-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING:  terminating connection because of crash of another server process
2023-11-14 08:05:34 UTC [10743]: [65532a3c.29f7-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-14 08:05:34 UTC [10743]: [65532a3c.29f7-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
...
2023-11-14 08:05:34 UTC [26]: [6552b670.1a-8] @,app= [00000] LOG:  archiver process (PID 45) exited with exit code 2
2023-11-14 08:05:35 UTC [26]: [6552b670.1a-9] @,app= [00000] LOG:  database system is shut down

Logs for Test:

2023-11-26 23:54:56 UTC [26]: [6563ced6.1a-11] @,app= [00000] LOG:  received immediate shutdown request
2023-11-26 23:54:56 UTC [1089]: [6563da15.441-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING:  terminating connection because of crash of another server process
2023-11-26 23:54:56 UTC [1089]: [6563da15.441-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-26 23:54:56 UTC [1089]: [6563da15.441-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
...
2023-11-26 23:54:56 UTC [740]: [6563d657.2e4-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING:  terminating connection because of crash of another server process
2023-11-26 23:54:56 UTC [740]: [6563d657.2e4-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-26 23:54:56 UTC [740]: [6563d657.2e4-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2023-11-26 23:54:56 UTC [26]: [6563ced6.1a-12] @,app= [00000] LOG:  archiver process (PID 47) exited with exit code 2
2023-11-26 23:54:56 UTC [46]: [6563ced9.2e-1] @,app= [57P02] WARNING:  terminating connection because of crash of another server process
2023-11-26 23:54:56 UTC [46]: [6563ced9.2e-2] @,app= [57P02] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-26 23:54:56 UTC [46]: [6563ced9.2e-3] @,app= [57P02] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2023-11-26 23:54:56 UTC [26]: [6563ced6.1a-13] @,app= [00000] LOG:  database system is shut down

jonatasdp · November 29, 2023, 8:42pm

It seems like something is killing process. Maybe some Out of Memory Error

Have you checked if any memory killer is not killing the process?

Anurup_Mondal · June 19, 2024, 3:53am

I am also having the same issue server is being shutdown but the container is still up

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
sh: locale: not found
2024-06-19 03:52:05.899 UTC [31] WARNING:  no usable system locales were found
performing post-bootstrap initialization ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok
Success. You can now start the database server using:
    pg_ctl -D /var/lib/postgresql/data -l logfile start
waiting for server to start....2024-06-19 03:52:14.235 UTC [37] LOG:  starting PostgreSQL 16.3 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, 64-bit
2024-06-19 03:52:14.241 UTC [37] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-06-19 03:52:14.271 UTC [40] LOG:  database system was shut down at 2024-06-19 03:52:11 UTC
2024-06-19 03:52:14.317 UTC [37] LOG:  database system is ready to accept connections
2024-06-19 03:52:14.327 UTC [43] LOG:  TimescaleDB background worker launcher connected to shared catalogs
 done
server started
CREATE DATABASE
/usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/000_install_timescaledb.sh
CREATE EXTENSION
CREATE EXTENSION
2024-06-19 03:52:18.180 UTC [61] ERROR:  TimescaleDB background worker connected to template database, exiting
2024-06-19 03:52:18.182 UTC [37] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 61) exited with exit code 1
CREATE EXTENSION
/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/001_timescaledb_tune.sh
Using postgresql.conf at this path:
/var/lib/postgresql/data/postgresql.conf
shared_buffers = 3897MB
effective_cache_size = 11691MB
maintenance_work_mem = 1948MB
work_mem = 2494kB
timescaledb.max_background_workers = 16
max_worker_processes = 35
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
wal_buffers = 16MB
min_wal_size = 512MB
default_statistics_target = 100
random_page_cost = 1.1
checkpoint_completion_target = 0.9
max_locks_per_transaction = 256
autovacuum_max_workers = 10
autovacuum_naptime = 10
effective_io_concurrency = 256
timescaledb.last_tuned = '2024-06-19T03:52:19Z'
timescaledb.last_tuned_version = '0.15.0'
Writing backup to:
/tmp/timescaledb_tune.backup202406190352
Recommendations based on 15.22 GB of available memory and 16 CPUs for PostgreSQL 16
Saving changes to: /var/lib/postgresql/data/postgresql.conf
/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/init.sql
CREATE SCHEMA
CREATE TABLE
CREATE TABLE
CREATE TABLE
  create_hypertable   
----------------------
 (1,opcua,tag_data,t)
(1 row)
CREATE TABLE
waiting for server to shut down...2024-06-19 03:52:19.391 UTC [37] LOG:  received fast shutdown request
.2024-06-19 03:52:19.396 UTC [37] LOG:  aborting any active transactions
2024-06-19 03:52:19.396 UTC [64] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2024-06-19 03:52:19.396 UTC [55] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2024-06-19 03:52:19.396 UTC [43] FATAL:  terminating background worker "TimescaleDB Background Worker Launcher" due to administrator command
2024-06-19 03:52:19.400 UTC [37] LOG:  background worker "TimescaleDB Background Worker Launcher" (PID 43) exited with exit code 1
2024-06-19 03:52:19.400 UTC [37] LOG:  background worker "logical replication launcher" (PID 44) exited with exit code 1
2024-06-19 03:52:19.402 UTC [37] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 64) exited with exit code 1
2024-06-19 03:52:19.402 UTC [37] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 55) exited with exit code 1
2024-06-19 03:52:19.415 UTC [38] LOG:  shutting down
2024-06-19 03:52:19.418 UTC [38] LOG:  checkpoint starting: shutdown immediate
.2024-06-19 03:52:21.228 UTC [38] LOG:  checkpoint complete: wrote 1745 buffers (10.7%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.951 s, sync=0.826 s, total=1.814 s; sync files=770, longest=0.013 s, average=0.002 s; distance=9687 kB, estimate=9687 kB; lsn=0/1E659D8, redo lsn=0/1E659D8
.2024-06-19 03:52:21.503 UTC [37] LOG:  database system is shut down
 done
server stopped
PostgreSQL init process complete; ready for start up.
2024-06-19 03:52:21.682 UTC [1] LOG:  starting PostgreSQL 16.3 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, 64-bit
2024-06-19 03:52:21.682 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-06-19 03:52:21.682 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-06-19 03:52:21.692 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-06-19 03:52:21.725 UTC [88] LOG:  database system was shut down at 2024-06-19 03:52:21 UTC
2024-06-19 03:52:21.938 UTC [1] LOG:  database system is ready to accept connections
2024-06-19 03:52:21.947 UTC [91] LOG:  TimescaleDB background worker launcher connected to shared catalogs

aviral.singh21 · July 8, 2024, 12:44pm

Hello @Anurup_Mondal,

I faced this issue on 2 separate setup.

In one setup, there was memory issue. There was one service apart from database which was consuing resources. Fixing that turned into resolution of database restart issue.
In another setup, I deployed ETCD service on different mountpoint. As Postgres could easily saturate all available IOPS and ETCD will struggle.

You can refer to this link: Slack