Hello,
In brief:
In both the Production and Test setups, PostgreSQL is receiving an immediate shutdown request, and as a result pgpool also shuts down.
I understand the pgpool behaviour: pgpool looks for the primary node, and while the database is shut down and restarting it cannot find one, so pgpool shuts itself down. To address the pgpool side, I changed the restart policy of the pgpool service from on-failure to any; a sketch of the change is below.
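For reference, this is roughly the change I applied; the service name is only a placeholder for my actual pgpool service:

# Restart the pgpool task on any exit, not only on failure (placeholder service name)
docker service update --restart-condition any <pgpool-service-name>

# Equivalent setting in the stack/compose file under the pgpool service:
#   deploy:
#     restart_policy:
#       condition: any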
But the underlying database issue remains.
TimescaleDB version: 2.7.1
PostgreSQL version: 13.8
Other software: Patroni, ETCD, PgPool
OS: Ubuntu 22.04.1 (inside docker container)
Docker Image Used: timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest
Install method: Docker Swarm
Environment: Production and Test
Detailed Scenario for Production:
- This issue first occurred on 3rd November at 23:30 UTC; at that time the Leader was on Node-2.
- The issue then occurred constantly and randomly until 7th November, with the last shutdown request at 08:42 UTC, and nothing further until 10th November at 10:40 UTC when a shutdown request was received again.
- On 10th November at 10:40 UTC it occurred again, and this time the Leader switched from Node-2 to Node-1; after that the Leader switched between Node-1 and Node-2 two or three more times.
- On 16th November around 11:33 UTC the Leader switched from Node-1 to Node-3, and the database then seemed stable for a few days, until 24th November when it received an immediate shutdown request twice, at 12:04 UTC and 13:04 UTC. It has been stable again since then.
Hardware Specs:
CPU: 8 cores on all 3 nodes.
Memory (free command): 31 GB on Node-1 and Node-2, 39 GB on Node-3
I checked CPU and memory usage at both the node level and the database level, but no intense usage was observed; the metrics graphs looked the same as on normal days.
I think there might be a shared memory (shared_buffers) issue in PostgreSQL/TimescaleDB, but I have not been able to find any solid proof of it.
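As a basic check on the shared memory side, I can at least confirm the configured size and the largest server-side allocations; this is only a sketch, and the connection options are placeholders for my actual ones:

# Configured shared_buffers size
psql -h localhost -U postgres -d tsdb -c "SHOW shared_buffers;"

# Largest allocations in the main shared memory segment (pg_shmem_allocations is available in PostgreSQL 13+)
psql -h localhost -U postgres -d tsdb -c "SELECT name, pg_size_pretty(allocated_size) AS size FROM pg_shmem_allocations ORDER BY allocated_size DESC LIMIT 10;"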
Scenario for Test:
- The error first occurred on 14th November, and the issue is the same as described above for Production.
- To analyse why it is happening, I installed pg_buffercache, pg_stat_statements and pg_profile (for PWR) in the Test Server setup.
- After 14th November, the same issue occurred again on 26th-27th November: an immediate shutdown request was received three times, at 2023-11-26 23:54:56 UTC, 2023-11-27 00:15:49 UTC and 2023-11-27 01:02:32 UTC.
- Although pg_profile and the other two extensions were already set up, no helpful information was gathered through them; a sample of the kind of query I ran is shown after this list. In fact, there was no heavy load on the database in the Test Server at that time.
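For reference, the check I was running against pg_buffercache is essentially the standard per-relation summary from the PostgreSQL documentation; only the database name is taken from my setup:

# Top relations by number of shared buffers currently cached
psql -d tsdb -c "
  SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                             WHERE datname = current_database()))
  GROUP BY c.relname
  ORDER BY buffers DESC
  LIMIT 10;"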
I have also looked at CPU and memory statistics through different methods. In Grafana, the 'Docker Swarm Service' and 'Docker and System Monitoring' dashboards are used to monitor the Docker services and containers. But when I measure memory usage with the different methods, each one reports a different value:
- Docker Swarm Service dashboard, 11:22:30 UTC: 655 MB
- Docker and System Monitoring dashboard, 11:22:30 UTC: 135 MB
- docker stats, 11:22:33 UTC: 520.1 MB
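To have a common reference point, I also compare docker stats with the container's raw cgroup accounting; the container name and cgroup path below are placeholders and depend on the host's cgroup version and driver:

# Point-in-time memory usage as reported by Docker
docker stats --no-stream <timescaledb-container>

# Raw cgroup counter for the same container (cgroup v2 with the systemd driver;
# on cgroup v1 the file is .../memory/docker/<id>/memory.usage_in_bytes)
cat /sys/fs/cgroup/system.slice/docker-<full-container-id>.scope/memory.current

As far as I understand, the dashboards and docker stats count page cache and PostgreSQL's shared memory differently, which could explain at least part of the mismatch.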
Logs for Production:
2023-11-14 08:05:34 UTC [26]: [6552b670.1a-7] @,app= [00000] LOG: received immediate shutdown request
2023-11-14 08:05:34 UTC [10748]: [65532a47.29fc-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING: terminating connection because of crash of another server process
2023-11-14 08:05:34 UTC [10748]: [65532a47.29fc-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-14 08:05:34 UTC [10748]: [65532a47.29fc-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-11-14 08:05:34 UTC [10743]: [65532a3c.29f7-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING: terminating connection because of crash of another server process
2023-11-14 08:05:34 UTC [10743]: [65532a3c.29f7-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-14 08:05:34 UTC [10743]: [65532a3c.29f7-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT: In a moment you should be able to reconnect to the database and repeat your command.
...
...
2023-11-14 08:05:34 UTC [26]: [6552b670.1a-8] @,app= [00000] LOG: archiver process (PID 45) exited with exit code 2
2023-11-14 08:05:35 UTC [26]: [6552b670.1a-9] @,app= [00000] LOG: database system is shut down
Logs for Test:
2023-11-26 23:54:56 UTC [26]: [6563ced6.1a-11] @,app= [00000] LOG: received immediate shutdown request
2023-11-26 23:54:56 UTC [1089]: [6563da15.441-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING: terminating connection because of crash of another server process
2023-11-26 23:54:56 UTC [1089]: [6563da15.441-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-26 23:54:56 UTC [1089]: [6563da15.441-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT: In a moment you should be able to reconnect to the database and repeat your command.
...
...
2023-11-26 23:54:56 UTC [740]: [6563d657.2e4-3] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] WARNING: terminating connection because of crash of another server process
2023-11-26 23:54:56 UTC [740]: [6563d657.2e4-4] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-26 23:54:56 UTC [740]: [6563d657.2e4-5] tsdb@tsdb,app=PostgreSQL JDBC Driver [57P02] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-11-26 23:54:56 UTC [26]: [6563ced6.1a-12] @,app= [00000] LOG: archiver process (PID 47) exited with exit code 2
2023-11-26 23:54:56 UTC [46]: [6563ced9.2e-1] @,app= [57P02] WARNING: terminating connection because of crash of another server process
2023-11-26 23:54:56 UTC [46]: [6563ced9.2e-2] @,app= [57P02] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-11-26 23:54:56 UTC [46]: [6563ced9.2e-3] @,app= [57P02] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2023-11-26 23:54:56 UTC [26]: [6563ced6.1a-13] @,app= [00000] LOG: database system is shut down