Hello,
In our production setup I have been facing replication lag on one or, sometimes, both replica nodes roughly every 10 days. I know this is a common issue in HA database clusters, but it used to happen about once a month and is now occurring much more frequently.
Installation Method: Docker (timescale/timescaledb-ha:pg13.7-ts-2.7.1-latest)
OS: Ubuntu 22.04.1 (inside docker container)
PostgreSQL Version: 13.8
TimescaleDB Version: 2.7.1
Setup: 3-node Patroni HA cluster (PostgreSQL + TimescaleDB) with ETCD and PgPool, running on Docker Swarm.
Patroni Version: 2.1.4
ETCD Version: 3.5.0
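For reference, a minimal way to check the lag from the current leader looks roughly like this (a Python sketch using psycopg2; the hostname and credentials below are placeholders, not the exact values from our setup):

# replication_lag_check.py - minimal sketch, run against the current Patroni leader
import psycopg2

# Placeholder connection settings; adjust to the environment.
conn = psycopg2.connect(host="timescale-2", port=5432, user="postgres", dbname="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # pg_stat_replication is only populated on the primary; one row per connected replica.
    cur.execute("""
        SELECT application_name, state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"{name}: state={state}, replay_lag={lag_bytes} bytes")
conn.close()

When a replica drops off the WAL stream entirely (as in the logs further down), its row simply disappears from this view.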
One solution I tried: when I saw the ETCD log about the leader failing to send the heartbeat on time, I suspected the default heartbeat-interval and election-timeout values were the problem. I changed heartbeat-interval to 500ms and election-timeout to 3000ms, but I still kept getting the same "leader failed to send out heartbeat on time" warnings.
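Since that warning itself blames a slow disk rather than the timing values, one way to confirm whether the disk is really the bottleneck is to pull etcd's own latency histograms from its metrics endpoint. A minimal sketch of such a check (it assumes the default client URL on port 2379 and an unauthenticated /metrics endpoint, which may not match every deployment):

# etcd_disk_latency_check.py - minimal sketch; the URL is an assumption for a default setup
import urllib.request

METRICS_URL = "http://etcd1:2379/metrics"  # placeholder host

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    text = resp.read().decode("utf-8")

# WAL fsync and backend commit durations are the usual indicators of a slow etcd disk.
for line in text.splitlines():
    if line.startswith(("etcd_disk_wal_fsync_duration_seconds",
                        "etcd_disk_backend_commit_duration_seconds")):
        print(line)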
Could there be an issue on the TimescaleDB side? Earlier, whenever I faced a replication lag issue in the Patroni cluster, the timeline used to increase; this time the timeline did not change when the lag occurred, and the only thing logged by TimescaleDB was the configuration reload shown in the logs below.
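To be concrete about the timeline check: a minimal way to compare the timeline and the replication slots on each node looks roughly like this (a sketch; hostnames and credentials are placeholders):

# timeline_and_slot_check.py - minimal sketch; run against every node in the cluster
import psycopg2

for host in ("timescale-1", "timescale-2", "timescale-3"):  # placeholder hostnames
    conn = psycopg2.connect(host=host, port=5432, user="postgres", dbname="postgres")
    with conn.cursor() as cur:
        # Timeline recorded in the last checkpoint of this node.
        cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
        timeline = cur.fetchone()[0]
        # Physical replication slots present on this node (normally only the leader has them).
        cur.execute("SELECT slot_name, active, restart_lsn FROM pg_replication_slots")
        slots = cur.fetchall()
    conn.close()
    print(host, "timeline:", timeline, "slots:", slots)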
Please help me identify the root cause of this frequent replication lag.
I am sharing the logs from each service involved in the database HA setup, in the order in which the unusual entries appeared across the services.
Node2(Leader): ETCD Log::
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"634888d6d9dce83c","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.187007ms"}
{"level":"warn","ts":"2023-09-18T20:35:29.269Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"1ae8e1aa55b8dfeb","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"378.348601ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.565Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.271107301s","rtt":"1.50669ms"}
{"level":"warn","ts":"2023-09-18T20:38:10.614Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"634888d6d9dce83c","clock-drift":"3.270118824s","rtt":"4.186608ms"}
{"level":"warn","ts":"2023-09-18T20:38:40.566Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"634888d6d9dce83c","clock-drift":"5.424953406s","rtt":"1.706893ms"}
Node1(Replica): TimescaleDB Log::
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=127.0.0.1 port=53422
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres
2023-09-18 20:41:53 UTC [3147948]: [6508b611.3008ac-3] postgres@[unknown],app=[unknown] [00000] LOG: disconnection: session time: 0:00:00.003 user=postgres database= host=127.0.0.1 port=53422
server signaled
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-9] @,app= [00000] LOG: received SIGHUP, reloading configuration files
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-10] @,app= [00000] LOG: parameter "primary_conninfo" changed to "user=postgres passfile=/home/postgres/.pgpass.patroni host=timescale-2 port=5432 sslmode=prefer application_name=timescaledb-1 gssencmode=prefer channel_binding=prefer"
2023-09-18 20:41:53 UTC [3111817]: [6508345d.2f7b89-11] @,app= [00000] LOG: parameter "primary_slot_name" changed to "timescaledb_1"
2023-09-18 20:41:53 UTC [3147952]: [6508b611.3008b0-1] @,app= [XX000] FATAL: could not start WAL streaming: ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:41:55 UTC [3145823]: [6508af31.30005f-3] tsdb@tsdb,app=[unknown] [00000] LOG: disconnection: session time: 0:29:21.936 user=tsdb database=tsdb host=...** port=37952
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=56024
2023-09-18 20:41:55 UTC [3147954]: [6508b613.3008b2-2] tsdb@tsdb,app=[unknown] [00000] LOG: connection authorized: user=tsdb database=tsdb
Node1(Replica): PATRONI Log::
2023-09-18 20:42:00,449 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:10,465 WARNING: Master (timescaledb-2) is still alive
2023-09-18 20:42:20,469 WARNING: Master (timescaledb-2) is still alive
Node2(Leader): TimescaleDB Log::
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=33918
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-3] postgres@[unknown],app=timescaledb-1 [42704] ERROR: replication slot "timescaledb_1" does not exist
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-4] postgres@[unknown],app=timescaledb-1 [42704] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 20:42:15 UTC [391466]: [6508b627.5f92a-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.006 user=postgres database= host=...** port=33918
Node1(Replica): ETCD Log::
{"level":"warn","ts":"2023-09-18T21:35:32.488Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"4.388946076s","expected-duration":"1s"}
{"level":"warn","ts":"2023-09-18T22:35:31.521Z","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.444079168s","expected-duration":"1s"}
{"level":"info","ts":"2023-09-19T00:09:15.101Z","caller":"etcdserver/server.go:1368","msg":"triggering snapshot","local-member-id":"634888d6d9dce83c","local-member-applied-index":43900439,"local-member-snapshot-index":43800438,"local-member-snapshot-count":100000}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":43900439}
{"level":"info","ts":"2023-09-19T00:09:15.155Z","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":43895439}
{"level":"info","ts":"2023-09-19T00:09:28.269Z","caller":"fileutil/purge.go:77","msg":"purged","path":"data.etcd/member/snap/0000000000000044-0000000002963cf2.snap"}
Node2(Leader): PATRONI Log::
2023-09-18 21:39:48,328 ERROR: Request to server http://etcd1:2379 failed: MaxRetryError('HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))')
2023-09-18 21:39:48,328 ERROR:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 211, in _do_http_request
    response = request_executor(method, base_uri + path, **kwargs)
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379): Max retries exceeded with url: /v2/keys/service/timescaledb/members/timescaledb-2 (Caused by ReadTimeoutError("HTTPConnectionPool(host='etcd1', port=2379): Read timed out. (read timeout=3.3331781973441443)"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 566, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 659, in touch_member
    return self._client.set(self.member_path, data, None if permanent else self._ttl)
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 721, in set
    return self.write(key, value, ttl=ttl)
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 256, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 230, in _do_http_request
    raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
etcd.EtcdException: PUT /v2/keys/service/timescaledb/members/timescaledb-2 request failed
2023-09-18 21:40:05,153 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:15,004 ERROR: Failed to drop replication slot 'timescaledb_3'
2023-09-18 21:40:25,015 ERROR: Failed to drop replication slot 'timescaledb_3'
Node1(Replica): TimescaleDB Log::
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-1] @,app= [00000] LOG: started streaming WAL from primary at 529/9D000000 on timeline 50
2023-09-18 22:35:41 UTC [3153198]: [6508d0bd.301d2e-2] @,app= [XX000] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000032000005290000009D has already been removed
Node2(Leader): TimescaleDB Log::
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=...** port=35148
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-2] postgres@[unknown],app=[unknown] [00000] LOG: replication connection authorized: user=postgres application_name=timescaledb-1
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-3] postgres@[unknown],app=timescaledb-1 [58P01] ERROR: requested WAL segment 00000032000005290000009D has already been removed
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-4] postgres@[unknown],app=timescaledb-1 [58P01] STATEMENT: START_REPLICATION SLOT "timescaledb_1" 529/9D000000 TIMELINE 50
2023-09-18 22:39:57 UTC [400820]: [6508d1bd.61db4-5] postgres@[unknown],app=timescaledb-1 [00000] LOG: disconnection: session time: 0:00:00.009 user=postgres database= host=...* port=35148
--
Thanks & Regards