This is a question about application design and fitting Postgres concepts into modern workflows.
I have a single Timescale database running in Kubernetes with High Availability (using the brilliant Timescale-Single Helm chart). I have a Python application which is also HA, receiving streaming data via websockets and then writing this data into TimescaleDB.
I want to avoid gaps in the streamed data (websocket outages, app crashes, etc.), so my current strategy is to have the Python app replicated x3, with each app receiving the same data streams and attempting to write to the same tables. On the Timescale side, the tables being written to have UNIQUE constraints to ensure that the Python apps cannot write the same data twice.
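For concreteness, here's a minimal sketch of the kind of table I mean (the table, column, and connection details are made up for illustration; note that on a hypertable the unique constraint has to include the time/partitioning column):

```python
# Hypothetical schema: dedupe via a unique constraint that all 3 app
# replicas bounce off. On a hypertable, the unique index must include
# the time (partitioning) column.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS ticks (
    time   TIMESTAMPTZ      NOT NULL,
    symbol TEXT             NOT NULL,
    price  DOUBLE PRECISION,
    UNIQUE (time, symbol)
);
SELECT create_hypertable('ticks', 'time', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=tsdb user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```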
This creates the obvious problem of deadlocks. Having read up on them a little, the general consensus seems to be that deadlocks are the result of application design flaws, so I'm open to suggestions on ways to improve my current strategy. I have tried breaking the Python app into smaller components, such that each app can only write to a single table (i.e. 3 Python apps per table… very resource hungry!), but I am still getting deadlocks. On the Python side, I am able to configure the maximum number of buffered transactions to write at once. I'm not sure what I should be looking at on the Postgres side, and I definitely don't know how I can have more than one Python app performing the same role without clashes!
Trying to insert the same data from multiple sources with a unique constraint is typically an anti-pattern. From an architectural perspective, it seems like there's a missing piece further up the pipeline. Two quick patterns that many people use are:
Apache Kafka: Stream your data to Kafka and let it deal with the streaming inserts to TimescaleDB. There's more nuance here, but as a general statement, this is one of the strengths/purposes of Kafka and it's certainly worth a look.
Use another messaging queue of some sort to receive the data and then use the multiple Python apps to pull messages from the queue and insert them. If something goes wrong, the message should be returned to the queue so that nothing gets lost. But in this architecture, each Python app/worker can pull a batch of messages (i.e. 100 messages to insert as a batch of row data), and no worker is ever inserting the same data.
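A very rough sketch of what that pull-based worker could look like (the broker choice, queue/table names, and batch size here are all placeholders, not a recommendation):

```python
# One worker in a pool: pull a batch from the queue, insert it in a single
# transaction, and only ack after the commit succeeds. If anything fails,
# nack/requeue so another worker picks the messages up again.
import json
import time

import pika
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 100  # placeholder

pg = psycopg2.connect("dbname=tsdb user=app")
mq = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
channel = mq.channel()

while True:
    batch, tags = [], []
    for _ in range(BATCH_SIZE):
        method, _props, body = channel.basic_get(queue="ticks")
        if method is None:        # queue is empty right now
            break
        msg = json.loads(body)
        batch.append((msg["time"], msg["symbol"], msg["price"]))
        tags.append(method.delivery_tag)

    if not batch:
        time.sleep(1)
        continue

    try:
        with pg, pg.cursor() as cur:   # one transaction per batch
            execute_values(
                cur,
                "INSERT INTO ticks (time, symbol, price) VALUES %s",
                batch,
            )
        for tag in tags:
            channel.basic_ack(tag)                  # safe: data is committed
    except Exception:
        for tag in tags:
            channel.basic_nack(tag, requeue=True)   # put the batch back
```

The key point is that a message is only acknowledged after the database transaction commits, so a crashed worker just means the batch goes back on the queue for another worker.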
I know of numerous applications using architectures similar to this and handling millions of messages an hour.
One of the downsides of working solo and being self-taught is that I tend to fall into every anti-pattern trap possible! So I really appreciate your insights and experience.
I’ve already started researching deduplication with Kafka Streams and it looks super interesting, especially as it might coincidentally solve another issue: continuous aggregates of continuous aggregates (which I believe is not yet possible in TimescaleDB), i.e. use Kafka Streams to aggregate the raw data and present those aggregations to Timescale for further processing. Nice!
@ryanbooz, sorry, maybe I have misunderstood the concept with Kafka. Are you suggesting that I:
1. use some Kafka-related stream processing to deduplicate the records before insertion (e.g. a processing pipeline with an additional product like Samza or Kafka Streams) and thereby avoid the requirement for UNIQUE checks, or
2. use only Kafka in order to present the complete stream of records - duplicates and all - to TimescaleDB and continue with the Postgres/UNIQUE strategy for deduplication?
Having slept on it, it seems that your intention was (2) as this would create the ‘single source’ and thereby negate the anti-pattern of having multiple sources. And it is certainly way easier to implement than (1)!
I probably misunderstood exactly how your application is set up. It sounded like you were purposefully sending the same data to multiple backends to ensure you didn’t lose any data. If there’s natural duplication in the data (maybe something like an IoT device that might store batches when offline and they get sent a few times), then yes, you’d have to deal with that in the stream of data before it gets to TimescaleDB/Postgres.
In the end, however, UNIQUE constraints on high-throughput data are challenging, so dealing with that upstream would be the goal - essentially option (1).
Thanks @ryanbooz. Your assumption is exactly right; I have duplicated the websocket data source x3 to ensure no losses due to outages on its journey across the public internet.
Having spent a few days researching the topic, it seems that de-duping in Kafka is non-trivial. It's certainly way harder than setting a simple UNIQUE constraint, which just works! In terms of computational complexity, is there any benefit to switching to an INSERT ... ON CONFLICT statement instead? From what I understand, both methods can use a btree index on a specific chunk, so there shouldn't be any advantage to either method. Is this assumption correct?
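For clarity, this is the kind of statement I'm thinking of switching to (same made-up table as above):

```python
# Each replica fires the same batch and Postgres silently skips rows
# whose (time, symbol) already exists, instead of raising a
# unique-violation error.
import psycopg2
from psycopg2.extras import execute_values

rows = [("2024-01-01T00:00:00+00", "ABC", 42.0)]  # example batch

with psycopg2.connect("dbname=tsdb user=app") as conn:
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO ticks (time, symbol, price)
            VALUES %s
            ON CONFLICT (time, symbol) DO NOTHING
            """,
            rows,
        )
```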
I've never done duplicate inserts exactly like this, but when writing from dozens of Python containers to thousands of mutually overlapping tables I avoided deadlocks by sorting the tables' order per transaction. That way, table mutation happens in a consistent global order (the lock dependencies form a DAG, never a cycle): a transaction can block another one, but it won't deadlock, because if two transactions conflicted on an earlier (lower-sorted) lock/table they would have blocked there instead. I don't know if that's possible for your data, but that's my Timescale deadlock-avoidance strategy.
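Something like this, roughly (table names and row shapes are invented for the example):

```python
# Sketch: write each batch in a fixed (sorted) table order inside one
# transaction, so overlapping transactions queue up on the first contended
# table instead of deadlocking. Assumes table names come from trusted
# config, since they're interpolated into the SQL string.
import psycopg2
from psycopg2.extras import execute_values

def write_batch(conn, rows_by_table):
    """rows_by_table: {table_name: [(time, symbol, price), ...]}"""
    with conn, conn.cursor() as cur:
        for table in sorted(rows_by_table):    # consistent global lock order
            execute_values(
                cur,
                f"INSERT INTO {table} (time, symbol, price) VALUES %s "
                "ON CONFLICT DO NOTHING",
                rows_by_table[table],
            )
```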