What are the best practices for compress_segmentby?

Hi @LuxCore,

Segment-by columns are really important both for achieving higher compression rates and for speeding up queries on compressed chunks.

Compression is a balancing act, and you'll achieve the most if you can keep the compressed batches as full as possible. As you can read in the link I shared above, we cluster the data based on the segment-by values and then compress all the tuples in each batch together. You want to get as close to 1000 tuples per batch as possible for optimal performance.
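For reference, a minimal sketch of how the segment-by (and order-by) settings are applied; `metrics`, `device_id`, and `time` here are placeholder names, not taken from your schema:

```sql
-- Hypothetical hypertable: swap in your own table/column names.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',  -- rows sharing a device_id go into the same batch
    timescaledb.compress_orderby   = 'time DESC'   -- and are ordered by time inside each batch
);

-- Automatically compress chunks older than 7 days.
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```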

You want to select segment-by columns that have two main characteristics:

  • They group your data in a natural way → better compression, locality, and query performance when filtering on the segment-by column(s)
  • Each unique combination of values results in ~1000 tuples or more in your uncompressed chunk → better compression and locality (see the query sketch after this list)
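One way to sanity-check that second point is to count how many rows each candidate value contributes over one chunk's worth of data. A rough sketch, assuming the same placeholder `metrics`/`device_id` names and a 1-day chunk_time_interval, which you should replace with your own table, candidate column, and chunk interval:

```sql
-- Rows per candidate segment-by value over one chunk's time range.
-- Counts well below ~1000 suggest the column (on its own) is too
-- high-cardinality to fill compressed batches.
SELECT device_id, count(*) AS rows_in_chunk
FROM metrics
WHERE time >= now() - INTERVAL '1 day'   -- assumed chunk_time_interval
  AND time <  now()
GROUP BY device_id
ORDER BY rows_in_chunk ASC;
```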

From that perspective, you don’t want columns whose values each match too many records (e.g. col1 in your example), as they do not really offer that much on their own. But on the other hand, you also don’t want to pick a column with very high cardinality, as you may end up with too few records per segment. The extreme counter-example on this side would be picking a unique column for the segment-by, which would result in one record per compressed batch; you would compress each tuple on its own, achieving zero gains from compression (negative gains in reality, as we’d have to add compression metadata for every batch).
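Once you've picked a candidate and compressed a few chunks, you can compare how well different settings actually did. A hedged sketch, again assuming the hypothetical `metrics` hypertable and the TimescaleDB 2.x informational functions:

```sql
-- Per-chunk before/after sizes, to compare candidate segment-by choices.
SELECT chunk_name,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM chunk_compression_stats('metrics')
WHERE compression_status = 'Compressed';
```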
