Using a dictionary can be beneficial when compressing lots of small objects with similar structure or content.
Instead of starting from scratch for every new object, the compressor can use a trained dictionary that holds information common to all of the objects. It’s like giving the compressor a bit of a head start.
The following notes are from the zstd dictionary documentation:
Zstd can use dictionaries to improve compression ratio of small data. Traditionally small files don’t compress well because there is very little repetition in a single sample, since it is small. But, if you are compressing many similar files, like a bunch of JSON records that share the same structure, you can train a dictionary ahead of time on some samples of these files. Then, zstd can use the dictionary to find repetitions that are present across samples. This can vastly improve compression ratio.
Dictionaries are useful when compressing many small files that are similar. The larger a file is, the less benefit a dictionary will have. Generally, we don’t expect dictionary compression to be effective past 100KB. And the smaller a file is, the more we would expect the dictionary to help.
Gather samples from your use case. These samples should be similar to each other. If you have several use cases, you could try to train one dictionary per use case. If the dictionary training function fails, that is likely because you either passed too few samples, or a dictionary would not be effective for your data.
A reasonable dictionary size, the dictBufferCapacity, is about 100KB. The zstd CLI defaults to a 110KB dictionary. You likely don’t need a dictionary larger than that. But, most use cases can get away with a smaller dictionary. The advanced dictionary builders can automatically shrink the dictionary for you, and select the smallest size that doesn’t hurt compression ratio too much. See the shrinkDict parameter. A smaller dictionary can save memory, and potentially speed up compression.
We generally recommend passing ~100x the size of the dictionary in samples. A few thousand should suffice. Having too few samples can hurt the dictionary’s effectiveness. Having more samples will only improve the dictionary’s effectiveness. But having too many samples can slow down the dictionary builder.
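The ~100x guideline can be checked with a little arithmetic before training. The sketch below uses base R only. `sample_budget()` is a hypothetical helper, not part of zstd or any package, and it assumes that the serialized size of each sample is a fair proxy for what the dictionary builder sees.

```r
# Hypothetical helper: compare the total serialized size of the training
# samples against the ~100x rule of thumb.
sample_budget <- function(samples, dict_size, target_ratio = 100) {
  # Size in bytes of each sample after (uncompressed) R serialization
  sizes <- vapply(samples, function(s) length(serialize(s, NULL)), integer(1))
  total <- sum(sizes)
  list(
    n_samples   = length(samples),
    total_bytes = total,
    ratio       = total / dict_size,            # sample bytes per dictionary byte
    enough      = total >= target_ratio * dict_size
  )
}

# Illustrative data: 1000 small named integer vectors
samples <- lapply(1:1000, function(i) setNames(sample(50), paste0("country", 1:50)))
budget  <- sample_budget(samples, dict_size = 5000)
budget$ratio
budget$enough
```

If `enough` is `FALSE`, either gather more samples or request a smaller dictionary.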
To determine whether a dictionary will be effective, simply train one and try it out.
You should retrain a dictionary when its effectiveness drops. Dictionary effectiveness drops as the data you are compressing changes. Generally, we do expect dictionaries to “decay” over time, as your data changes, but the rate at which they decay depends on your use case. Internally, we regularly retrain dictionaries, and if the new dictionary performs significantly better than the old dictionary, we will ship the new dictionary.
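One way to notice this decay is to track the compressed-to-raw size ratio on recent data and compare it with the ratio measured when the dictionary was trained. The sketch below is only an illustration of that idea: `should_retrain()` and its `tolerance` threshold are hypothetical, and the byte counts are made-up numbers rather than measurements.

```r
# Hypothetical retraining check: compare the compressed/raw size ratio on a
# recent batch against the ratio recorded when the dictionary was trained.
should_retrain <- function(raw_bytes, compressed_bytes, baseline_ratio,
                           tolerance = 0.10) {
  current_ratio <- sum(compressed_bytes) / sum(raw_bytes)
  # Flag a retrain when the ratio has worsened by more than `tolerance`
  current_ratio > baseline_ratio * (1 + tolerance)
}

# Illustrative numbers only: baseline ratio of 0.60 at training time
stable  <- should_retrain(c(1000, 1200), c(620, 750), baseline_ratio = 0.60)
decayed <- should_retrain(c(1000, 1200), c(750, 900), baseline_ratio = 0.60)
```

Here `stable` is `FALSE` because the ratio is still within 10% of the baseline, and `decayed` is `TRUE` because the ratio has risen well past it.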
The following shows that, for this specific example, using a dictionary gives compressed output that is ~37% smaller (62.9% of the no-dictionary size) in ~75% of the time.
library(zstdlite) # provides the zstd_*() functions used below
set.seed(2024)
countries <- rownames(LifeCycleSavings)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create 'test' and 'train' datasets
# In this example consider the case of having a named vector of rankings of
# countries. Each ranking will be compressed separately and stored (say in a database)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
train_samples <- lapply(
  1:1000,
  \(x) setNames(sample(length(countries)), countries)
)
test_samples <- lapply(
  1:1000,
  \(x) setNames(sample(length(countries)), countries)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a dictionary
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dict <- zstd_train_dict_serialize(train_samples, size = 5000, optim = FALSE)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Setup Compression/Decompression contexts to use this dictionary
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cctx_nodict <- zstd_cctx(level = 3) # No dictionary. For comparison
cctx_dict <- zstd_cctx(level = 3, dict = dict)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# When using the dictionary, what is the size of the compressed data compared
# to not using a dictionary here?
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
s1 <- lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_nodict)) |> lengths() |> sum()
s2 <- lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_dict )) |> lengths() |> sum()
cat(round(s2/s1 * 100, 1), "%")
#> 62.9 %
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Simple benchmark to test speed when using a dictionary.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bench::mark(
  "No Dict" = lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_nodict)),
  "Dict"    = lapply(test_samples, \(x) zstd_serialize(x, cctx = cctx_dict )),
  check = FALSE
)[, 1:5]
#> # A tibble: 2 × 5
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 No Dict 17.6ms 17.8ms 56.1 18MB
#> 2 Dict 13.3ms 13.6ms 73.0 18MB
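The example above only compresses. Data compressed with a dictionary needs the same dictionary to decompress, so the dictionary has to be stored alongside the data. The following is a minimal round-trip sketch. It assumes the functions come from the zstdlite package, that `zstd_dctx()` accepts a `dict` argument in the same way `zstd_cctx()` does, and that `zstd_unserialize()` takes the context through `dctx`.

```r
library(zstdlite)  # assumed to provide the zstd_*() functions used above

set.seed(2024)
countries <- rownames(LifeCycleSavings)
samples <- lapply(1:1000, \(x) setNames(sample(length(countries)), countries))

# Train a dictionary, then build matching compression/decompression contexts
dict <- zstd_train_dict_serialize(samples, size = 5000, optim = FALSE)
cctx <- zstd_cctx(level = 3, dict = dict)
dctx <- zstd_dctx(dict = dict)  # assumption: `dict` argument mirrors zstd_cctx()

# Compress one object with the dictionary, then restore it
original   <- samples[[1]]
compressed <- zstd_serialize(original, cctx = cctx)
restored   <- zstd_unserialize(compressed, dctx = dctx)

identical(original, restored)
```

Keeping the compression and decompression contexts as a pair makes it harder to read data back with the wrong dictionary.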