LZ4 can use a dictionary to improve the
compression of small data, e.g. ~2kB.
This is useful when you have many small messages with similar structure and content, e.g. log messages.
The dictionary can be any raw vector, but the original LZ4
documentation suggests that using zstd to create the
dictionary is the best route.
Note: if data is serialized with a dictionary then that same dictionary must be presented when unserializing.
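Because a dictionary is just a raw vector, it can be saved to a file and read back so that exactly the same bytes are available when unserializing. The following is a minimal base-R sketch of that round trip; the sample text and the temporary file path are illustrative assumptions, and no lz4 functions are called.

```r
# Illustrative only: any raw vector can act as a dictionary.
# Here the bytes of one representative message are used (an assumption
# for demonstration, not a trained dictionary).
sample_msg <- "2024-01-01 12:00:00 1\nLorem ipsum dolor sit amet"
dict <- charToRaw(sample_msg)

# Persist the dictionary so the identical bytes can be supplied later
dict_path <- tempfile(fileext = ".dict")
writeBin(dict, dict_path)

# Read the whole file back as a raw vector
dict_loaded <- readBin(dict_path, what = raw(), n = file.size(dict_path))

identical(dict, dict_loaded)
```

Storing the dictionary alongside the compressed data (or versioning it) is one way to make sure the writer and the reader always agree on it.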
This vignette demonstrates:

- the creation of samples needed to train a dictionary
- using zstd to create the dictionary
- using the dictionary with lz4_serialize() and lz4_unserialize()

This is a contrived example mimicking log messages. Each log message has a time, an index, and some text associated with it.
The following code generates 2000 message samples which will be used to train a dictionary.
tmp <- "working/samples"
dir.create(tmp, showWarnings = FALSE, recursive = TRUE)
template <- r"(
%s %i
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum
)"
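for (i in 1:2000) {
  # Fill the template with the current time and the message index
  msg <- sprintf(template, Sys.time(), i)
  # One file per message, with a zero-padded name such as 00001.txt
  file <- sprintf("%s/%05i.txt", tmp, i)
  writeLines(msg, file)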
}

zstd

Train a 4kB dictionary using zstd. The appropriate dictionary size will
depend on the type of message. A size of 64kB is quite common.
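As a sketch of the training step, the zstd command-line tool can be called from R with its `--train`, `-o` and `--maxdict` options. This assumes the zstd CLI is installed and on the PATH. The helper name `zstd_train_args()` and the output path `working/dict.raw` are hypothetical choices made for this illustration.

```r
# Hypothetical helper: build the argument vector for `zstd --train`.
#   sample_dir: directory holding the sample files
#   dict_path:  where zstd should write the trained dictionary
#   max_bytes:  upper limit on dictionary size (4096 = 4kB)
zstd_train_args <- function(sample_dir, dict_path, max_bytes = 4096) {
  files <- list.files(sample_dir, pattern = "\\.txt$", full.names = TRUE)
  c("--train", shQuote(files),
    "-o", shQuote(dict_path),
    sprintf("--maxdict=%d", as.integer(max_bytes)))
}

# Output path is an assumption for this sketch
args <- zstd_train_args("working/samples", "working/dict.raw")

# Only run when the zstd CLI is available and at least one sample was found
# (with no samples, `args` holds just the 4 fixed arguments)
if (nzchar(Sys.which("zstd")) && length(args) > 4) {
  system2("zstd", args)
}
```

Passing every sample file explicitly keeps the call easy to follow, but very large sample sets may exceed the operating system's command-line length limit.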
lz4_serialize()/lz4_unserialize()