---
title: "Dictionaries"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dictionaries}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>"
)
```
```{r setup}
library(lz4lite)
```
`LZ4` can use *dictionaries* to improve the compression ratio of small
inputs, e.g. data of around 2kB.

This is useful when you have many small messages with similar structure
and content, e.g. log messages.

The dictionary can be any raw vector, but the original `LZ4` documentation suggests that
training it with `zstd` is the best route.

**Note:** if data is serialized with a dictionary, then that same dictionary must
be supplied when unserializing.

This vignette demonstrates the creation of the samples needed to train a dictionary,
the training itself, and the effect of the dictionary on compressed size.

## Overview
1. Create a directory of sample messages, one sample per file.
2. Use `zstd` to train the dictionary.
3. Load the dictionary as a raw vector.
4. Supply the dictionary when calling `lz4_serialize()` and `lz4_unserialize()`.
## Create some data
This is a contrived example mimicking log messages. Each log message has a
timestamp, an index, and some associated text.

The following code generates 2000 sample messages, one per file, which will be
used to train a dictionary.
```{r eval=FALSE}
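# Directory that will hold one sample message per file
tmp <- "working/samples"
dir.create(tmp, showWarnings = FALSE, recursive = TRUE)

# Message template: %s is filled with a timestamp, %i with the message index
template <- r"(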
%s %i
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum
)"
# Write each message to its own zero-padded file, e.g. 00001.txt
for (i in 1:2000) {
  msg <- sprintf(template, Sys.time(), i)
  file <- sprintf("%s/%05i.txt", tmp, i)
  writeLines(msg, file)
}
```
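
Before training, it can be worth confirming that the samples were written and that they are in the small-message size range where a dictionary helps. The following is a minimal sketch using base R only; `sample_summary()` is a hypothetical helper introduced here for illustration and is not part of `lz4lite`.

```{r eval=FALSE}
# Hypothetical helper (not part of lz4lite): count the sample files in a
# directory and report their smallest and largest size in bytes.
sample_summary <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  if (length(files) == 0) {
    return(c(n_files = 0, min_bytes = NA, max_bytes = NA))
  }
  sizes <- file.size(files)
  c(n_files = length(files), min_bytes = min(sizes), max_bytes = max(sizes))
}

# Only inspect the samples if the directory created above exists
if (dir.exists("working/samples")) {
  sample_summary("working/samples")
}
```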
## Train a dictionary using `zstd`
Train a 4kB dictionary using `zstd`. The appropriate dictionary size will depend on
the type of message. A size of 64kB is quite common.
```{r eval=FALSE}
system("zstd --train working/samples/* -o working/test.dict --maxdict=4KB")
```
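
The training step relies on the `zstd` command-line tool being installed, and `system()` only reports failure through its return value. The sketch below wraps the same command in a check for the executable and for a non-zero exit status. `train_dict()` and its argument names are hypothetical, introduced only for this illustration; the `zstd` flags are the ones used above.

```{r eval=FALSE}
# Hypothetical wrapper (not part of lz4lite or zstd) around the command above
train_dict <- function(samples_glob, dict_path, maxdict = "4KB") {
  # Sys.which() returns "" when the executable is not on the PATH
  if (!nzchar(Sys.which("zstd"))) {
    warning("zstd not found on the PATH; no dictionary was trained")
    return(invisible(FALSE))
  }
  cmd <- sprintf(
    "zstd --train %s -o %s --maxdict=%s",
    samples_glob, shQuote(dict_path), maxdict
  )
  # system() returns the exit status of the command; 0 means success
  status <- system(cmd)
  if (status != 0) {
    stop("zstd dictionary training failed with exit status ", status)
  }
  invisible(file.exists(dict_path))
}

# Only attempt training if the sample directory exists
if (dir.exists("working/samples")) {
  train_dict("working/samples/*", "working/test.dict")
}
```

The glob is left unquoted on purpose so that the shell expands it into the list of sample files, as in the original command.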
## Read the dictionary and use with `lz4_serialize()`/`lz4_unserialize()`
```{r eval=FALSE}
# Read the entire dictionary file into a raw vector
dict <- readBin("working/test.dict", raw(), file.size("working/test.dict"))
```
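
As noted earlier, the same dictionary must be supplied when unserializing. The following round-trip sketch illustrates this. Two assumptions are made: the dictionary is a hand-made stand-in (any raw vector is accepted, so a trained `zstd` dictionary is not required to try this out), and `lz4_unserialize()` is assumed to take the dictionary through a `dict` argument in the same way as `lz4_serialize()`.

```{r eval=FALSE}
library(lz4lite)

# Stand-in dictionary for illustration only: raw bytes of text that
# resembles the messages, rather than a zstd-trained dictionary.
dict <- charToRaw("Lorem ipsum dolor sit amet, consectetur adipiscing elit")

msg <- "2024-01-01 12:00:00 42 Lorem ipsum dolor sit amet, consectetur adipiscing elit"

# Compress with the dictionary
compressed <- lz4_serialize(msg, dict = dict)

# Decompress with the *same* dictionary.
# Assumption: lz4_unserialize() accepts the dictionary via `dict`,
# mirroring lz4_serialize().
restored <- lz4_unserialize(compressed, dict = dict)

# Check that the round trip reproduces the original object
identical(restored, msg)
```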
## Test the effect of a dictionary for a new message
When this dictionary is supplied while compressing a new message, the compressed
size is much smaller than without it.
```{r eval = TRUE, echo = FALSE}
template <- r"(
%s %i: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum
)"
dict <- readBin("dict/test.dict", raw(), file.size("dict/test.dict"))
```
```{r eval = TRUE}
# Create a new message
msg <- sprintf(template, Sys.time(), 9999)
# No dictionary
lz4_serialize(msg) |> length()
# Compress with dictionary
lz4_serialize(msg, dict = dict) |> length()
```