Title: | Fast Compression and Serialization with 'Zstandard' Algorithm |
---|---|
Description: | Fast, compressed serialization of R objects using the 'Zstandard' algorithm. The included zstandard connection ('zstdfile()') can be used to read/write compressed data by any code which supports R's built-in 'connections' mechanism. Dictionaries are supported for more effective compression of small data, and functions are provided for training these dictionaries. This implementation provides an R interface to advanced features of the 'Zstandard' 'C' library (available from <https://github.com/facebook/zstd>). |
Authors: | Mike Cheng [aut, cre, cph], Yann Collet [aut] (Author of the embedded Zstandard library), Meta Platforms, Inc. and affiliates. [cph] (Facebook is the copyright holder of the bundled Zstandard library) |
Maintainer: | Mike Cheng <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.10 |
Built: | 2025-01-13 08:26:18 UTC |
Source: | https://github.com/coolbutuseless/zstdlite |
Compression contexts can be re-used, meaning that they don't have to be created each time a compression function is called. This can make things faster when performing multiple compression operations.
zstd_cctx(level = 3L, num_threads = 1L, include_checksum = FALSE, dict = NULL)
level |
Compression level. Default: 3. Valid range is [-5, 22], with -5 representing the mode with least compression and 22 representing the mode with most compression. Note |
num_threads |
Number of compression threads. Default: 1. Using more threads can result in faster compression, but the magnitude of this speed-up depends on many factors, e.g. CPU, drive speed, type of data, compression level, etc. |
include_checksum |
Include a checksum with the compressed data?
Default: FALSE. If |
dict |
Dictionary. Default: NULL. Can either be a raw vector or a filename.
This dictionary can be created with |
External pointer to a ZSTD Compression Context which can be passed to zstd_serialize() and zstd_compress()
cctx <- zstd_cctx(level = 4)
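As a minimal sketch of the re-use described above, the following creates one compression context and passes it to several zstd_serialize() calls. The list of data frames is a hypothetical workload chosen only for illustration; any R objects would do.

```r
library(zstdlite)

# Create the compression context once, outside the loop
cctx <- zstd_cctx(level = 4)

# Hypothetical workload: a few built-in data frames to compress one by one
objs <- list(mtcars, iris, airquality)

# Each call re-uses the same context instead of creating a new one
compressed <- lapply(objs, function(obj) zstd_serialize(obj, cctx = cctx))

# Round-trip check: every object should come back unchanged
restored <- lapply(compressed, zstd_unserialize)
all_ok <- all(mapply(identical, objs, restored))
```

Whether the saving is noticeable depends on how many calls are made and how small each object is, so it is worth timing on real data.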
Get the configuration settings of a compression context
zstd_cctx_settings(cctx)
cctx |
ZSTD compression context, as created by |
named list of configuration options
cctx <- zstd_cctx()
zstd_cctx_settings(cctx)
These functions are appropriate when handling data from other systems, e.g. data compressed with the zstd command-line tool or other compression programs.
zstd_compress(x, ..., dst = NULL, cctx = NULL, use_file_streaming = FALSE)

zstd_decompress(src, type = "raw", ..., dctx = NULL, use_file_streaming = FALSE)
x |
Data to be compressed. This may be a raw vector, or a character string |
... |
extra arguments passed to |
dst |
destination in which to write the compressed data. If |
cctx |
ZSTD Compression Context created by |
use_file_streaming |
Use the streaming interface when reading or writing to a file? This may reduce memory allocations and make better use of multithreading. Default: FALSE |
src |
Source from which compressed data is read. If a string,
then this will be the filename to read data from. |
type |
Should data be returned as a 'raw' vector or as a 'string'? Default: 'raw' |
dctx |
ZSTD Decompression Context created by |
Raw vector of compressed data, or NULL if the compressed data was written to a file
# With raw vectors
dat <- sample(as.raw(1:10), 1000, replace = TRUE)
vec <- zstd_compress(x = dat)
zstd_decompress(src = vec)

# With files
tmp <- tempfile()
zstd_compress(x = dat, dst = tmp)
zstd_decompress(src = tmp)

# With connections
tmp <- tempfile()
zstd_compress(x = dat, dst = file(tmp))
zstd_decompress(src = file(tmp))
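The arguments above note that x may be a character string and that type controls whether a raw vector or a string is returned. A short sketch of that string round trip follows; the text itself is an arbitrary example.

```r
library(zstdlite)

# An arbitrary string to compress (hypothetical example data)
txt <- "Zstandard handles character strings as well as raw vectors"

# Compress the string; the result is a raw vector of compressed bytes
vec <- zstd_compress(x = txt)

# Ask for the decompressed data back as a string rather than raw bytes
out <- zstd_decompress(src = vec, type = "string")

# Check that the round trip preserved the text
same <- identical(out, txt)
```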
Decompression contexts can be re-used, meaning that they don't have to be created each time a decompression function is called. This can make things faster when performing multiple decompression operations.
zstd_dctx(validate_checksum = TRUE, dict = NULL)
validate_checksum |
If a checksum is present on the compressed data, should the checksum be validated? Default: TRUE. Set to |
dict |
Dictionary. Default: NULL. Can either be a raw vector or a filename.
This dictionary can be created with |
External pointer to a ZSTD Decompression Context which can be passed to zstd_unserialize() and zstd_decompress()
dctx <- zstd_dctx(validate_checksum = FALSE)
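The checksum options on the two contexts are meant to be used together. The sketch below writes a checksum during compression, confirms its presence with zstd_info(), and decompresses with validation switched on. The as.logical() wrapper is a defensive assumption about the type of the has_checksum field.

```r
library(zstdlite)

# Compression context that appends a checksum to the compressed data
cctx <- zstd_cctx(include_checksum = TRUE)

# Decompression context that validates the checksum if one is present
dctx <- zstd_dctx(validate_checksum = TRUE)

dat <- as.raw(rep(1:10, 100))
vec <- zstd_compress(dat, cctx = cctx)

# Confirm that the stream reports a checksum
has_checksum <- as.logical(zstd_info(vec)$has_checksum)

# Decompress with validation enabled
out <- zstd_decompress(vec, dctx = dctx)
```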
Get the configuration settings of a decompression context
zstd_dctx_settings(dctx)
dctx |
ZSTD decompression context, as created by |
named list of configuration options
dctx <- zstd_dctx()
zstd_dctx_settings(dctx)
Dictionary IDs are generated automatically when a dictionary is created. When a dictionary is used for compression, the same dictionary must be used for decompression. ZSTD checks internally that the IDs match when attempting to decompress. This function exposes the dictionary ID to aid in handling and tracking dictionaries in R.
zstd_dict_id(dict)
dict |
raw vector or filename. This object could contain either a zstd dictionary, or a compressed object. If it is a compressed object, then it will return the dictionary id which was used to compress it. |
Signed integer value representing the Dictionary ID. If the data is neither a dictionary nor data which was compressed with a dictionary, then a value of 0 is returned.
dict_file <- system.file("sample_dict.raw", package = "zstdlite", mustWork = TRUE)
dict <- readBin(dict_file, raw(), file.size(dict_file))
zstd_dict_id(dict)
compressed_mtcars <- zstd_serialize(mtcars, dict = dict)
zstd_dict_id(compressed_mtcars)
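One way to use the exposed ID, sketched here under the assumption that the bundled sample dictionary suits the data, is to compare the ID stored in the compressed object with the ID of the dictionary at hand before attempting to decompress.

```r
library(zstdlite)

# Load the sample dictionary shipped with the package
dict_file <- system.file("sample_dict.raw", package = "zstdlite", mustWork = TRUE)
dict <- readBin(dict_file, raw(), file.size(dict_file))

# Compress with a context that carries the dictionary
compressed <- zstd_serialize(mtcars, cctx = zstd_cctx(dict = dict))

# Compare the ID recorded in the compressed data with the dictionary's own ID
ids_match <- zstd_dict_id(compressed) == zstd_dict_id(dict)

# Only decompress when the dictionary is the right one
restored <- NULL
if (ids_match) {
  restored <- zstd_unserialize(compressed, dctx = zstd_dctx(dict = dict))
}
```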
Return information about the zstd stream
zstd_info(src)
src |
raw vector, file or connection |
named list with compressed_size, uncompressed_size, dict_id and has_checksum. If an error occurs, or the data does not appear to represent Zstandard compressed data, the function returns NULL
data <- as.raw(sample(1:2, 10000, replace = TRUE))
cdata <- zstd_compress(data)
zstd_info(cdata)
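The fields of the returned list can be combined, for instance to estimate a compression ratio. This is a small sketch using the same kind of data as the example above; the NULL check reflects the documented behaviour when the input is not recognised as Zstandard data.

```r
library(zstdlite)

# Highly repetitive data (only two distinct byte values)
data <- as.raw(sample(1:2, 10000, replace = TRUE))
cdata <- zstd_compress(data)

info <- zstd_info(cdata)

# zstd_info() returns NULL if the data is not valid Zstandard data
ratio <- NA_real_
if (!is.null(info)) {
  # Ratio of original size to compressed size; larger means better compression
  ratio <- info$uncompressed_size / info$compressed_size
}
```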
Serialize/Unserialize arbitrary R objects to a compressed stream of bytes using Zstandard
zstd_serialize(robj, ..., dst = NULL, cctx = NULL, use_file_streaming = FALSE)

zstd_unserialize(src, ..., dctx = NULL, use_file_streaming = FALSE)
robj |
Any R object understood by |
... |
extra arguments passed to |
dst |
filename in which to serialize data. If NULL (the default), then serialize the results to a raw vector |
cctx |
ZSTD Compression Context created by |
use_file_streaming |
Use the streaming interface when reading or writing to a file? This may reduce memory allocations and make better use of multithreading. Default: FALSE |
src |
Raw vector or filename containing a ZSTD compressed serialized representation of an R object |
dctx |
ZSTD Decompression Context created by |
Raw vector of compressed serialized data, or NULL if the compressed data was written to a file
# Raw vector
vec <- zstd_serialize(mtcars)
zstd_unserialize(src = vec)

# file
tmp <- tempfile()
zstd_serialize(mtcars, dst = tmp)
zstd_unserialize(src = tmp)

# connection
tmp <- tempfile()
zstd_serialize(mtcars, dst = file(tmp))
zstd_unserialize(src = file(tmp))
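The use_file_streaming argument only applies when reading or writing files. A minimal sketch of a file round trip with streaming enabled on both sides follows; the compression level of 10 is an arbitrary choice for illustration.

```r
library(zstdlite)

tmp <- tempfile()

# Context with a higher compression level (arbitrary value for illustration)
cctx <- zstd_cctx(level = 10)

# Write the serialized object to a file using the streaming interface
zstd_serialize(mtcars, dst = tmp, cctx = cctx, use_file_streaming = TRUE)

# Read it back, again using the streaming interface
restored <- zstd_unserialize(src = tmp, use_file_streaming = TRUE)

# Remove the temporary file
unlink(tmp)
```

For an object as small as mtcars no difference is likely to be visible; any benefit would be expected to show up with large objects.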
Train a dictionary for use with zstd_compress() and zstd_decompress()
This function requires multiple samples representative of the expected data to train a dictionary for use during compression.
zstd_train_dict_compress(samples, size = 1e+05, optim = FALSE, optim_shrink_allow = 0)
samples |
list of raw vectors, or length-1 character vectors. Each raw vector or string should be a complete example of something to be compressed with |
size |
Maximum size of dictionary in bytes. Default: 112640 (110 kB)
matches the default size set by the command line version of |
optim |
Optimize the dictionary? Default: FALSE. If TRUE, then ZSTD will spend time optimizing the dictionary. This can be a very lengthy operation. |
optim_shrink_allow |
integer value representing a percentage.
If non-zero, then a search will be carried out for dictionaries of a
smaller size which are up to |
raw vector containing a ZSTD dictionary
# This example shows the mechanics of creating and training a dictionary but
# may not be a great example of when a dictionary might be useful
cars <- rownames(mtcars)
samples <- lapply(seq_len(1000), \(x) serialize(sample(cars), NULL))
zstd_train_dict_compress(samples, size = 5000)
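To show where a trained dictionary goes next, this sketch passes it to both a compression and a decompression context and round-trips one of the training samples. The size comparison at the end is informational only; no particular saving is claimed.

```r
library(zstdlite)

# Train a dictionary on many small, similar samples
cars <- rownames(mtcars)
samples <- lapply(seq_len(1000), \(x) serialize(sample(cars), NULL))
dict <- zstd_train_dict_compress(samples, size = 5000)

# The same dictionary must be supplied for compression and decompression
cctx <- zstd_cctx(dict = dict)
dctx <- zstd_dctx(dict = dict)

# Round-trip one sample using the dictionary
vec <- zstd_compress(samples[[1]], cctx = cctx)
out <- zstd_decompress(vec, dctx = dctx)

# Compare compressed sizes with and without the dictionary
sizes <- c(with_dict = length(vec),
           without_dict = length(zstd_compress(samples[[1]])))
```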
Train a dictionary for use with zstd_serialize() and zstd_unserialize()
zstd_train_dict_serialize(samples, size = 1e+05, optim = FALSE, optim_shrink_allow = 0)
samples |
list of example R objects to train a dictionary to be
used with |
size |
Maximum size of dictionary in bytes. Default: 112640 (110 kB)
matches the default size set by the command line version of |
optim |
Optimize the dictionary? Default: FALSE. If TRUE, then ZSTD will spend time optimizing the dictionary. This can be a very lengthy operation. |
optim_shrink_allow |
integer value representing a percentage.
If non-zero, then a search will be carried out for dictionaries of a
smaller size which are up to |
raw vector containing a ZSTD dictionary
# This example shows the mechanics of creating and training a dictionary but
# may not be a great example of when a dictionary might be useful
cars <- rownames(mtcars)
samples <- lapply(seq_len(1000), \(x) sample(cars))
zstd_train_dict_serialize(samples, size = 5000)
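Since the dict argument of the contexts accepts either a raw vector or a filename, a trained dictionary can be stored on disk and referenced by path. The sketch below assumes a temporary file as the storage location; in practice the dictionary would be kept somewhere both the writer and the reader can reach.

```r
library(zstdlite)

# Train a dictionary from example R objects
cars <- rownames(mtcars)
samples <- lapply(seq_len(1000), \(x) sample(cars))
dict <- zstd_train_dict_serialize(samples, size = 5000)

# Save the dictionary to a file (temporary path used for illustration)
dict_path <- tempfile(fileext = ".dict")
writeBin(dict, dict_path)

# Create contexts that load the dictionary by filename
cctx <- zstd_cctx(dict = dict_path)
dctx <- zstd_dctx(dict = dict_path)

# Round-trip a new object of the same shape as the training samples
new_obj <- sample(cars)
vec <- zstd_serialize(new_obj, cctx = cctx)
out <- zstd_unserialize(vec, dctx = dctx)
```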
Get version string of zstd C library
zstd_version()
String containing version number of zstd C library
zstd_version()
Create a file connection which uses Zstandard compression.
zstdfile(description, open = "", ..., cctx = NULL, dctx = NULL)
description |
zstandard filename |
open |
character string. A description of how to open the connection if
it is to be opened upon creation e.g. "rb". Default "" (empty string) means
to not open the connection on creation - user must still call |
... |
Other named arguments which override the contexts e.g. |
cctx, dctx |
compression/decompression contexts created by |
This zstdfile() connection works like R's built-in connections (e.g. gzfile(), xzfile()) but uses the Zstandard algorithm to compress/decompress the data.
This connection works with both ASCII and binary data, e.g. using readLines() and readBin().
# Binary
tmp <- tempfile()
dat <- as.raw(1:255)
writeBin(dat, zstdfile(tmp, level = 20))
readBin(zstdfile(tmp), raw(), 1000)

# Text
tmp <- tempfile()
txt <- as.character(mtcars)
writeLines(txt, zstdfile(tmp))
readLines(zstdfile(tmp))
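The examples above let writeBin() and readBin() open the connection implicitly. A sketch of the explicit alternative, which opens the connection on creation and closes it by hand, is given below. The "wb" mode is assumed by analogy with the documented "rb" and with R's built-in file connections, and the compression level is arbitrary.

```r
library(zstdlite)

tmp <- tempfile()
dat <- as.raw(1:255)

# Open for binary writing on creation, supplying a compression context
# ("wb" is assumed to be accepted in the same way as the documented "rb")
con <- zstdfile(tmp, open = "wb", cctx = zstd_cctx(level = 10))
writeBin(dat, con)
close(con)

# Open for binary reading and read the data back
con <- zstdfile(tmp, open = "rb")
out <- readBin(con, raw(), 1000)
close(con)

# Remove the temporary file
unlink(tmp)
```

Managing open and close explicitly makes it clear when the compressed data has been flushed to disk.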