Package: tokenizers.bpe 0.1.3

Jan Wijffels

tokenizers.bpe: Byte Pair Encoding Text Tokenization

Unsupervised text tokenizer focused on computational efficiency. Wraps the 'YouTokenToMe' library <https://github.com/VKCOM/YouTokenToMe> which is an implementation of fast Byte Pair Encoding (BPE) <https://aclanthology.org/P16-1162/>.
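To make the idea behind Byte Pair Encoding concrete, here is a toy sketch of a single merge step in base R: count adjacent symbol pairs weighted by word frequency, then fuse the most frequent pair into one symbol. The corpus and the helper names (count_pairs, merge_pair) are made up for illustration. This is not how 'YouTokenToMe' implements it internally; it shows only the basic principle.

```r
# Toy corpus (hypothetical): words split into characters, with frequencies.
words <- list(
  c("l", "o", "w"),
  c("l", "o", "w", "e", "r"),
  c("n", "e", "w", "e", "s", "t"),
  c("w", "i", "d", "e", "s", "t")
)
freq <- c(5, 2, 6, 3)

# Count every adjacent symbol pair, weighted by the frequency of its word.
count_pairs <- function(words, freq) {
  counts <- numeric(0)
  for (i in seq_along(words)) {
    w <- words[[i]]
    if (length(w) < 2) next
    for (p in paste(w[-length(w)], w[-1])) {
      old <- counts[p]
      counts[p] <- (if (is.na(old)) 0 else old) + freq[i]
    }
  }
  counts
}

# Replace each occurrence of the pair (a, b) in one word by the merged symbol.
merge_pair <- function(w, a, b) {
  out <- character(0)
  i <- 1
  while (i <= length(w)) {
    if (i < length(w) && w[i] == a && w[i + 1] == b) {
      out <- c(out, paste0(a, b))
      i <- i + 2
    } else {
      out <- c(out, w[i])
      i <- i + 1
    }
  }
  out
}

counts <- count_pairs(words, freq)
best <- names(which.max(counts))   # ties go to the pair seen first
parts <- strsplit(best, " ")[[1]]
merged <- lapply(words, merge_pair, a = parts[1], b = parts[2])
```

Repeating this step until the vocabulary reaches a target size yields the merge table that a BPE tokenizer applies to new text.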

Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], VK.com [cph], Gregory Popovitch [ctb, cph]

tokenizers.bpe_0.1.3.tar.gz
tokenizers.bpe_0.1.3.zip (r-4.5), tokenizers.bpe_0.1.3.zip (r-4.4), tokenizers.bpe_0.1.3.zip (r-4.3)
tokenizers.bpe_0.1.3.tgz (r-4.4-x86_64), tokenizers.bpe_0.1.3.tgz (r-4.4-arm64), tokenizers.bpe_0.1.3.tgz (r-4.3-x86_64), tokenizers.bpe_0.1.3.tgz (r-4.3-arm64)
tokenizers.bpe_0.1.3.tar.gz (r-4.5-noble), tokenizers.bpe_0.1.3.tar.gz (r-4.4-noble)
tokenizers.bpe_0.1.3.tgz (r-4.4-emscripten), tokenizers.bpe_0.1.3.tgz (r-4.3-emscripten)
tokenizers.bpe.pdf | tokenizers.bpe.html
tokenizers.bpe/json (API)
NEWS

# Install 'tokenizers.bpe' in R:
install.packages('tokenizers.bpe', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/bnosac/tokenizers.bpe/issues

Uses libs:
  • c++ - GNU Standard C++ Library v3
Datasets:
  • belgium_parliament - Dataset from 2017 with questions asked in the Belgian Federal Parliament

bpe, byte-pair-encoding, text-mining, tokenization

4.56 score, 15 stars, 48 scripts, 310 downloads, 1 mention, 4 exports, 1 dependency

Last updated 1 year ago from: 72ecec49fe. Checks: OK: 9. Indexed: yes.

Target               Result  Date
Doc / Vignettes      OK      Oct 11 2024
R-4.5-win-x86_64     OK      Oct 11 2024
R-4.5-linux-x86_64   OK      Oct 11 2024
R-4.4-win-x86_64     OK      Oct 11 2024
R-4.4-mac-x86_64     OK      Oct 11 2024
R-4.4-mac-aarch64    OK      Oct 11 2024
R-4.3-win-x86_64     OK      Oct 11 2024
R-4.3-mac-x86_64     OK      Oct 11 2024
R-4.3-mac-aarch64    OK      Oct 11 2024

Exports: bpe, bpe_decode, bpe_encode, bpe_load_model

Dependencies: Rcpp
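
As a rough sketch of how the four exported functions fit together, the following trains a model on the bundled belgium_parliament dataset, encodes a sentence, decodes it again and reloads the model from disk. The argument values (coverage, vocab_size, threads), the temporary file paths and the sample sentence are illustrative assumptions, not recommendations; check the package manual for the exact signatures and defaults.

```r
library(tokenizers.bpe)

# Example data shipped with the package; keep only the French questions.
data(belgium_parliament, package = "tokenizers.bpe")
french <- subset(belgium_parliament, language == "french")

# Temporary paths chosen for this sketch only.
train_file <- tempfile(fileext = ".txt")
model_file <- tempfile(fileext = ".bpe")
writeLines(french$text, con = train_file)

# Train a BPE model on the text file (settings are assumed, illustrative values).
model <- bpe(train_file, coverage = 0.999, vocab_size = 5000,
             threads = 1, model_path = model_file)

# Encode a sentence as subword strings and as integer ids.
text <- "L'appartement est grand & vraiment bien situe en plein centre"
subwords <- bpe_encode(model, x = text, type = "subwords")
ids <- bpe_encode(model, x = text, type = "ids")

# Map the ids back to text.
decoded <- bpe_decode(model, ids)

# Reload the saved model in a later session.
reloaded <- bpe_load_model(model_file)
```

Writing the training text to a file first keeps the sketch close to the file-based interface of the underlying 'YouTokenToMe' library.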