Package: sentencepiece 0.2.3

Jan Wijffels

sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling

Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.

Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], Google Inc. [ctb, cph], Kenton Varda [ctb, cph], Sanjay Ghemawat [ctb, cph], Jeff Dean [ctb, cph], Laszlo Csomor [ctb, cph], Wink Saville [ctb, cph], Jim Meehan [ctb, cph], Chris Atenasio [ctb, cph], Jason Hsueh [ctb, cph], Anton Carver [ctb, cph], Maxim Lifantsev [ctb, cph], Susumu Yata [ctb, cph], Yuta Mori [ctb, cph], Benjamin Heinzerling [ctb, cph]

# Install 'sentencepiece' in R:

install.packages('sentencepiece', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))
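Once the package is installed, the exported functions listed further down can be combined into a train, encode, decode round trip. The following is a minimal sketch, not taken from the package documentation: the toy corpus, the `vocab_size = 60` setting, and the variable names are assumptions chosen for illustration, and the argument names follow the package's help pages as best understood.

```r
library(sentencepiece)

# Toy corpus written to a temporary file; illustrative data only.
corpus_file <- tempfile(fileext = ".txt")
sentences <- c("the quick brown fox jumps over the lazy dog",
               "tokenizers split text into words and subword units")
writeLines(rep(sentences, times = 50), con = corpus_file)

# Train a small byte pair encoding model. vocab_size = 60 is an arbitrary
# value that suits this tiny corpus; real corpora typically use thousands.
model <- sentencepiece(corpus_file, type = "bpe", vocab_size = 60,
                       model_dir = tempdir())

# Encode text as subword pieces and as integer ids.
pieces <- sentencepiece_encode(model, x = "the lazy fox", type = "subwords")
ids    <- sentencepiece_encode(model, x = "the lazy fox", type = "ids")

# Decode the ids back to text.
decoded <- sentencepiece_decode(model, ids)
print(pieces)
print(decoded)
```

To use a pretrained model instead of training one, `sentencepiece_download_model` and `sentencepiece_load_model` from the export list are the intended entry points; see their help pages for the available languages and vocabulary sizes.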

Bug tracker: https://github.com/bnosac/sentencepiece/issues

Uses libs:

c++: GNU Standard C++ Library v3

On CRAN:

Topics: byte, natural-language-processing, sentencepiece, word-segmentation, cpp

4.10 score, 25 stars, 8 scripts, 407 downloads, 10 exports, 1 dependency

Last updated 2 years ago from: 95f86c692b. Checks: 1 OK, 11 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 06 2025
R-4.5-win-x86_64	NOTE	Mar 06 2025
R-4.5-mac-x86_64	NOTE	Mar 06 2025
R-4.5-mac-aarch64	NOTE	Mar 06 2025
R-4.5-linux-x86_64	NOTE	Mar 06 2025
R-4.4-win-x86_64	NOTE	Mar 06 2025
R-4.4-mac-x86_64	NOTE	Mar 06 2025
R-4.4-mac-aarch64	NOTE	Mar 06 2025
R-4.4-linux-x86_64	NOTE	Mar 06 2025
R-4.3-win-x86_64	NOTE	Mar 06 2025
R-4.3-mac-x86_64	NOTE	Mar 06 2025
R-4.3-mac-aarch64	NOTE	Mar 06 2025

Exports: BPEembed, BPEembedder, read_word2vec, sentencepiece, sentencepiece_decode, sentencepiece_download_model, sentencepiece_encode, sentencepiece_load_model, txt_remove_, wordpiece_encode

Dependencies: Rcpp

Help page	Topics
Tokenise and embed text alongside a Sentencepiece and Word2vec model	BPEembed
Build a BPEembed model containing a Sentencepiece and Word2vec model	BPEembedder
Encode and Decode alongside a BPEembed model	predict.BPEembed
Read a word2vec embedding file	read_word2vec
Construct a Sentencepiece model	sentencepiece
Decode encoded sequences back to text	sentencepiece_decode
Download a Sentencepiece model	sentencepiece_download_model
Tokenise text alongside a Sentencepiece model	sentencepiece_encode
Load a Sentencepiece model	sentencepiece_load_model
Remove prefixed underscore	txt_remove_
Wordpiece encoding	wordpiece_encode
