Package: sentencepiece 0.2.5

Jan Wijffels

sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling

Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.

Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], Google Inc. [ctb, cph], Kenton Varda [ctb, cph], Sanjay Ghemawat [ctb, cph], Jeff Dean [ctb, cph], Laszlo Csomor [ctb, cph], Wink Saville [ctb, cph], Jim Meehan [ctb, cph], Chris Atenasio [ctb, cph], Jason Hsueh [ctb, cph], Anton Carver [ctb, cph], Maxim Lifantsev [ctb, cph], Susumu Yata [ctb, cph], Yuta Mori [ctb, cph], Benjamin Heinzerling [ctb, cph]

sentencepiece_0.2.5.tar.gz
sentencepiece_0.2.5.zip (r-4.7), sentencepiece_0.2.5.zip (r-4.6), sentencepiece_0.2.5.zip (r-4.5)
sentencepiece_0.2.5.tgz (r-4.6-x86_64), sentencepiece_0.2.5.tgz (r-4.6-arm64), sentencepiece_0.2.5.tgz (r-4.5-x86_64), sentencepiece_0.2.5.tgz (r-4.5-arm64)
sentencepiece_0.2.5.tar.gz (r-4.7-arm64), sentencepiece_0.2.5.tar.gz (r-4.7-x86_64), sentencepiece_0.2.5.tar.gz (r-4.6-arm64), sentencepiece_0.2.5.tar.gz (r-4.6-x86_64)
sentencepiece_0.2.5.tgz (r-4.6-emscripten)
manual.pdf | manual.html
card.svg | card.png
sentencepiece/json (API)
NEWS

# Install 'sentencepiece' in R:
install.packages('sentencepiece', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))
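
Once installed, the pretrained byte pair encoding models mentioned in the description can be tried out along these lines. This is a minimal sketch, not the package's canonical workflow: it assumes internet access for the download, and the language ("Dutch"), the vocabulary size (5000), the example sentence, and the use of tempdir() as the model directory are illustrative choices.

```r
library(sentencepiece)

# Assumption: internet access is available to fetch a pretrained BPEmb model.
# "Dutch" and vocab_size = 5000 are arbitrary illustrative choices.
dl <- sentencepiece_download_model("Dutch", vocab_size = 5000,
                                   model_dir = tempdir())

# Load the downloaded SentencePiece model file
model <- sentencepiece_load_model(dl$file_model)

txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")

# Split the text into subword pieces, or map it to integer vocabulary ids
pieces <- sentencepiece_encode(model, txt, type = "subwords")
ids    <- sentencepiece_encode(model, txt, type = "ids")

# Map the ids back to text
decoded <- sentencepiece_decode(model, ids)
```

Both encode calls return a list with one element per input text, so the pieces and ids for the first sentence are in pieces[[1]] and ids[[1]].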

Bug tracker: https://github.com/bnosac/sentencepiece/issues

Uses libs:
  • c++ – GNU Standard C++ Library v3

byte, natural-language-processing, sentencepiece, word-segmentation, cpp

4.50 score, 29 stars, 11 scripts, 318 downloads, 10 exports, 1 dependency

Last updated from: 7e67419ddb. Checks: 13 OK. Indexed: yes.

Target                 Result  Time
linux-devel-arm64      OK      284
linux-devel-x86_64     OK      269
source / vignettes     OK      234
linux-release-arm64    OK      278
linux-release-x86_64   OK      266
macos-release-arm64    OK      270
macos-release-x86_64   OK      440
macos-oldrel-arm64     OK      222
macos-oldrel-x86_64    OK      558
windows-devel          OK      486
windows-release        OK      432
windows-oldrel         OK      479
wasm-release           OK      171

Exports: BPEembed, BPEembedder, read_word2vec, sentencepiece, sentencepiece_decode, sentencepiece_download_model, sentencepiece_encode, sentencepiece_load_model, txt_remove_, wordpiece_encode
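
Among the exports, wordpiece_encode can be tried without downloading or training a model, since it takes its vocabulary directly as a character vector. The sketch below uses a toy vocabulary made up for illustration, with "##" marking word-internal pieces in the BERT style. It is a small demonstration under those assumptions, not a description of how the function is meant to be used in production.

```r
library(sentencepiece)

# Toy vocabulary, made up for illustration; "##" marks a word-internal piece
vocab <- c("un", "##aff", "##able")

# Split a word into the pieces found in the vocabulary
pieces <- wordpiece_encode("unaffable", vocabulary = vocab)

# The same split, expressed as positions in the vocabulary
piece_ids <- wordpiece_encode("unaffable", vocabulary = vocab, type = "ids")
```

The result is a list with one element per input string, so pieces[[1]] holds the pieces for "unaffable".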

Dependencies: Rcpp