Package: sentencepiece 0.2.5
sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling
Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
Authors:
sentencepiece_0.2.5.tar.gz
sentencepiece_0.2.5.zip (r-4.7) | sentencepiece_0.2.5.zip (r-4.6) | sentencepiece_0.2.5.zip (r-4.5)
sentencepiece_0.2.5.tgz (r-4.6-x86_64) | sentencepiece_0.2.5.tgz (r-4.6-arm64) | sentencepiece_0.2.5.tgz (r-4.5-x86_64) | sentencepiece_0.2.5.tgz (r-4.5-arm64)
sentencepiece_0.2.5.tar.gz (r-4.7-arm64) | sentencepiece_0.2.5.tar.gz (r-4.7-x86_64) | sentencepiece_0.2.5.tar.gz (r-4.6-arm64) | sentencepiece_0.2.5.tar.gz (r-4.6-x86_64)
sentencepiece_0.2.5.tgz (r-4.6-emscripten)
manual.pdf | manual.html
card.svg | card.png
sentencepiece/json (API)
NEWS
# Install 'sentencepiece' in R:
install.packages('sentencepiece', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/bnosac/sentencepiece/issues
byte, natural-language-processing, sentencepiece, word-segmentation, cpp
Last updated from: 7e67419ddb. Checks: 13 OK. Indexed: yes.
| Target | Result | Time |
|---|---|---|
| linux-devel-arm64 | OK | 284 |
| linux-devel-x86_64 | OK | 269 |
| source / vignettes | OK | 234 |
| linux-release-arm64 | OK | 278 |
| linux-release-x86_64 | OK | 266 |
| macos-release-arm64 | OK | 270 |
| macos-release-x86_64 | OK | 440 |
| macos-oldrel-arm64 | OK | 222 |
| macos-oldrel-x86_64 | OK | 558 |
| windows-devel | OK | 486 |
| windows-release | OK | 432 |
| windows-oldrel | OK | 479 |
| wasm-release | OK | 171 |
Exports: BPEembed, BPEembedder, read_word2vec, sentencepiece, sentencepiece_decode, sentencepiece_download_model, sentencepiece_encode, sentencepiece_load_model, txt_remove_, wordpiece_encode
Dependencies: Rcpp
Readme and manuals
Help Manual
| Help page | Topics |
|---|---|
| Tokenise and embed text alongside a Sentencepiece and Word2vec model | BPEembed |
| Build a BPEembed model containing a Sentencepiece and Word2vec model | BPEembedder |
| Encode and Decode alongside a BPEembed model | predict.BPEembed |
| Read a word2vec embedding file | read_word2vec |
| Construct a Sentencepiece model | sentencepiece |
| Decode encoded sequences back to text | sentencepiece_decode |
| Download a Sentencepiece model | sentencepiece_download_model |
| Tokenise text alongside a Sentencepiece model | sentencepiece_encode |
| Load a Sentencepiece model | sentencepiece_load_model |
| Remove prefixed underscore | txt_remove_ |
| Wordpiece encoding | wordpiece_encode |
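
The help topics above can be strung together into a short train, encode, decode round trip. The following is a minimal sketch rather than an excerpt from the package documentation: the toy corpus, the temporary paths, and the values `vocab_size = 80` and `coverage = 1` are assumptions chosen for illustration. It assumes the argument names `type`, `vocab_size`, `coverage`, `model_dir`, `threads` and `verbose` of `sentencepiece()`, and that the trained model is written as `sentencepiece.model` in `model_dir`. On a corpus this small the vocabulary size may need adjusting if training reports that it is too high.

```r
# Minimal sketch: train a small BPE model, then encode and decode text.
library(sentencepiece)

# Hypothetical toy corpus, one sentence per line; real training data
# would be far larger.
corpus <- c(
  "the quick brown fox jumps over the lazy dog",
  "a lazy dog sleeps under the warm sun",
  "the brown fox is quick and very clever",
  "dogs and foxes live in the green forest",
  "the sun warms the quiet forest every morning",
  "a clever fox jumps over the sleeping dog",
  "every morning the dog walks in the forest",
  "quick brown foxes jump over lazy dogs",
  "the quiet dog sleeps while the fox jumps",
  "warm mornings make the lazy dog very sleepy"
)

# Write the corpus to a temporary file and create a folder for the model.
train_file <- tempfile(fileext = ".txt")
writeLines(corpus, con = train_file)
model_dir <- tempfile("spm_")
dir.create(model_dir)

# Train a byte pair encoding model (assumed small vocabulary for the toy data).
model <- sentencepiece(train_file, type = "bpe", vocab_size = 80,
                       coverage = 1, model_dir = model_dir,
                       threads = 1, verbose = FALSE)

# Reload the saved model from disk, as one would in a later session.
model <- sentencepiece_load_model(file = file.path(model_dir, "sentencepiece.model"))

# Encode new text as subword strings and as integer ids.
txt <- c("the lazy dog sleeps", "a quick brown fox")
subwords <- sentencepiece_encode(model, x = txt, type = "subwords")
ids <- sentencepiece_encode(model, x = txt, type = "ids")

# Decode the ids back to text.
decoded <- sentencepiece_decode(model, ids)

print(subwords)
print(decoded)
```

Encoding to `"subwords"` is convenient for inspecting how text is split, while `"ids"` is the form usually passed on to a downstream model.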
