Package: sentencepiece 0.2.3
sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling
Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
sentencepiece_0.2.3.tar.gz
sentencepiece_0.2.3.zip (r-4.5), sentencepiece_0.2.3.zip (r-4.4), sentencepiece_0.2.3.zip (r-4.3)
sentencepiece_0.2.3.tgz (r-4.4-x86_64), sentencepiece_0.2.3.tgz (r-4.4-arm64), sentencepiece_0.2.3.tgz (r-4.3-x86_64), sentencepiece_0.2.3.tgz (r-4.3-arm64)
sentencepiece_0.2.3.tar.gz (r-4.5-noble), sentencepiece_0.2.3.tar.gz (r-4.4-noble)
sentencepiece_0.2.3.tgz (r-4.4-emscripten), sentencepiece_0.2.3.tgz (r-4.3-emscripten)
sentencepiece.pdf | sentencepiece.html
sentencepiece/json (API)
NEWS
# Install 'sentencepiece' in R:
install.packages('sentencepiece', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/bnosac/sentencepiece/issues
Topics: byte, natural-language-processing, sentencepiece, word-segmentation
Last updated 2 years ago from: 95f86c692b. Checks: OK: 1, NOTE: 8. Indexed: yes.
Target | Result | Date |
---|---|---|
Doc / Vignettes | OK | Nov 06 2024 |
R-4.5-win-x86_64 | NOTE | Nov 06 2024 |
R-4.5-linux-x86_64 | NOTE | Nov 06 2024 |
R-4.4-win-x86_64 | NOTE | Nov 06 2024 |
R-4.4-mac-x86_64 | NOTE | Nov 06 2024 |
R-4.4-mac-aarch64 | NOTE | Nov 06 2024 |
R-4.3-win-x86_64 | NOTE | Nov 06 2024 |
R-4.3-mac-x86_64 | NOTE | Nov 06 2024 |
R-4.3-mac-aarch64 | NOTE | Nov 06 2024 |
Exports: BPEembed, BPEembedder, read_word2vec, sentencepiece, sentencepiece_decode, sentencepiece_download_model, sentencepiece_encode, sentencepiece_load_model, txt_remove_, wordpiece_encode
Dependencies: Rcpp
Readme and manuals
Help Manual
Help page | Topics |
---|---|
Tokenise and embed text alongside a Sentencepiece and Word2vec model | BPEembed |
Build a BPEembed model containing a Sentencepiece and Word2vec model | BPEembedder |
Encode and Decode alongside a BPEembed model | predict.BPEembed |
Read a word2vec embedding file | read_word2vec |
Construct a Sentencepiece model | sentencepiece |
Decode encoded sequences back to text | sentencepiece_decode |
Download a Sentencepiece model | sentencepiece_download_model |
Tokenise text alongside a Sentencepiece model | sentencepiece_encode |
Load a Sentencepiece model | sentencepiece_load_model |
Remove prefixed underscore | txt_remove_ |
Wordpiece encoding | wordpiece_encode |