Title: | Text Tokenization using Byte Pair Encoding and Unigram Modelling |
---|---|
Description: | Unsupervised text tokenizer for byte pair encoding and unigram language modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>. |
Authors: | Jan Wijffels [aut, cre, cph] (R wrapper), BNOSAC [cph] (R wrapper), Google Inc. [ctb, cph] (Files at src/sentencepiece/src (Apache License, Version 2.0)), The Abseil Authors [ctb, cph] (Files at src/third_party/absl (Apache License, Version 2.0)), Google Inc. [ctb, cph] (Files at src/third_party/protobuf-lite (BSD-3 License)), Kenton Varda (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h, google/protobuf/stubs/common.h, google/protobuf/stubs/hash.h, google/protobuf/stubs/once.h, google/protobuf/stubs/once.h.org (BSD-3 License)), Sanjay Ghemawat (Google Inc.) [ctb, cph] (Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)), Jeff Dean (Google Inc.) [ctb, cph] (Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)), Laszlo Csomor (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: io_win32.cc, google/protobuf/stubs/io_win32.h (BSD-3 License)), Wink Saville (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: message_lite.cc, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h (BSD-3 License)), Jim Meehan (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: structurally_valid.cc (BSD-3 License)), Chris Atenasio (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/wire_format_lite.h (BSD-3 License)), Jason Hsueh (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/io/coded_stream_inl.h (BSD-3 License)), Anton Carver (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/stubs/map_util.h (BSD-3 License)), Maxim Lifantsev (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/stubs/mathlimits.h (BSD-3 License)), Susumu Yata [ctb, cph] (Files at src/third_party/darts_clone (BSD-3 License)), Daisuke Okanohara [ctb, cph] (File src/third_party/esaxx/esa.hxx (MIT License)), Yuta Mori [ctb, cph] (File src/third_party/esaxx/sais.hxx (MIT License)), Benjamin Heinzerling [ctb, cph] (Files data/models/nl.wiki.bpe.vs1000.d25.w2v.txt, data/models/nl.wiki.bpe.vs1000.d25.w2v.bin and data/models/nl.wiki.bpe.vs1000.model (MIT License)) |
Maintainer: | Jan Wijffels <[email protected]> |
License: | MPL-2.0 |
Version: | 0.2.3 |
Built: | 2024-11-06 04:41:41 UTC |
Source: | https://github.com/bnosac/sentencepiece |
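As a quick orientation before the function reference below, here is a minimal sketch of the typical workflow, using the 'nl-fr-dekamer.model' file that ships with the package (the same model used in the examples further down):

library(sentencepiece)
## Load a pretrained model bundled with the package
model <- sentencepiece_load_model(system.file(package = "sentencepiece", "models",
                                              "nl-fr-dekamer.model"))
## Tokenise text into subwords and into the corresponding vocabulary ids
txt <- "De eigendomsoverdracht aan de deelstaten is ingewikkeld."
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")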
Use a sentencepiece model to tokenise text and get the embeddings of these tokens
BPEembed( file_sentencepiece = x$file_model, file_word2vec = x$glove.bin$file_model, x, normalize = TRUE )
file_sentencepiece |
the path to the file containing the sentencepiece model |
file_word2vec |
the path to the file containing the word2vec embeddings |
x |
the result of a call to sentencepiece_download_model; when provided, file_sentencepiece and file_word2vec default to the model and embedding files of that download |
normalize |
passed on to read.wordvectors |
an object of class BPEembed which is a list with elements
model: a sentencepiece model as loaded with sentencepiece_load_model
embedding: a matrix with embeddings as loaded with read.wordvectors
dim: the dimension of the embedding
n: the number of elements in the vocabulary
file_sentencepiece: the sentencepiece model file
file_word2vec: the word2vec embedding file
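To make this structure concrete, a small sketch inspecting these elements on an encoder built from the model files bundled with the package (the same files used in the examples below):

folder  <- system.file(package = "sentencepiece", "models")
encoder <- BPEembed(file_sentencepiece = file.path(folder, "nl.wiki.bpe.vs1000.model"),
                    file_word2vec      = file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin"))
encoder$dim              ## dimension of the embedding (25 for this model)
encoder$n                ## number of elements in the vocabulary
dim(encoder$embedding)   ## one row per vocabulary element, one column per dimension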
predict.BPEembed, sentencepiece_load_model, sentencepiece_download_model, read.wordvectors
##
## Example loading model from disk
##
folder    <- system.file(package = "sentencepiece", "models")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- file.path(folder, "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)
## Do tokenisation with the sentencepiece model + embed these
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
            "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")
str(values)
values
txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode")
txt <- lapply(values, FUN = rownames)
predict(encoder, txt, type = "decode")
Build a sentencepiece model on text and a matching word2vec model on the sentencepiece vocabulary
BPEembedder( x, tokenizer = c("bpe", "char", "unigram", "word"), args = list(vocab_size = 8000, coverage = 0.9999), ... )
x |
a data.frame with columns doc_id and text |
tokenizer |
character string with the type of sentencepiece tokenizer. Either 'bpe', 'char', 'unigram' or 'word' for Byte Pair Encoding, character-level encoding, unigram encoding or pretokenised word encoding. Defaults to 'bpe' (Byte Pair Encoding). Passed on to sentencepiece |
args |
a list of arguments passed on to sentencepiece |
... |
arguments passed on to word2vec |
an object of class BPEembed which is a list with elements
model: a sentencepiece model as loaded with sentencepiece_load_model
embedding: a matrix with embeddings as loaded with read.wordvectors
dim: the dimension of the embedding
n: the number of elements in the vocabulary
file_sentencepiece: the sentencepiece model file
file_word2vec: the word2vec embedding file
sentencepiece, word2vec, predict.BPEembed
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 10)
model
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")
values <- predict(model, txt, type = "encode")
Use the sentencepiece model to either
encode: tokenise and embed text
decode: get the untokenised text back from tokenised data
tokenize: only tokenise the text using the sentencepiece model, without embedding it
## S3 method for class 'BPEembed' predict(object, newdata, type = c("encode", "decode", "tokenize"), ...)
object |
an object of class BPEembed as returned by BPEembed |
newdata |
a character vector of text to encode or a character vector of encoded tokens to decode or a list of those |
type |
character string, either 'encode', 'decode' or 'tokenize' |
... |
further arguments passed on to the methods |
in case type is set to 'encode': a list of matrices containing embeddings of the text which is tokenised with sentencepiece_encode
in case type is set to 'decode': a character vector of decoded text as returned by sentencepiece_decode
in case type is set to 'tokenize': the tokenised text as returned by sentencepiece_encode
BPEembed, sentencepiece_decode, sentencepiece_encode
embedding <- system.file(package = "sentencepiece", "models", "nl.wiki.bpe.vs1000.d25.w2v.bin")
model     <- system.file(package = "sentencepiece", "models", "nl.wiki.bpe.vs1000.model")
encoder   <- BPEembed(model, embedding)
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
            "On est d'accord sur le prix de la biere?")
values <- predict(encoder, txt, type = "encode")
str(values)
values
txt <- rownames(values[[1]])
predict(encoder, txt, type = "decode")
txt <- lapply(values, FUN = rownames)
predict(encoder, txt, type = "decode")
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
predict(encoder, txt, type = "tokenize", "subwords")
predict(encoder, txt, type = "tokenize", "ids")
Read a word2vec embedding file as a dense matrix. This uses read.wordvectors
from the word2vec package.
read_word2vec( x, type = c("txt", "bin"), n = .Machine$integer.max, encoding = "UTF-8", normalize = TRUE )
x |
path to the file |
type |
either 'bin' or 'txt' indicating the format of the file |
n |
integer, indicating to limit the number of words to read in. Defaults to reading all words. |
encoding |
encoding to be assumed for the words. Defaults to 'UTF-8' |
normalize |
logical indicating to normalize the embeddings by dividing by the factor sqrt(sum(x^2) / length(x)). Defaults to TRUE. |
a matrix with one row per token containing the embedding of the token
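The normalisation factor can be reproduced by hand; a minimal sketch on a toy vector (the numbers are illustrative only):

v <- c(0.3, -0.2, 0.5, 0.1)       ## one raw embedding row
f <- sqrt(sum(v^2) / length(v))   ## the factor sqrt(sum(x^2) / length(x))
v / f                             ## the normalised row stored when normalize = TRUE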
folder    <- system.file(package = "sentencepiece", "models")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
embedding <- read_word2vec(embedding, type = "bin")
head(embedding)
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.txt")
embedding <- read_word2vec(embedding, type = "txt")
head(embedding, n = 10)
Construct a Sentencepiece model on text.
sentencepiece( x, type = c("bpe", "char", "unigram", "word"), vocab_size = 8000, coverage = 0.9999, model_prefix = "sentencepiece", model_dir = tempdir(), threads = 1L, args, verbose = FALSE )
x |
a character vector of path(s) to the text files containing training data |
type |
either one of 'bpe', 'char', 'unigram' or 'word' for Byte Pair Encoding, Character level encoding, Unigram encoding or pretokenised word encoding. Defaults to 'bpe' (Byte Pair Encoding). |
vocab_size |
integer indicating the number of tokens in the final vocabulary. Defaults to 8000. |
coverage |
fraction of characters covered by the model. Must be in the range [0, 1]. Defaults to 0.9999, which is a good value to use. |
model_prefix |
character string with the name of the model. Defaults to 'sentencepiece'.
When executing the function, 2 files will be created in the directory specified by model_dir: the model itself (model_prefix.model) and its vocabulary (model_prefix.vocab) |
model_dir |
directory where the model will be saved. Defaults to the temporary directory (tempdir()) |
threads |
integer indicating number of threads to use when building the model |
args |
character string with arguments passed on to sentencepiece::SentencePieceTrainer::Train (for expert use only) |
verbose |
logical indicating to show progress of sentencepiece training. Defaults to FALSE. |
an object of class sentencepiece, as defined at sentencepiece_load_model
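With the default model_prefix, the two files created by training land at predictable paths; a small sketch, assuming a training file 'traindata.txt' as created in the example below (the .vocab filename follows the standard sentencepiece naming convention):

model_dir <- tempdir()
model <- sentencepiece("traindata.txt", type = "bpe", vocab_size = 4000, model_dir = model_dir)
file.exists(file.path(model_dir, "sentencepiece.model"))   ## the trained model
file.exists(file.path(model_dir, "sentencepiece.vocab"))   ## its vocabulary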
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
path   <- "traindata.txt"
folder <- getwd()
writeLines(belgium_parliament$text, con = path)
model <- sentencepiece(path, type = "char", model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "unigram", vocab_size = 20000, model_dir = folder, verbose = TRUE)
model <- sentencepiece(path, type = "bpe", vocab_size = 4000, model_dir = folder, verbose = TRUE)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
model <- sentencepiece_load_model(file.path(folder, "sentencepiece.model"))
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
Decode a sequence of Sentencepiece ids into text again
sentencepiece_decode(model, x)
model |
an object of class sentencepiece, as returned by sentencepiece_load_model |
x |
an integer vector of Sentencepiece ids or a list of these |
a character vector of detokenised text or, if you encoded with nbest, a list of these
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
x <- sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_decode(model, x)
x <- sentencepiece_encode(model, x = txt, type = "ids")
sentencepiece_decode(model, x)
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
x <- sentencepiece_encode(model, x = txt, type = "subwords", nbest = 3)
sentencepiece_decode(model, x)
x <- sentencepiece_encode(model, x = txt, type = "subwords", nbest = 3, alpha = 0.1)
sentencepiece_decode(model, x)
Download pretrained models built on Wikipedia, made available at https://bpemb.h-its.org through https://github.com/bheinzerling/bpemb. These consist of Byte Pair Encoding models trained with sentencepiece as well as Glove embeddings of the resulting Byte Pair subwords. Models for 275 languages are available.
sentencepiece_download_model( language, vocab_size, dim, model_dir = system.file(package = "sentencepiece", "models") )
language |
a character string with the language name. This can be either a plain language or a wikipedia shorthand. |
vocab_size |
integer indicating the number of tokens in the final vocabulary. Defaults to 5000. Possible values depend on the language. To inspect possible values, type sentencepiece:::.bpemb$vocab_sizes and look for the language of your choice. |
dim |
dimension of the embedding. Either 25, 50, 100, 200 or 300. |
model_dir |
path to the location where the model will be downloaded to. Defaults to system.file(package = "sentencepiece", "models") |
a list with elements
language: the provided language
wikicode: the wikipedia code of the provided language
file_model: the path to the downloaded Sentencepiece model
url: the url where the Sentencepiece model was fetched from
download_failed: logical, indicating if the download failed
download_message: a character string with possible download failure information
glove: a list with elements file_model, url, download_failed and download_message indicating the path to the Glove embeddings in txt format. Only present if the dim argument is provided in the function call; otherwise the embeddings are not downloaded
glove.bin: a list with elements file_model, url, download_failed and download_message indicating the path to the Glove embeddings in bin format. Only present if the dim argument is provided in the function call; otherwise the embeddings are not downloaded
path <- getwd()
##
## Download only the tokeniser model
##
dl <- sentencepiece_download_model("Russian", vocab_size = 50000, model_dir = path)
dl <- sentencepiece_download_model("English", vocab_size = 100000, model_dir = path)
dl <- sentencepiece_download_model("French", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("multi", vocab_size = 320000, model_dir = path)
dl <- sentencepiece_download_model("Vlaams", vocab_size = 1000, model_dir = path)
dl <- sentencepiece_download_model("Dutch", vocab_size = 25000, model_dir = path)
dl <- sentencepiece_download_model("nl", vocab_size = 25000, model_dir = path)
str(dl)
model <- sentencepiece_load_model(dl$file_model)
##
## Download the tokeniser model + Glove embeddings of Byte Pairs
##
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 50, model_dir = path)
str(dl)
model     <- sentencepiece_load_model(dl$file_model)
embedding <- read_word2vec(dl$glove$file_model)
dl <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 25, model_dir = tempdir())
str(dl)
Tokenise text using a Sentencepiece model
sentencepiece_encode( model, x, type = c("subwords", "ids"), nbest = -1L, alpha = 0.1 )
model |
an object of class sentencepiece, as returned by sentencepiece_load_model |
x |
a character vector of text (in UTF-8 Encoding) |
type |
a character string, either 'subwords' or 'ids' to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'. |
nbest |
integer indicating the number of segmentations to extract. See the details. The argument is not used if you do not provide a value for it. |
alpha |
smoothing parameter to perform subword regularisation. Typical values are 0.1, 0.2 or 0.5. See the details. The argument is not used if you do not provide a value for it or do not provide a value for nbest |
If you specify alpha to perform subword regularisation, keep in mind the following.
When alpha is 0.0, one segmentation is uniformly sampled from the nbest list or lattice.
The best Viterbi segmentation is more likely sampled when setting larger alpha values like 0.1.
If you provide a positive value for nbest, the function approximately samples one segmentation from the nbest candidates.
If you provide a negative value for nbest, it samples one segmentation from the hypotheses (lattice) according to the generation probabilities, using the forward-filtering and backward-sampling algorithm.
nbest and alpha correspond respectively to the parameters l and alpha in the paper https://arxiv.org/abs/1804.10959 (nbest < 0 means l = infinity).
If the model is a BPE model, alpha is the merge probability p explained in https://arxiv.org/abs/1910.13267. In a BPE model, nbest-based sampling is not supported, so the nbest parameter is ignored although it still needs to be provided if you want to make use of alpha.
a list with tokenised text, one for each element of x, unless you provide nbest without providing alpha, in which case the result is a list of lists of nbest tokenised texts
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
## Examples using subword regularisation
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)
txt <- c("Goed zo", "On est d'accord")
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 2)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 2)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0)
Load a Sentencepiece model which either was trained with sentencepiece
or which you have found in the wild.
sentencepiece_load_model(file = "sentencepiece.model")
file |
path to the file containing the Sentencepiece model |
an object of class sentencepiece which is a list with elements
model: an Rcpp pointer to the model
model_path: the path to the model
vocab_size: the size of the Sentencepiece vocabulary
vocabulary: the Sentencepiece vocabulary which is a data.frame with columns id and subword
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)
txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")
Remove the prefixed underscore: the Unicode character 'LOWER ONE EIGHTH BLOCK' (U+2581) which sentencepiece uses to mark the start of a word
txt_remove_(x, replacement = "")
x |
a character vector |
replacement |
character string how to replace the underscore. Defaults to the empty string. |
x where the prefixed underscore is removed
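In practice the U+2581 marker shows up on subwords produced by sentencepiece_encode; a small sketch combining the two, using the bundled model from the other examples:

model  <- sentencepiece_load_model(system.file(package = "sentencepiece", "models",
                                               "nl-fr-dekamer.model"))
tokens <- sentencepiece_encode(model, x = "De eigendomsoverdracht", type = "subwords")
lapply(tokens, txt_remove_)   ## strip the word-boundary marker from each subword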
x <- c("\u2581word", "hello", "_regularunderscore") x txt_remove_(x)
x <- c("\u2581word", "hello", "_regularunderscore") x txt_remove_(x)
Wordpiece encoding, useful for BERT-style tokenisation. Experimental version mimicking the class WordpieceTokenizer from https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
wordpiece_encode( x, vocabulary = character(), type = c("subwords", "ids"), unk_token = "[UNK]", max_input_chars_per_word = 100L )
x |
a character vector with text which can be split on whitespace to obtain words |
vocabulary |
a character vector of the vocabulary |
type |
a character string, either 'subwords' or 'ids' to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'. |
unk_token |
character string with a value for a token which is not part of the vocabulary. Defaults to '[UNK]' |
max_input_chars_per_word |
integer. A word which is longer than this specified number of characters will be set to the unknown token. |
a list of subword tokens
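The procedure behind this is greedy longest-match-first tokenisation: repeatedly take the longest vocabulary entry that matches the start of the remaining characters, prefixing continuation pieces with '##', and fall back to the unknown token for the whole word when no piece matches. A minimal pure-R sketch of that idea, not the package's implementation (it omits the max_input_chars_per_word check and the id lookup, and the function name wordpiece_one is hypothetical):

wordpiece_one <- function(word, vocabulary, unk_token = "[UNK]") {
  pieces <- character(0)
  start  <- 1
  while (start <= nchar(word)) {
    end   <- nchar(word)
    piece <- NA_character_
    while (end >= start) {                      ## try the longest candidate first
      candidate <- substr(word, start, end)
      if (start > 1) candidate <- paste0("##", candidate)
      if (candidate %in% vocabulary) { piece <- candidate; break }
      end <- end - 1
    }
    if (is.na(piece)) return(unk_token)         ## no match: the whole word is unknown
    pieces <- c(pieces, piece)
    start  <- end + 1
  }
  pieces
}
wordpiece_one("unaffable", vocabulary = c("un", "##aff", "##able"))
## "un" "##aff" "##able"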
wordpiece_encode("unaffable", vocabulary = c("un", "##aff", "##able")) wordpiece_encode(x = c("unaffable", "unaffableun"), vocabulary = c("un", "##aff", "##able")) wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"), vocabulary = c("un", "##aff", "##able", "##un")) wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"), vocabulary = c("un", "##aff", "##able", "##un"), type = "ids")
wordpiece_encode("unaffable", vocabulary = c("un", "##aff", "##able")) wordpiece_encode(x = c("unaffable", "unaffableun"), vocabulary = c("un", "##aff", "##able")) wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"), vocabulary = c("un", "##aff", "##able", "##un")) wordpiece_encode(x = c("unaffable", "unaffableun", "unknown territory"), vocabulary = c("un", "##aff", "##able", "##un"), type = "ids")