Title: | Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit |
---|---|
Description: | This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionalities for commonly used data manipulations on texts which are enriched with the output of the parser. Namely functionalities and algorithms for collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis. |
Authors: | Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph] |
Maintainer: | Jan Wijffels <[email protected]> |
License: | MPL-2.0 |
Version: | 0.8.11 |
Built: | 2024-11-06 04:59:41 UTC |
Source: | https://github.com/bnosac/udpipe |
If you have a data.frame with annotations containing 1 row per token, you can convert it to CoNLL-U format with this function. The data frame is required to have the following columns: doc_id, sentence_id, sentence, token_id, token and optionally has the following columns: lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc. These fields have the following meaning:
doc_id: the identifier of the document
sentence_id: the identifier of the sentence
sentence: the text of the sentence which this token is part of
token_id: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
token: Word form or punctuation symbol.
lemma: Lemma or stem of word form.
upos: Universal part-of-speech tag.
xpos: Language-specific part-of-speech tag; underscore if not available.
feats: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
head_token_id: Head of the current word, which is either a value of token_id or zero (0).
dep_rel: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
deps: Enhanced dependency graph in the form of a list of head-deprel pairs.
misc: Any other annotation.
The tokens in the data.frame should be ordered as they appear in the sentence.
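To make the required columns concrete, below is a minimal sketch (not one of the package examples; the tokens and tags are made up) which builds such a data.frame by hand for one two-token sentence and converts it; the optional fields are left out and are assumed to be NA as described above.
library(udpipe)
## one document with a single two-token sentence; optional fields (lemma, xpos, feats, ...)
## are omitted and are assumed to be NA
x <- data.frame(doc_id      = c("doc1", "doc1"),
                sentence_id = c(1, 1),
                sentence    = c("Hello world", "Hello world"),
                token_id    = c(1, 2),
                token       = c("Hello", "world"),
                upos        = c("INTJ", "NOUN"),
                stringsAsFactors = FALSE)
cat(as_conllu(x))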
as_conllu(x)
x |
a data.frame with columns doc_id, sentence_id, sentence, token_id, token, lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc |
a character string of length 1 containing the data.frame in CoNLL-U format. See the example. You can easily save this to disk for processing in other applications.
https://universaldependencies.org/format.html
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu") x <- udpipe_read_conllu(file_conllu) str(x) conllu <- as_conllu(x) cat(conllu) ## Not run: ## Write it to file, making sure it is in UTF-8 cat(as_conllu(x), file = file("annotations.conllu", encoding = "UTF-8")) ## End(Not run) ## Some fields are not mandatory, they will be assumed to be NA conllu <- as_conllu(x[, c('doc_id', 'sentence_id', 'sentence', 'token_id', 'token', 'upos')]) cat(conllu)
Use this function to convert the cells of a matrix to a co-occurrence data.frame containing fields term1, term2 and cooc where each row of the resulting data.frame contains the value of a cell in the matrix if the cell is not empty.
as_cooccurrence(x)
x |
a matrix or sparseMatrix |
a data.frame with columns term1, term2 and cooc where the data in cooc contain the content of the cells in the matrix for the combination of term1 and term2
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, language == "nl") dtm <- document_term_frequencies(x = x, document = "doc_id", term = "token") dtm <- document_term_matrix(dtm) correlation <- dtm_cor(dtm) cooc <- as_cooccurrence(correlation) head(cooc)
Fasttext prepends a label or different labels to text using a special string (__label__). This function takes a character vector of text and prepends the labels alongside the special string.
as_fasttext(x, y, label = "__label__")
x |
a character vector |
y |
a character vector of labels or a list of labels. |
label |
the string to prepend to the label. Defaults to __label__ |
a character vector of text where x and y are combined
as_fasttext(x = c("just a bit of txt", "example2", "more txt please", "more"), y = c("pos", "neg", "neg", NA)) as_fasttext(x = c("just a bit of txt", "example2", "more txt please", "more"), y = list(c("ok", "pos"), c("neg", "topic2"), "", NA))
as_fasttext(x = c("just a bit of txt", "example2", "more txt please", "more"), y = c("pos", "neg", "neg", NA)) as_fasttext(x = c("just a bit of txt", "example2", "more txt please", "more"), y = list(c("ok", "pos"), c("neg", "topic2"), "", NA))
Noun phrases are of common interest when doing natural language processing. Extracting noun phrases
from text can be done easily by defining a sequence of Parts of Speech tags. For example this sequence of POS tags
can be seen as a noun phrase: Adjective, Noun, Preposition, Noun.
This function recodes Universal POS tags to one of the following 1-letter tags, in order to simplify writing regular expressions
to find Parts of Speech sequences:
A: adjective
C: coordinating conjunction
D: determiner
M: modifier of verb
N: noun or proper noun
P: preposition
O: other elements
A simple noun phrase can then be identified by using the following regular expression: (A|N)*N(P+D*(A|N)*N)*, which basically says: zero or more adjectives or nouns followed by a noun, optionally followed by groups consisting of one or more prepositions, determiners, adjectives or nouns which again end with a noun.
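As an illustration of how the 1-letter codes and this regular expression work together, the sketch below recodes the Penn Treebank tags of the brussels_reviews_anno dataset and extracts candidate noun phrases; it assumes the keywords_phrases function of this package for the regular expression matching, so treat it as a sketch rather than the canonical usage.
library(udpipe)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "fr")
## recode the Penn Treebank tags (xpos) to the 1-letter codes described above
x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank")
## keep token sequences whose 1-letter codes match the noun phrase regular expression
np <- keywords_phrases(x = x$phrase_tag, term = x$token,
                       pattern = "(A|N)*N(P+D*(A|N)*N)*",
                       is_regex = TRUE, detailed = FALSE)
head(np)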
as_phrasemachine(x, type = c("upos", "penn-treebank"))
x |
a character vector of POS tags for example by using |
type |
either 'upos' or 'penn-treebank' indicating to recode Universal Parts of Speech tags to the counterparts as described in the description, or to recode Parts of Speech tags as known in the Penn Treebank to the counterparts as described in the description |
For more information on extracting phrases see http://brenocon.com/handler2016phrases.pdf
the character vector x where the respective POS tags are replaced with one-letter tags
x <- c("PROPN", "SCONJ", "ADJ", "NOUN", "VERB", "INTJ", "DET", "VERB", "PROPN", "AUX", "NUM", "NUM", "X", "SCONJ", "PRON", "PUNCT", "ADP", "X", "PUNCT", "AUX", "PROPN", "ADP", "X", "PROPN", "ADP", "DET", "CCONJ", "INTJ", "NOUN", "PROPN") as_phrasemachine(x)
x <- c("PROPN", "SCONJ", "ADJ", "NOUN", "VERB", "INTJ", "DET", "VERB", "PROPN", "AUX", "NUM", "NUM", "X", "SCONJ", "PRON", "PUNCT", "ADP", "X", "PUNCT", "AUX", "PROPN", "ADP", "X", "PROPN", "ADP", "DET", "CCONJ", "INTJ", "NOUN", "PROPN") as_phrasemachine(x)
The word2vec format gives on the first line the dimensions of the word vectors, followed by one line per word or token containing the elements of its word vector.
The function is basically a utility function which allows one to write word vectors created with other R packages in the well-known word2vec format, which is used by udpipe_train to train the dependency parser.
as_word2vec(x)
x |
a matrix with word vectors where the rownames indicate the word or token and the number of columns of the matrix indicates the size of the word vector |
a character string of length 1 containing the word vectors in word2vec format which can be written to a file on disk
wordvectors <- matrix(rnorm(1000), nrow = 100, ncol = 10) rownames(wordvectors) <- sprintf("word%s", seq_len(nrow(wordvectors))) wv <- as_word2vec(wordvectors) cat(wv) f <- file(tempfile(fileext = ".txt"), encoding = "UTF-8") cat(wv, file = f) close(f)
Convert the result of udpipe_annotate to a tidy data frame
## S3 method for class 'udpipe_connlu' as.data.frame(x, ...)
x |
an object of class udpipe_connlu as returned by udpipe_annotate |
... |
currently not used |
a data.frame with columns doc_id, paragraph_id, sentence_id, sentence, token_id, token, lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc.
The columns paragraph_id and sentence_id are integers, the other fields are character data in UTF-8 encoding.
To get more information on these fields, visit https://universaldependencies.org/format.html or look at udpipe.
model <- udpipe_download_model(language = "dutch-lassysmall") if(!model$download_failed){ ud_dutch <- udpipe_load_model(model$file_model) txt <- c("Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt? Jazeker meneer", "Het gaat vooruit, het gaat verbazend goed vooruit") x <- udpipe_annotate(ud_dutch, x = txt) x <- as.data.frame(x) head(x) } ## cleanup for CRAN only - you probably want to keep your model if you have downloaded it if(file.exists(model$file_model)) file.remove(model$file_model)
Convert the result of cooccurrence to a sparse matrix.
## S3 method for class 'cooccurrence' as.matrix(x, ...)
x |
an object of class cooccurrence as returned by cooccurrence |
... |
not used |
a sparse matrix with in the rows and columns the terms and in the cells how many times the cooccurrence occurred
data(brussels_reviews_anno) ## By document, which lemma's co-occur x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr") x <- cooccurrence(x, group = "doc_id", term = "lemma") x <- as.matrix(x) dim(x) x[1:3, 1:3]
Brussels AirBnB address locations available at www.insideairbnb.com
More information: http://insideairbnb.com/get-the-data.html
Data has been converted from UTF-8 to ASCII as in iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT") in order to comply with CRAN policies.
http://insideairbnb.com/brussels: information of 2015-10-03
brussels_reviews, brussels_reviews_anno
data(brussels_listings) head(brussels_listings)
Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com
More information: http://insideairbnb.com/get-the-data.html.
The data contains 500 reviews in Spanish, 500 reviews in French and 500 reviews in Dutch.
The data frame contains the field id (unique), the field listing_id which corresponds to the listing_id of the brussels_listings dataset, and the text fields feedback and language (identified with the cld2 package)
Data has been converted from UTF-8 to ASCII as in iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT") in order to comply with CRAN policies.
http://insideairbnb.com/brussels: information of 2015-10-03
brussels_listings, brussels_reviews_anno
data(brussels_reviews) str(brussels_reviews) head(brussels_reviews)
Reviews of the AirBnB customers, which are tokenised, POS tagged and lemmatised.
The data contains 1 row per document/token and contains the fields
doc_id, language, sentence_id, token_id, token, lemma, xpos.
Data has been converted from UTF-8 to ASCII as in iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT") in order to comply with CRAN policies.
http://insideairbnb.com/brussels: information of 2015-10-03
brussels_reviews, brussels_listings
## brussels_reviews_anno data(brussels_reviews_anno) head(brussels_reviews_anno) sort(table(brussels_reviews_anno$xpos)) ## Not run: ## ## If you want to construct a similar dataset as the ## brussels_reviews_anno dataset based on the udpipe library, do as follows ## library(udpipe) library(data.table) data(brussels_reviews) ## The brussels_reviews contains comments on Airbnb sites in 3 languages: es, fr and nl table(brussels_reviews$language) bxl_anno <- split(brussels_reviews, brussels_reviews$language) ## Annotate the Spanish comments m <- udpipe_download_model(language = "spanish-ancora") m <- udpipe_load_model(file = m$file_model) bxl_anno$es <- udpipe_annotate(object = m, x = bxl_anno$es$feedback, doc_id = bxl_anno$es$id) ## Annotate the French comments m <- udpipe_download_model(language = "french-partut") m <- udpipe_load_model(file = m$file_model) bxl_anno$fr <- udpipe_annotate(object = m, x = bxl_anno$fr$feedback, doc_id = bxl_anno$fr$id) ## Annotate the Dutch comments m <- udpipe_download_model(language = "dutch-lassysmall") m <- udpipe_load_model(file = m$file_model) bxl_anno$nl <- udpipe_annotate(object = m, x = bxl_anno$nl$feedback, doc_id = bxl_anno$nl$id) brussels_reviews_anno <- lapply(bxl_anno, as.data.frame) brussels_reviews_anno <- rbindlist(brussels_reviews_anno) str(brussels_reviews_anno) ## End(Not run)
A simple 10-dimensional example matrix of word embeddings trained on the Dutch lemmas of the dataset brussels_reviews_anno
data(brussels_reviews_w2v_embeddings_lemma_nl) head(brussels_reviews_w2v_embeddings_lemma_nl)
Annotated results of udpipe_annotate contain dependency parsing results which indicate how each word is linked to another word and what the relation between these 2 words is.
This information is available in the fields token_id, head_token_id and dep_rel, which indicate how each token is linked to its parent. The type of relation (dep_rel) is defined at https://universaldependencies.org/u/dep/index.html.
For example, in the text 'The economy is weak but the outlook is bright', the term economy is linked to weak as the term economy is the nominal subject of weak.
This function adds the parent or child information to the annotated data.frame.
cbind_dependencies( x, type = c("parent", "child", "parent_rowid", "child_rowid"), recursive = FALSE )
x |
a data.frame or data.table as returned by |
type |
either one of 'parent', 'child', 'parent_rowid', 'child_rowid'. Look to the return value section for more information on the difference in logic. Defaults to 'parent', indicating to add the information of the head_token_id to the dataset |
recursive |
in case when |
Note that the output of this function is experimental and might change in subsequent releases.
a data.frame/data.table in the same order as x where extra information is added on top, namely:
In case type is set to 'parent': the token/lemma/upos/xpos/feats information of the parent (head dependency) is added to the data.frame. See the examples.
In case type is set to 'child': the token/lemma/upos/xpos/feats/dep_rel information of all the children is put into a column called 'children' which is added to the data.frame. This is a list column where each list element is a data.table with the columns token/lemma/upos/xpos/dep_rel. See the examples.
In case type is set to 'parent_rowid': a new list column is added to x containing the row numbers within each combination of doc_id, paragraph_id, sentence_id which are parents of the token. In case recursive is set to TRUE the new column which is added to the data.frame is called parent_rowids, otherwise it is called parent_rowid. See the examples.
In case type is set to 'child_rowid': a new list column is added to x containing the row numbers within each combination of doc_id, paragraph_id, sentence_id which are children of the token. In case recursive is set to TRUE the new column which is added to the data.frame is called child_rowids, otherwise it is called child_rowid. See the examples.
## Not run: udmodel <- udpipe_download_model(language = "english-ewt") udmodel <- udpipe_load_model(file = udmodel$file_model) x <- udpipe_annotate(udmodel, x = "The economy is weak but the outlook is bright") x <- as.data.frame(x) x[, c("token_id", "token", "head_token_id", "dep_rel")] x <- cbind_dependencies(x, type = "parent") nominalsubject <- subset(x, dep_rel %in% c("nsubj")) nominalsubject <- nominalsubject[, c("dep_rel", "token", "token_parent")] nominalsubject x <- cbind_dependencies(x, type = "child") x <- cbind_dependencies(x, type = "parent_rowid") x <- cbind_dependencies(x, type = "parent_rowid", recursive = TRUE) x <- cbind_dependencies(x, type = "child_rowid") x <- cbind_dependencies(x, type = "child_rowid", recursive = TRUE) x lapply(x$child_rowid, FUN=function(i) x[sort(i), ]) ## End(Not run)
The result of udpipe_annotate which is put into a data.frame contains a field called feats with morphological features as defined at https://universaldependencies.org/u/feat/index.html.
If there are several of these features, they are concatenated with the | symbol.
This function extracts each of these morphological features separately and adds them as extra columns to the data.frame
cbind_morphological(x, term = "feats", which)
x |
a data.frame or data.table as returned by |
term |
the name of the field in x which contains the morphological features. Defaults to 'feats'. |
which |
a character vector with names of morphological features to uniquely parse out. These features are one of the 24 lexical and grammatical properties of words defined at https://universaldependencies.org/u/feat/index.html. Possible values are:
See the examples. |
x in the same order with extra columns added: at least the column has_morph is added, indicating whether any morphological features are present, as well as extra columns for each possible morphological feature in the data
## Not run: udmodel <- udpipe_download_model(language = "english-ewt") udmodel <- udpipe_load_model(file = udmodel$file_model) x <- udpipe_annotate(udmodel, x = "The economy is weak but the outlook is bright") x <- as.data.frame(x) x <- cbind_morphological(x, term = "feats") ## End(Not run) f <- system.file(package = "udpipe", "dummydata", "traindata.conllu") x <- udpipe_read_conllu(f) x <- cbind_morphological(x, term = "feats") f <- system.file(package = "udpipe", "dummydata", "traindata.conllu") x <- udpipe_read_conllu(f) x <- cbind_morphological(x, term = "feats", which = c("Mood", "Gender", "VerbForm", "Polarity", "Polite")) # extract all features from the feats column even if not present in the data f <- system.file(package = "udpipe", "dummydata", "traindata.conllu") x <- udpipe_read_conllu(f) x <- cbind_morphological(x, term = "feats", which = c("lexical", "inflectional_noun", "inflectional_verb"))
A cooccurrence data.frame indicates how many times each term co-occurs with another term.
There are 3 types of cooccurrences:
Looking at which words are located in the same document/sentence/paragraph.
Looking at which words are followed by another word
Looking at which words are in the neighbourhood of a word, namely which words follow the word within skipgram number of words
The output of the function gives a cooccurrence data.frame which contains the fields term1, term2 and cooc where cooc indicates how many times term1 and term2 co-occurred. This dataset can be constructed
based upon a data frame where you look within a group (column of the data.frame) if 2 terms occurred in that group.
based upon a vector of words in which case we look how many times each word is followed by another word.
based upon a vector of words in which case we look how many times each word is followed by another word or is followed by another word if we skip a number of words in between.
Note that
For cooccurrence.data.frame no ordering is assumed; this implies that the function does not return self-occurrences if a word occurs several times in the same group of text, and that term1 is always smaller than term2 in the output
For cooccurrence.character we assume the text is ordered from left to right; this function also returns self-occurrences
You can also aggregate cooccurrences if you decide to do any of these 3 by a certain group and then want to obtain an overall aggregate.
cooccurrence(x, order = TRUE, ...) ## S3 method for class 'character' cooccurrence( x, order = TRUE, ..., relevant = rep(TRUE, length(x)), skipgram = 0 ) ## S3 method for class 'cooccurrence' cooccurrence(x, order = TRUE, ...) ## S3 method for class 'data.frame' cooccurrence(x, order = TRUE, ..., group, term)
x |
either a character vector of terms, an object of class cooccurrence or a data.frame (see the methods below) |
order |
logical indicating if we need to sort the output from high cooccurrences to low cooccurrences. Defaults to TRUE. |
... |
other arguments passed on to the methods |
relevant |
a logical vector of the same length as |
skipgram |
integer of length 1, indicating how far in the neighbourhood to look for words. |
group |
character vector of columns in the data frame |
term |
character string of a column in the data frame |
a data.frame with columns term1, term2 and cooc indicating for the combination of term1 and term2 how many times this combination occurred
character: Create a cooccurrence data.frame based on a vector of terms
cooccurrence: Aggregate co-occurrence statistics by summing the cooc by term1/term2
data.frame: Create a cooccurrence data.frame based on a data.frame where you look within a document / sentence / paragraph / group if terms co-occur
data(brussels_reviews_anno) ## By document, which lemma's co-occur x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr") x <- cooccurrence(x, group = "doc_id", term = "lemma") head(x) ## Which words follow each other x <- c("A", "B", "A", "A", "B", "c") cooccurrence(x) data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, language == "es") x <- cooccurrence(x$lemma) head(x) x <- subset(brussels_reviews_anno, language == "es") x <- cooccurrence(x$lemma, relevant = x$xpos %in% c("NN", "JJ"), skipgram = 4) head(x) ## Which nouns follow each other in the same document library(data.table) x <- as.data.table(brussels_reviews_anno) x <- subset(x, language == "nl" & xpos %in% c("NN")) x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)] head(x) x_nodoc <- cooccurrence(x) x_nodoc <- subset(x_nodoc, term1 != "appartement" & term2 != "appartement") head(x_nodoc)
Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document
document_term_frequencies(x, document, ...) ## S3 method for class 'data.frame' document_term_frequencies( x, document = colnames(x)[1], term = colnames(x)[2], ... ) ## S3 method for class 'character' document_term_frequencies( x, document = paste("doc", seq_along(x), sep = ""), split = "[[:space:][:punct:][:digit:]]+", ... )
x |
a data.frame or data.table containing a field which can be considered
as a document (defaults to the first column in x) and a field which can be considered as a term (defaults to the second column in x) |
document |
If |
... |
further arguments passed on to the methods |
term |
If |
split |
The regular expression to be used if |
a data.table with columns doc_id, term, freq indicating how many times a term occurred in each document.
If freq occurred in the input dataset the resulting data will have summed the freq. If freq is not in the dataset, it is assumed that freq is 1 for each row in the input dataset x.
data.frame: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document
character: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document
## ## Calculate document_term_frequencies on a data.frame ## data(brussels_reviews_anno) x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")]) x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")]) str(x) brussels_reviews_anno$my_doc_id <- paste(brussels_reviews_anno$doc_id, brussels_reviews_anno$sentence_id) x <- document_term_frequencies(brussels_reviews_anno[, c("my_doc_id", "lemma")]) ## ## Calculate document_term_frequencies on a character vector ## data(brussels_reviews) x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, split = " ") x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, split = "[[:space:][:punct:][:digit:]]+") ## ## document-term-frequencies on several fields to easily include bigram and trigrams ## library(data.table) x <- as.data.table(brussels_reviews_anno) x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)] x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)] x <- document_term_frequencies(x = x, document = "doc_id", term = c("token", "token_bigram", "token_trigram")) head(x)
Term frequency Inverse Document Frequency (tfidf) is calculated as the multiplication of
Term Frequency (tf): how many times the word occurs in the document / how many words are in the document
Inverse Document Frequency (idf): log(number of documents / number of documents where the term appears)
The Okapi BM25 statistic is calculated as the multiplication of the inverse document frequency and the weighted term frequency as defined at https://en.wikipedia.org/wiki/Okapi_BM25.
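To make the tf and idf formulas concrete, here is a small hand-rolled sketch (using the data.table package; the toy doc_id/term/freq values are made up) which computes tf, idf and tf_idf exactly as described above; document_term_frequencies_statistics is the function to use in practice.
library(data.table)
x <- data.table(doc_id = c("d1", "d1", "d2", "d2", "d3"),
                term   = c("nice", "flat", "nice", "noisy", "flat"),
                freq   = c(2, 1, 1, 3, 1))
## tf: how many times the word occurs in the document / how many words are in the document
x[, tf := freq / sum(freq), by = doc_id]
## idf: log(number of documents / number of documents where the term appears)
n_docs <- uniqueN(x$doc_id)
x[, idf := log(n_docs / uniqueN(doc_id)), by = term]
x[, tf_idf := tf * idf]
x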
document_term_frequencies_statistics(x, k = 1.2, b = 0.75)
x |
a data.table as returned by document_term_frequencies |
k |
parameter k1 of the Okapi BM25 ranking function as defined at https://en.wikipedia.org/wiki/Okapi_BM25. Defaults to 1.2. |
b |
parameter b of the Okapi BM25 ranking function as defined at https://en.wikipedia.org/wiki/Okapi_BM25. Defaults to 0.75. |
a data.table with columns doc_id, term, freq and added to that the computed statistics tf, idf, tf_idf, tf_bm25 and bm25.
data(brussels_reviews_anno) x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")]) x <- document_term_frequencies_statistics(x) head(x)
Create a document/term matrix from either
a data.frame with 1 row per document/term as returned by document_term_frequencies
a list of tokens, e.g. from the sentencepiece or tokenizers.bpe packages or just obtained by using strsplit
an object of class DocumentTermMatrix or TermDocumentMatrix from the tm package
an object of class simple_triplet_matrix from the slam package
a regular dense matrix
document_term_matrix(x, vocabulary, weight = "freq", ...) ## S3 method for class 'data.frame' document_term_matrix(x, vocabulary, weight = "freq", ...) ## S3 method for class 'matrix' document_term_matrix(x, ...) ## S3 method for class 'integer' document_term_matrix(x, ...) ## S3 method for class 'numeric' document_term_matrix(x, ...) ## Default S3 method: document_term_matrix(x, vocabulary, ...) ## S3 method for class 'DocumentTermMatrix' document_term_matrix(x, ...) ## S3 method for class 'TermDocumentMatrix' document_term_matrix(x, ...) ## S3 method for class 'simple_triplet_matrix' document_term_matrix(x, ...)
x |
a data.frame with columns doc_id, term and freq indicating how many times a term occurred in that specific document. This is what document_term_frequencies returns. |
vocabulary |
a character vector of terms which should be present in the document term matrix even if they did not occur in x |
weight |
a column of |
... |
further arguments currently not used |
a sparse object of class dgCMatrix with the documents in the rows and the terms in the columns, containing the frequencies provided in x extended with terms which were not in x but were provided in vocabulary.
The rownames of this resulting object contain the doc_id from x
data.frame: Construct a document term matrix from a data.frame with columns doc_id, term, freq
matrix: Construct a sparse document term matrix from a matrix
integer: Construct a sparse document term matrix from a named integer vector
numeric: Construct a sparse document term matrix from a named numeric vector
default: Construct a document term matrix from a list of tokens
DocumentTermMatrix: Convert an object of class DocumentTermMatrix from the tm package to a sparseMatrix
TermDocumentMatrix: Convert an object of class TermDocumentMatrix from the tm package to a sparseMatrix with the documents in the rows and the terms in the columns
simple_triplet_matrix: Convert an object of class simple_triplet_matrix from the slam package to a sparseMatrix
sparseMatrix, document_term_frequencies
x <- data.frame(doc_id = c(1, 1, 2, 3, 4), term = c("A", "C", "Z", "X", "G"), freq = c(1, 5, 7, 10, 0)) document_term_matrix(x) document_term_matrix(x, vocabulary = LETTERS) ## Example on larger dataset data(brussels_reviews_anno) x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")]) dtm <- document_term_matrix(x) dim(dtm) x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")]) x <- document_term_frequencies_statistics(x) dtm <- document_term_matrix(x) dtm <- document_term_matrix(x, weight = "freq") dtm <- document_term_matrix(x, weight = "tf_idf") dtm <- document_term_matrix(x, weight = "bm25") x <- split(brussels_reviews_anno$lemma, brussels_reviews_anno$doc_id) dtm <- document_term_matrix(x) ## example showing the vocabulary argument ## allowing you to make sure terms which are not in the data are provided in the resulting dtm allterms <- unique(x$term) dtm <- document_term_matrix(head(x, 1000), vocabulary = allterms) ## example for a list of tokens x <- list(doc1 = c("aa", "bb", "cc", "aa", "b"), doc2 = c("bb", "bb", "dd", ""), doc3 = character(), doc4 = c("cc", NA), doc5 = character()) document_term_matrix(x) dtm <- document_term_matrix(x, vocabulary = c("a", "bb", "cc")) dtm <- dtm_conform(dtm, rows = c("doc1", "doc2", "doc7"), columns = c("a", "bb", "cc")) data(brussels_reviews) x <- strsplit(setNames(brussels_reviews$feedback, brussels_reviews$id), split = " +") x <- document_term_matrix(x) ## ## Example adding bigrams/trigrams to the document term matrix ## Mark that this can also be done using ?dtm_cbind ## library(data.table) x <- as.data.table(brussels_reviews_anno) x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)] x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)] x <- document_term_frequencies(x = x, document = "doc_id", term = c("token", "token_bigram", "token_trigram")) dtm <- document_term_matrix(x) ## ## Convert dense matrix to sparse matrix ## x <- matrix(c(0, 0, 0, 1, NA, 3, 4, 5, 6, 7), nrow = 2) x dtm <- document_term_matrix(x) dtm x <- matrix(c(0, 0, 0, 0.1, NA, 0.3, 0.4, 0.5, 0.6, 0.7), nrow = 2) x dtm <- document_term_matrix(x) dtm x <- setNames(c(TRUE, NA, FALSE, FALSE), c("a", "b", "c", "d")) x <- as.matrix(x) dtm <- document_term_matrix(x) dtm ## ## Convert vectors to sparse matrices ## x <- setNames(-3:3, c("a", "b", "c", "d", "e", "f")) dtm <- document_term_matrix(x) dtm x <- setNames(runif(6), c("a", "b", "c", "d", "e", "f")) dtm <- document_term_matrix(x) dtm ## ## Convert lists to sparse matrices ## x <- list(a = c("some", "set", "of", "words"), b1 = NA, b2 = NA, c1 = character(), c2 = 0, d = c("words", "words", "words")) dtm <- document_term_matrix(x) dtm
This utility function is useful to align a Document-Term-Matrix with information in a data.frame or a vector to predict, such that both the predictive information and the target are available in the same order.
Matching is done based on the identifiers in the rownames of x and either the names of the y vector or the first column of y in case it is a data.frame.
dtm_align(x, y, FUN, ...)
x |
a Document-Term-Matrix of class dgCMatrix (which can be an object returned by document_term_matrix) |
y |
either a vector or data.frame containing something to align with x |
FUN |
a function to be applied on |
... |
further arguments passed on to FUN |
a list with elements x and y containing the document term matrix x in the same order as y.
If a vector was passed in y, the returned y element will be a vector
If a data.frame with more than 2 columns was passed in y, the returned y element will be a data.frame
If a data.frame with exactly 2 columns was passed in y, the returned y element will be a vector
Only data of x with overlapping identifiers in y is returned.
x <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c"))) x dtm_align(x = x, y = c(b = 1, a = 2, c = 6, d = 6)) dtm_align(x = x, y = c(b = 1, a = 2, c = 6, d = 6, d = 7, a = -1)) data(brussels_reviews) data(brussels_listings) x <- brussels_reviews x <- strsplit.data.frame(x, term = "feedback", group = "listing_id") x <- document_term_frequencies(x) x <- document_term_matrix(x) y <- brussels_listings$price names(y) <- brussels_listings$listing_id ## align a matrix of predictors with a vector to predict trainset <- dtm_align(x = x, y = y) trainset <- dtm_align(x = x, y = y, FUN = function(dtm){ dtm <- dtm_remove_lowfreq(dtm, minfreq = 5) dtm <- dtm_sample(dtm) dtm }) head(names(y)) head(rownames(x)) head(names(trainset$y)) head(rownames(trainset$x)) ## align a matrix of predictors with a data.frame trainset <- dtm_align(x = x, y = brussels_listings[, c("listing_id", "price")]) trainset <- dtm_align(x = x, y = brussels_listings[, c("listing_id", "price", "room_type")]) head(trainset$y$listing_id) head(rownames(trainset$x)) ## example with duplicate data in case of data balancing dtm_align(x = matrix(1:30, nrow = 3, dimnames = list(c("a", "b", "c"))), y = c(a = 1, a = 2, b = 3, d = 6, b = 6)) target <- subset(brussels_listings, listing_id %in% brussels_reviews$listing_id) target <- rbind(target[1:3, ], target[c(2, 3), ], target[c(1, 4), ]) trainset <- dtm_align(x = x, y = target[, c("listing_id", "price")]) trainset <- dtm_align(x = x, y = setNames(target$price, target$listing_id)) names(trainset$y) rownames(trainset$x)
These 2 methods provide cbind and rbind functionality for sparse matrix objects which are returned by document_term_matrix.
In case of dtm_cbind, if the rows are not ordered in the same way in x and y, it will order them based on the rownames. If there are missing rows these will be filled with NA values.
In case of dtm_rbind, if the columns are not ordered in the same way in x and y, it will order them based on the colnames. If there are missing columns these will be filled with NA values.
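A small sketch (using made-up mini documents, in the style of the document_term_matrix examples) showing the row alignment of dtm_cbind when the two matrices do not cover the same documents; the rows which are missing on one side are filled with NA as described above.
library(udpipe)
tokens  <- list(doc1 = c("nice", "flat"),  doc2 = c("noisy", "street"))
bigrams <- list(doc1 = c("nice flat"),     doc3 = c("city centre"))
dtm1 <- document_term_matrix(tokens)
dtm2 <- document_term_matrix(bigrams)
m <- dtm_cbind(dtm1, dtm2)
## doc2 has no bigram columns and doc3 has no token columns: those cells are NA
as.matrix(m)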
dtm_cbind(x, y, ...) dtm_rbind(x, y, ...)
x |
a sparse matrix such as a "dgCMatrix" object which is returned by |
y |
a sparse matrix such as a "dgCMatrix" object which is returned by |
... |
more sparse matrices |
a sparse matrix where either rows are put below each other in case of dtm_rbind or columns are put next to each other in case of dtm_cbind
data(brussels_reviews_anno) x <- brussels_reviews_anno ## rbind dtm1 <- document_term_frequencies(x = subset(x, doc_id %in% c("10049756", "10284782")), document = "doc_id", term = "token") dtm1 <- document_term_matrix(dtm1) dtm2 <- document_term_frequencies(x = subset(x, doc_id %in% c("10789408", "12285061", "35509091")), document = "doc_id", term = "token") dtm2 <- document_term_matrix(dtm2) dtm3 <- document_term_frequencies(x = subset(x, doc_id %in% c("31133394", "36224131")), document = "doc_id", term = "token") dtm3 <- document_term_matrix(dtm3) m <- dtm_rbind(dtm1, dtm2) dim(m) m <- dtm_rbind(dtm1, dtm2, dtm3) dim(m) ## cbind library(data.table) x <- subset(brussels_reviews_anno, language %in% c("nl", "fr")) x <- as.data.table(x) x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)] x <- x[, lemma_upos := sprintf("%s//%s", lemma, upos)] dtm1 <- document_term_frequencies(x = x, document = "doc_id", term = c("token")) dtm1 <- document_term_matrix(dtm1) dtm2 <- document_term_frequencies(x = x, document = "doc_id", term = c("token_bigram")) dtm2 <- document_term_matrix(dtm2) dtm3 <- document_term_frequencies(x = x, document = "doc_id", term = c("upos")) dtm3 <- document_term_matrix(dtm3) dtm4 <- document_term_frequencies(x = x, document = "doc_id", term = c("lemma_upos")) dtm4 <- document_term_matrix(dtm4) m <- dtm_cbind(dtm1, dtm2) dim(m) m <- dtm_cbind(dtm1, dtm2, dtm3, dtm4) dim(m) m <- dtm_cbind(dtm1[-c(100, 999), ], dtm2[-1000,]) dim(m)
Perform a chisq.test to compare if groups of documents have more prevalence of specific terms.
For each term in the document term matrix, the function applies a chisq.test which compares the frequency of occurrence of that term with the frequency of occurrence of the other terms, across the two document groups.
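One way to read this: for a single term, the comparison boils down to a 2 x 2 contingency table of the frequency of that term versus all other terms in the TRUE and FALSE document groups. The sketch below (with made-up mini documents) illustrates that idea for one term; it is an illustration of the statistic, not the internal code of dtm_chisq.
library(udpipe)
x <- data.frame(doc_id = c("d1", "d1", "d2", "d2", "d3", "d3"),
                term   = c("nice", "flat", "nice", "noisy", "street", "flat"))
dtm    <- document_term_matrix(document_term_frequencies(x))
groups <- rownames(dtm) %in% c("d1", "d2")   # TRUE/FALSE split of the documents
term   <- "nice"
freq_true  <- sum(dtm[groups, term]);  other_true  <- sum(dtm[groups, ])  - freq_true
freq_false <- sum(dtm[!groups, term]); other_false <- sum(dtm[!groups, ]) - freq_false
tab <- matrix(c(freq_true, other_true, freq_false, other_false), nrow = 2, byrow = TRUE,
              dimnames = list(c("group_TRUE", "group_FALSE"), c("term", "other_terms")))
chisq.test(tab)   # tiny toy counts, so expect a warning about the approximation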
dtm_chisq(dtm, groups, correct = TRUE, ...)
dtm |
a document term matrix: an object returned by document_term_matrix |
groups |
a logical vector with 2 groups (TRUE / FALSE) where the size of the |
correct |
passed on to chisq.test |
... |
further arguments passed on to chisq.test |
a data.frame with columns term, chisq, p.value, freq, freq_true, freq_false indicating for each term in the dtm how frequently it occurs in each group, the Chi-Square value and its corresponding p-value.
data(brussels_reviews_anno) ## ## Which nouns occur in text containing the term 'centre' ## x <- subset(brussels_reviews_anno, xpos == "NN" & language == "fr") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) relevant <- dtm_chisq(dtm, groups = dtm[, "centre"] > 0) head(relevant, 10) ## ## Which adjectives occur in text containing the term 'hote' ## x <- subset(brussels_reviews_anno, xpos == "JJ" & language == "fr") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) group <- subset(brussels_reviews_anno, lemma %in% "hote") group <- rownames(dtm) %in% group$doc_id relevant <- dtm_chisq(dtm, groups = group) head(relevant, 10) ## Not run: # do not show scientific notation of the p-values options(scipen = 100) head(relevant, 10) ## End(Not run)
Column sums and Row sums for document term matrices
dtm_colsums(dtm, groups) dtm_rowsums(dtm, groups)
dtm |
an object returned by document_term_matrix |
groups |
optionally, a list with column/row names or column/row indexes of the |
Returns either a vector in case argument groups is not provided, or a sparse matrix of class dgCMatrix in case argument groups is provided:
in case groups is not provided: a vector of row/column sums with corresponding names
in case groups is provided: a sparse matrix containing summed information over the groups of rows/columns
x <- data.frame( doc_id = c(1, 1, 2, 3, 4), term = c("A", "C", "Z", "X", "G"), freq = c(1, 5, 7, 10, 0)) dtm <- document_term_matrix(x) x <- dtm_colsums(dtm) x x <- dtm_rowsums(dtm) head(x) ## ## Grouped column summation ## x <- list(doc1 = c("aa", "bb", "aa", "b"), doc2 = c("bb", "bb", "BB")) dtm <- document_term_matrix(x) dtm dtm_colsums(dtm, groups = list(combinedB = c("b", "bb"), combinedA = c("aa", "A"))) dtm_colsums(dtm, groups = list(combinedA = c("aa", "A"))) dtm_colsums(dtm, groups = list( combinedB = grep(pattern = "b", colnames(dtm), ignore.case = TRUE, value = TRUE), combinedA = c("aa", "A", "ZZZ"), test = character())) dtm_colsums(dtm, groups = list()) ## ## Grouped row summation ## x <- list(doc1 = c("aa", "bb", "aa", "b"), doc2 = c("bb", "bb", "BB"), doc3 = c("bb", "bb", "BB"), doc4 = c("bb", "bb", "BB", "b")) dtm <- document_term_matrix(x) dtm dtm_rowsums(dtm, groups = list(doc1 = "doc1", combi = c("doc2", "doc3", "doc4"))) dtm_rowsums(dtm, groups = list(unknown = "docUnknown", combi = c("doc2", "doc3", "doc4"))) dtm_rowsums(dtm, groups = list())
Makes sure the document term matrix has exactly the rows and columns which you specify. If rows or columns are missing, the function fills these up either with empty cells or with the value that you provide. See the examples.
dtm_conform(dtm, rows, columns, fill)
dtm |
a document term matrix: an object returned by document_term_matrix |
rows |
a character vector of row names which dtm should have |
columns |
a character vector of column names which dtm should have |
fill |
a value to use to fill up missing rows / columns. Defaults to using an empty cell. |
the sparse matrix dtm with exactly the specified rows and columns
x <- data.frame(doc_id = c("doc_1", "doc_1", "doc_1", "doc_2"), text = c("a", "a", "b", "c"), stringsAsFactors = FALSE) dtm <- document_term_frequencies(x) dtm <- document_term_matrix(dtm) dtm dtm_conform(dtm, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(dtm, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y"), fill = 1) dtm_conform(dtm, rows = c("doc_1", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(dtm, columns = c("a", "b", "Z")) dtm_conform(dtm, rows = c("doc_1")) dtm_conform(dtm, rows = character()) dtm_conform(dtm, columns = character()) dtm_conform(dtm, rows = character(), columns = character()) ## ## Some examples on border line cases ## special1 <- dtm[, character()] special2 <- dtm[character(), character()] special3 <- dtm[character(), ] dtm_conform(special1, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(special1, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y"), fill = 1) dtm_conform(special1, rows = c("doc_1", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(special1, columns = c("a", "b", "Z")) dtm_conform(special1, rows = c("doc_1")) dtm_conform(special1, rows = character()) dtm_conform(special1, columns = character()) dtm_conform(special1, rows = character(), columns = character()) dtm_conform(special2, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(special2, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y"), fill = 1) dtm_conform(special2, rows = c("doc_1", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(special2, columns = c("a", "b", "Z")) dtm_conform(special2, rows = c("doc_1")) dtm_conform(special2, rows = character()) dtm_conform(special2, columns = character()) dtm_conform(special2, rows = character(), columns = character()) dtm_conform(special3, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(special3, rows = c("doc_1", "doc_2", "doc_3"), columns = c("a", "b", "c", "Z", "Y"), fill = 1) dtm_conform(special3, rows = c("doc_1", "doc_3"), columns = c("a", "b", "c", "Z", "Y")) dtm_conform(special3, columns = c("a", "b", "Z")) dtm_conform(special3, rows = c("doc_1")) dtm_conform(special3, rows = character()) dtm_conform(special3, columns = character()) dtm_conform(special3, rows = character(), columns = character())
Pearson Correlation for Sparse Matrices. More memory- and time-efficient than cor(as.matrix(x)).
dtm_cor(x)
x |
A matrix, potentially a sparse matrix such as a "dgCMatrix" object
which is returned by |
a correlation matrix
x <- data.frame( doc_id = c(1, 1, 2, 3, 4), term = c("A", "C", "Z", "X", "G"), freq = c(1, 5, 7, 10, 0)) dtm <- document_term_matrix(x) dtm_cor(dtm)
Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms
dtm_remove_lowfreq(dtm, minfreq = 5, maxterms, remove_emptydocs = TRUE)
dtm |
an object returned by |
minfreq |
integer with the minimum number of times the term should occur in order to keep the term |
maxterms |
integer indicating the maximum number of terms which should be kept in the |
remove_emptydocs |
logical indicating to remove documents containing no more terms after the term removal is executed. Defaults to |
a sparse Matrix as returned by sparseMatrix
where terms with low occurrence are removed and documents without any terms are also removed
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, xpos == "NN") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) ## Remove terms with low frequencies and documents with no terms x <- dtm_remove_lowfreq(dtm, minfreq = 10) dim(x) x <- dtm_remove_lowfreq(dtm, minfreq = 10, maxterms = 25) dim(x) x <- dtm_remove_lowfreq(dtm, minfreq = 10, maxterms = 25, remove_emptydocs = FALSE) dim(x)
Remove terms with high sparsity from a Document-Term-Matrix and remove documents with no terms left.
Sparsity indicates the proportion of documents in which the term does not occur.
dtm_remove_sparseterms(dtm, sparsity = 0.99, remove_emptydocs = TRUE)
dtm |
an object returned by |
sparsity |
numeric in 0-1 range indicating the sparsity percent. Defaults to 0.99 meaning drop terms which occur in less than 1 percent of the documents. |
remove_emptydocs |
logical indicating to remove documents containing no more terms after the term removal is executed. Defaults to |
a sparse Matrix as returned by sparseMatrix
where terms with high sparsity are removed and documents without any terms are also removed
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, xpos == "NN") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) ## Remove terms with low frequencies and documents with no terms x <- dtm_remove_sparseterms(dtm, sparsity = 0.99) dim(x) x <- dtm_remove_sparseterms(dtm, sparsity = 0.99, remove_emptydocs = FALSE) dim(x)
Remove terms from a Document-Term-Matrix and keep only documents which have at least some terms
dtm_remove_terms(dtm, terms, remove_emptydocs = TRUE)
dtm |
an object returned by |
terms |
a character vector of terms which are in |
remove_emptydocs |
logical indicating to remove documents containing no more terms after the term removal is executed. Defaults to |
a sparse Matrix as returned by sparseMatrix
where the indicated terms are removed as well as documents with no terms whatsoever
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, xpos == "NN") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) dim(dtm) x <- dtm_remove_terms(dtm, terms = c("appartement", "casa", "centrum", "ciudad")) dim(x) x <- dtm_remove_terms(dtm, terms = c("appartement", "casa", "centrum", "ciudad"), remove_emptydocs = FALSE) dim(x)
Remove terms from a Document-Term-Matrix and documents with no terms, based on the term frequency inverse document frequency (tf-idf). Terms are selected either by giving the maximum number of terms to keep (argument top), a tf-idf cutoff (argument cutoff) or a quantile (argument prob).
dtm_remove_tfidf(dtm, top, cutoff, prob, remove_emptydocs = TRUE)
dtm |
an object returned by |
top |
integer with the number of terms which should be kept as defined by the highest mean tfidf |
cutoff |
numeric cutoff value to keep only terms in |
prob |
numeric quantile indicating to keep only terms in |
remove_emptydocs |
logical indicating to remove documents containing no more terms after the term removal is executed. Defaults to |
a sparse Matrix as returned by sparseMatrix
where terms with high tfidf are kept and documents without any remaining terms are removed
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, xpos == "NN") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) dtm <- dtm_remove_lowfreq(dtm, minfreq = 10) dim(dtm) ## Keep only terms with high tfidf x <- dtm_remove_tfidf(dtm, top=50) dim(x) x <- dtm_remove_tfidf(dtm, top=50, remove_emptydocs = FALSE) dim(x) ## Keep only terms with tfidf above 1.1 x <- dtm_remove_tfidf(dtm, cutoff=1.1) dim(x) ## Keep only terms with tfidf above the 60 percent quantile x <- dtm_remove_tfidf(dtm, prob=0.6) dim(x)
Inverse operation of the document_term_matrix function. Creates a frequency table which contains 1 row per document/term combination.
dtm_reverse(x)
x |
an object as returned by |
a data.frame with columns doc_id, term and freq where freq is just the value in each cell of x
x <- data.frame( doc_id = c(1, 1, 2, 3, 4), term = c("A", "C", "Z", "X", "G"), freq = c(1, 5, 7, 10, 0)) dtm <- document_term_matrix(x) dtm_reverse(dtm)
Sample the specified number of rows from the Document-Term-Matrix, with or without replacement.
dtm_sample(dtm, size = nrow(dtm), replace = FALSE, prob = NULL)
dtm |
a document term matrix of class dgCMatrix (which can be an object returned by |
size |
a positive number, the number of rows to sample |
replace |
should sampling be with replacement |
prob |
a vector of probability weights, one for each row of |
dtm
with as many rows as specified in size
x <- list(doc1 = c("aa", "bb", "cc", "aa", "b"), doc2 = c("bb", "bb", "dd", ""), doc3 = character(), doc4 = c("cc", NA), doc5 = character()) dtm <- document_term_matrix(x) dtm_sample(dtm, size = 2) dtm_sample(dtm, size = 3) dtm_sample(dtm, size = 2) dtm_sample(dtm, size = 8, replace = TRUE) dtm_sample(dtm, size = 8, replace = TRUE, prob = c(1, 1, 0.01, 0.5, 0.01))
Calculate the similarity of a document term matrix to a set of terms based on
a Singular Value Decomposition (SVD) embedding matrix.
This can be used to easily construct a sentiment score based on the latent scale defined by a set of positive or negative terms.
dtm_svd_similarity( dtm, embedding, weights, terminology = rownames(embedding), type = c("cosine", "dot") )
dtm |
a sparse matrix such as a "dgCMatrix" object which is returned by |
embedding |
a matrix containing the |
weights |
a numeric vector with weights giving your definition of which terms are positive or negative, The names of this vector should be terms available in the rownames of the embedding matrix. See the examples. |
terminology |
a character vector of terms to limit the calculation of the similarity for the |
type |
either 'cosine' or 'dot' indicating to respectively calculate cosine similarities or inner product similarities between the |
an object of class 'svd_similarity' which is a list with elements
weights: The weights used. These are scaled to sum up to 1 on both the positive and the negative side.
type: The type of similarity calculated (either 'cosine' or 'dot')
terminology: A data.frame with columns term, freq and similarity where similarity indicates the similarity between the term and the SVD embedding space of the weights and freq is how frequently the term occurs in the dtm. This dataset is sorted in descending order by similarity.
similarity: A data.frame with columns doc_id and similarity indicating the similarity between the dtm and the SVD embedding space of the weights. The doc_id is the identifier taken from the rownames of dtm.
scale: A list with elements terminology and weights, indicating respectively the similarity in the SVD embedding space between the terminology and each of the weights, and between the weight terms themselves.
https://en.wikipedia.org/wiki/Latent_semantic_analysis
data("brussels_reviews_anno", package = "udpipe") x <- subset(brussels_reviews_anno, language %in% "nl" & (upos %in% "ADJ" | lemma %in% "niet")) dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma") dtm <- document_term_matrix(dtm) dtm <- dtm_remove_lowfreq(dtm, minfreq = 3) ## Function performing Singular Value Decomposition on sparse/dense data dtm_svd <- function(dtm, dim = 5, type = c("RSpectra", "svd"), ...){ type <- match.arg(type) if(type == "svd"){ SVD <- svd(dtm, nu = 0, nv = dim, ...) }else if(type == "RSpectra"){ #Uncomment this if you want to use the faster sparse SVD by RSpectra #SVD <- RSpectra::svds(dtm, nu = 0, k = dim, ...) } rownames(SVD$v) <- colnames(dtm) SVD$v } #embedding <- dtm_svd(dtm, dim = 5) embedding <- dtm_svd(dtm, dim = 5, type = "svd") ## Define positive / negative terms and calculate the similarity to these weights <- setNames(c(1, 1, 1, 1, -1, -1, -1, -1), c("fantastisch", "schoon", "vriendelijk", "net", "lawaaiig", "lastig", "niet", "slecht")) scores <- dtm_svd_similarity(dtm, embedding = embedding, weights = weights) scores str(scores$similarity) hist(scores$similarity$similarity) plot(scores$terminology$similarity_weight, log(scores$terminology$freq), type = "n") text(scores$terminology$similarity_weight, log(scores$terminology$freq), labels = scores$terminology$term) ## Not run: ## More elaborate example using word2vec ## building word2vec model on all Dutch texts, ## finding similarity of dtm to adjectives only set.seed(123) library(word2vec) text <- subset(brussels_reviews_anno, language == "nl") text <- paste.data.frame(text, term = "lemma", group = "doc_id") text <- text$lemma model <- word2vec(text, dim = 10, iter = 20, type = "cbow", min_count = 1) predict(model, newdata = names(weights), type = "nearest", top_n = 3) embedding <- as.matrix(model) ## End(Not run) data(brussels_reviews_w2v_embeddings_lemma_nl) embedding <- brussels_reviews_w2v_embeddings_lemma_nl adjective <- subset(brussels_reviews_anno, language %in% "nl" & upos %in% "ADJ") adjective <- txt_freq(adjective$lemma) adjective <- subset(adjective, freq >= 5 & nchar(key) > 1) adjective <- adjective$key scores <- dtm_svd_similarity(dtm, embedding, weights = weights, type = "dot", terminology = adjective) scores plot(scores$terminology$similarity_weight, log(scores$terminology$freq), type = "n") text(scores$terminology$similarity_weight, log(scores$terminology$freq), labels = scores$terminology$term, cex = 0.8)
Term Frequency - Inverse Document Frequency calculation, averaged for each term over all documents.
dtm_tfidf(dtm)
dtm |
an object returned by |
a vector with tfidf values, one for each term in the dtm
matrix
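To clarify what is being averaged, below is a minimal sketch of a mean tf-idf computation on a toy count matrix. It is illustrative only and is not guaranteed to match the exact internals of dtm_tfidf.
## Illustrative only: mean tf-idf per term on a toy count matrix
m <- matrix(c(1, 0, 2,
              0, 3, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))
tf    <- m / rowSums(m)                    ## term frequency within each document
idf   <- log(nrow(m) / colSums(m > 0))     ## inverse document frequency per term
tfidf <- colMeans(sweep(tf, 2, idf, "*"))  ## mean tf-idf per term over all documents
tfidf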
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, xpos == "NN") x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) ## Calculate tfidf tfidf <- dtm_tfidf(dtm) hist(tfidf, breaks = "scott") head(sort(tfidf, decreasing = TRUE)) head(sort(tfidf, decreasing = FALSE))
Collocations are sequences of words or terms that co-occur more often than would be expected by chance.
Common collocations are adjectives + nouns, nouns followed by nouns, verbs and nouns, adverbs and adjectives,
verbs and prepositional phrases or verbs and adverbs.
This function extracts relevant collocations and computes the following statistics on them
which are indicators of how likely two terms are collocated compared to being independent.
PMI (pointwise mutual information): log2(P(w1w2) / P(w1) P(w2))
MD (mutual dependency): log2(P(w1w2)^2 / P(w1) P(w2))
LFMD (log-frequency biased mutual dependency): MD + log2(P(w1w2))
As natural language is not random (otherwise you wouldn't understand what I'm saying), most combinations of terms are significant. That's why these indicators of collocation are mainly used to rank the collocations.
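For illustration, these statistics can be computed directly from raw counts as shown below. The counts are made up and the snippet is a sketch of the formulas above, not the package's internal code.
## Illustrative only: PMI, MD and LFMD from made-up bigram counts
n_total <- 10000                        ## total number of observed bigrams
n_w1    <- 150                          ## frequency of the left word w1
n_w2    <- 80                           ## frequency of the right word w2
n_w1w2  <- 25                           ## frequency of the bigram w1 w2
p_w1    <- n_w1 / n_total
p_w2    <- n_w2 / n_total
p_w1w2  <- n_w1w2 / n_total
pmi  <- log2(p_w1w2 / (p_w1 * p_w2))    ## pointwise mutual information
md   <- log2(p_w1w2^2 / (p_w1 * p_w2))  ## mutual dependency
lfmd <- md + log2(p_w1w2)               ## log-frequency biased mutual dependency
c(pmi = pmi, md = md, lfmd = lfmd)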
keywords_collocation(x, term, group, ngram_max = 2, n_min = 2, sep = " ") collocation(x, term, group, ngram_max = 2, n_min = 2, sep = " ")
x |
a data.frame with one row per term where the sequence of the terms correspond to
the natural order of a text. The data frame |
term |
a character vector with 1 column from |
group |
a character vector with 1 or several columns from |
ngram_max |
integer indicating the size of the collocations. Defaults to 2, indicating to compute bigrams. If set to 3, will find collocations of bigrams and trigrams. |
n_min |
integer indicating the frequency of how many times a collocation should at least occur in the data in order to be returned. Defaults to 2. |
sep |
character string with the separator which will be used to |
a data.frame with columns
keyword: the terms which are combined as a collocation
ngram: the number of terms which are combined
left: the left term of the collocation
right: the right term of the collocation
freq: the number of times the collocation occurred in the data
freq_left: the number of times the left element of the collocation occurred in the data
freq_right: the number of times the right element of the collocation occurred in the data
pmi: the pointwise mutual information
md: mutual dependency
lfmd: log-frequency biased mutual dependency
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, language %in% "fr") colloc <- keywords_collocation(x, term = "lemma", group = c("doc_id", "sentence_id"), ngram_max = 3, n_min = 10) head(colloc, 10) ## Example on finding collocations of nouns preceded by an adjective library(data.table) x <- as.data.table(x) x <- x[, xpos_previous := txt_previous(xpos, n = 1), by = list(doc_id, sentence_id)] x <- x[, xpos_next := txt_next(xpos, n = 1), by = list(doc_id, sentence_id)] x <- subset(x, (xpos %in% c("NN") & xpos_previous %in% c("JJ")) | (xpos %in% c("JJ") & xpos_next %in% c("NN"))) colloc <- keywords_collocation(x, term = "lemma", group = c("doc_id", "sentence_id"), ngram_max = 2, n_min = 2) head(colloc)
This function allows you to extract phrases, like simple noun phrases, complex noun phrases or any exact sequence of parts of speech tag patterns. An example use case is to get all text where an adjective is followed by a noun, or to get all phrases consisting of a preposition followed by a noun which is in turn followed by a verb. More complex patterns are shown in the details below.
keywords_phrases( x, term = x, pattern, is_regex = FALSE, sep = " ", ngram_max = 8, detailed = TRUE ) phrases( x, term = x, pattern, is_regex = FALSE, sep = " ", ngram_max = 8, detailed = TRUE )
x |
a character vector of Parts of Speech tags where we want to locate a relevant sequence of POS tags as defined in |
term |
a character vector of the same length as |
pattern |
In case |
is_regex |
logical indicating if |
sep |
character indicating how to collapse the phrase of terms which are found. Defaults to using a space. |
ngram_max |
an integer indicating to allow phrases to be found up to |
detailed |
logical indicating to return the exact positions where the phrase was found (set to |
Common phrases which you might be interested in and which can be supplied to pattern
are
Simple noun phrase: "(A|N)*N(P+D*(A|N)*N)*"
Simple verb Phrase: "((A|N)*N(P+D*(A|N)*N)*P*(M|V)*V(M|V)*|(M|V)*V(M|V)*D*(A|N)*N(P+D*(A|N)*N)*|(M|V)*V(M|V)*(P+D*(A|N)*N)+|(A|N)*N(P+D*(A|N)*N)*P*((M|V)*V(M|V)*D*(A|N)*N(P+D*(A|N)*N)*|(M|V)*V(M|V)*(P+D*(A|N)*N)+))"
Noun phrase with coordination conjunction: "((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)"
Verb phrase with coordination conjuction: "(((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)(P(CP)*)*(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*|(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*(D(CD)*)*((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)|(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)+|((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)(P(CP)*)*((M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*(D(CD)*)*((A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*(C(D(CD)*)*(A(CA)*|N)*N((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)*)*)|(M(CM)*|V)*V(M(CM)*|V)*(C(M(CM)*|V)*V(M(CM)*|V)*)*((P(CP)*)+(D(CD)*)*(A(CA)*|N)*N)+))"
See the examples.
Note that this functionality is also implemented in the phrasemachine package using plain R code, while this package uses a faster Rcpp implementation for extracting these kinds of regular expression like phrases.
If argument detailed is set to TRUE, a data.frame with columns
keyword: the phrase which corresponds to the collapsed terms of where the pattern was found
ngram: the length of the phrase
pattern: the pattern which was found
start: the starting index of x
where the pattern was found
end: the ending index of x
where the pattern was found
If argument detailed is set to FALSE, aggregate frequency statistics are returned in a data.frame containing the columns keyword, ngram and freq (how many times it occurs)
data(brussels_reviews_anno, package = "udpipe") x <- subset(brussels_reviews_anno, language %in% "fr") ## Find exactly this sequence of POS tags np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), sep = "-") head(np) np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), term = x$token) head(np) ## Find noun phrases with the following regular expression: (A|N)+N(P+D*(A|N)*N)* x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank") nounphrases <- keywords_phrases(x$phrase_tag, term = x$token, pattern = "(A|N)+N(P+D*(A|N)*N)*", is_regex = TRUE, ngram_max = 4, detailed = TRUE) head(nounphrases, 10) head(sort(table(nounphrases$keyword), decreasing=TRUE), 20) ## Find frequent sequences of POS tags library(data.table) x <- as.data.table(x) x <- x[, pos_sequence := txt_nextgram(x = xpos, n = 3), by = list(doc_id, sentence_id)] tail(sort(table(x$pos_sequence))) np <- keywords_phrases(x$xpos, term = x$token, pattern = c("IN", "DT", "NN")) head(np)
RAKE (Rapid Automatic Keyword Extraction) is a basic algorithm which tries to identify keywords in text. Keywords are defined as sequences of words following one another.
The algorithm goes as follows.
candidate keywords are extracted by looking at contiguous sequences of words which do not contain irrelevant words
a score is calculated for each word which is part of any candidate keyword; this is done as follows:
among the words of the candidate keywords, the algorithm counts how many times each word occurs and how many times it co-occurs with other words
each word gets a score which is the ratio of the word degree (how many times it co-occurs with other words) to the word frequency
a RAKE score for the full candidate keyword is calculated by summing up the scores of each of the words which define the candidate keyword
The resulting keywords are returned as a data.frame together with their RAKE score.
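To make the degree-to-frequency ratio concrete, here is a hand-rolled sketch with made-up counts. It only illustrates the scoring idea and is not the implementation used by keywords_rake.
## Illustrative only: RAKE score of the candidate keyword "nice clean room"
## word_freq   = how many times each word occurs within candidate keywords
## word_degree = how many times each word co-occurs with other words in candidate keywords
word_freq   <- c(nice = 6, clean = 4, room = 12)
word_degree <- c(nice = 10, clean = 7, room = 20)
word_score  <- word_degree / word_freq                      ## degree / frequency per word
rake        <- sum(word_score[c("nice", "clean", "room")])  ## sum over the words of the keyword
rake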
keywords_rake( x, term, group, relevant = rep(TRUE, nrow(x)), ngram_max = 2, n_min = 2, sep = " " )
x |
a data.frame with one row per term as returned by |
term |
character string with a column in the data frame |
group |
a character vector with 1 or several columns from |
relevant |
a logical vector of the same length as |
ngram_max |
integer indicating the maximum number of words that there should be in each keyword |
n_min |
integer indicating the frequency of how many times a keywords should at least occur in the data in order to be returned. Defaults to 2. |
sep |
character string with the separator which will be used to |
a data.frame with columns keyword, ngram, freq and rake, ordered from low to high rake
keyword: the keyword
ngram: how many terms are in the keyword
freq: how many times did the keyword occur
rake: the ratio of the degree to the frequency as explained in the description, summed up for all words from the keyword
Rose, Stuart & Engel, Dave & Cramer, Nick & Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. Text Mining: Applications and Theory. 1 - 20. 10.1002/9780470689646.ch1.
data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, language == "nl") keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id", relevant = x$xpos %in% c("NN", "JJ")) head(keywords) x <- subset(brussels_reviews_anno, language == "fr") keywords <- keywords_rake(x = x, term = "lemma", group = c("doc_id", "sentence_id"), relevant = x$xpos %in% c("NN", "JJ"), ngram_max = 10, n_min = 2, sep = "-") head(keywords)
This function is similar to paste but works on a data.frame, hence paste.data.frame. It concatenates text belonging to groups of data together into one string. The function is the inverse operation of strsplit.data.frame.
paste.data.frame(data, term, group, collapse = " ")
data |
a data.frame or data.table |
term |
a string with a column name or a character vector of column names from |
group |
a string with a column name or a character vector of column names from |
collapse |
a character string that you want to use to collapse the text data together. Defaults to a single space. |
A data.frame with 1 row per group containing the columns from group and term, where all the text in term for each group is pasted together, separated by the collapse argument.
data(brussels_reviews_anno, package = "udpipe") head(brussels_reviews_anno) x <- paste.data.frame(brussels_reviews_anno, term = "lemma", group = c("doc_id", "sentence_id")) str(x) x <- paste.data.frame(brussels_reviews_anno, term = c("lemma", "token"), group = c("doc_id", "sentence_id"), collapse = "-") str(x)
Gives either the predictions to which topic a document belongs or
the term posteriors by topic indicating which terms are emitted by each topic.
If you provide in newdata a document term matrix for which a document does not contain any text, and hence does not have any terms with nonzero entries, the topic prediction will be NA for that document (see the examples).
## S3 method for class 'LDA_VEM' predict( object, newdata, type = c("topics", "terms"), min_posterior = -1, min_terms = 0, labels, ... ) ## S3 method for class 'LDA_Gibbs' predict( object, newdata, type = c("topics", "terms"), min_posterior = -1, min_terms = 0, labels, ... )
object |
an object of class LDA_VEM or LDA_Gibbs as returned by |
newdata |
a document/term matrix containing data for which to make a prediction |
type |
either 'topics' or 'terms' for the topic predictions or the term posteriors |
min_posterior |
numeric in 0-1 range to output only terms emitted by each
topic which have a posterior probability equal or higher than |
min_terms |
integer indicating the minimum number of terms to keep in the output when |
labels |
a character vector of the same length as the number of topics in the topic model. Indicating how to label the topics. Only valid for type = 'topic'. Defaults to topic_prob_001 up to topic_prob_999. |
... |
further arguments passed on to topicmodels::posterior |
in case of type = 'topics': a data.table with columns doc_id, topic (the topic number to which the document is assigned), topic_label (the topic label), topic_prob (the posterior probability score for that topic), topic_probdiff_2nd (the probability score for that topic minus the probability score for the 2nd highest topic) and the probability scores for each topic as indicated by topic_labelyourownlabel
in case of type = 'terms': a list of data.frames with columns term and prob, giving the posterior probability that each term is emitted by the topic
## Build document/term matrix on dutch nouns data(brussels_reviews_anno) data(brussels_reviews) x <- subset(brussels_reviews_anno, language == "nl") x <- subset(x, xpos %in% c("JJ")) x <- x[, c("doc_id", "lemma")] x <- document_term_frequencies(x) dtm <- document_term_matrix(x) dtm <- dtm_remove_lowfreq(dtm, minfreq = 10) dtm <- dtm_remove_tfidf(dtm, top = 100) ## Fit a topicmodel using VEM library(topicmodels) mymodel <- LDA(x = dtm, k = 4, method = "VEM") ## Get topic terminology terminology <- predict(mymodel, type = "terms", min_posterior = 0.05, min_terms = 3) terminology ## Get scores alongside the topic model dtm <- document_term_matrix(x, vocabulary = mymodel@terms) scores <- predict(mymodel, newdata = dtm, type = "topics") scores <- predict(mymodel, newdata = dtm, type = "topics", labels = c("mylabel1", "xyz", "app-location", "newlabel")) head(scores) table(scores$topic) table(scores$topic_label) table(scores$topic, exclude = c()) table(scores$topic_label, exclude = c()) ## Fit a topicmodel using Gibbs library(topicmodels) mymodel <- LDA(x = dtm, k = 4, method = "Gibbs") terminology <- predict(mymodel, type = "terms", min_posterior = 0.05, min_terms = 3) scores <- predict(mymodel, type = "topics", newdata = dtm)
Obtain a tokenised data frame by splitting text on a regular expression. This is the inverse operation of paste.data.frame.
strsplit.data.frame( data, term, group, split = "[[:space:][:punct:][:digit:]]+", ... )
data |
a data.frame or data.table |
term |
a character with a column name from |
group |
a string with a column name or a character vector of column names from |
split |
a regular expression indicating how to split the |
... |
further arguments passed on to |
A tokenised data frame containing one row per token.
This data.frame has the columns from group and term, where the text in column term is split by the provided regular expression into tokens.
data(brussels_reviews, package = "udpipe") x <- strsplit.data.frame(brussels_reviews, term = "feedback", group = "id") head(x) x <- strsplit.data.frame(brussels_reviews, term = c("feedback"), group = c("listing_id", "language")) head(x) x <- strsplit.data.frame(brussels_reviews, term = "feedback", group = "id", split = " ", fixed = TRUE) head(x)
Currently undocumented
## S4 method for signature 'syntaxrelation,logical' e1 | e2 ## S4 method for signature 'logical,syntaxrelation' e1 | e2 ## S4 method for signature 'syntaxrelation,logical' e1 & e2 ## S4 method for signature 'logical,syntaxrelation' e1 & e2
e1 |
Currently undocumented |
e2 |
Currently undocumented |
Collapse a character vector while removing missing data.
txt_collapse(x, collapse = " ")
x |
a character vector or a list of character vectors |
collapse |
a character string to be used to collapse the vector. Defaults to a space: ' '. |
a character vector of length 1 with the content of x collapsed using paste
txt_collapse(c(NA, "hello", "world", NA)) x <- list(a = c("h", "i"), b = c("some", "more", "text"), c = character(), d = NA) txt_collapse(x, collapse = " ")
Look up text which has a certain pattern. This pattern lookup is performed by executing a regular expression using grepl.
txt_contains(x, patterns, value = FALSE, ignore.case = TRUE, ...)
x |
a character vector with text |
patterns |
a regular expression which might be contained in |
value |
logical, indicating to return the elements of |
ignore.case |
logical, if set to |
... |
other parameters which can be passed on to |
a logical vector of the same length as x indicating if one of the patterns was found in x, or the vector of elements of x where the pattern was found in case argument value is set to TRUE
x <- c("The cats are eating catfood", "Our cat is eating the catfood", "the dog eats catfood, he likes it", NA) txt_contains(x, patterns = c("cat", "dog")) txt_contains(x, patterns = c("cat", "dog"), value = TRUE) txt_contains(x, patterns = c("eats"), value = TRUE) txt_contains(x, patterns = c("^The"), ignore.case = FALSE, value = TRUE) txt_contains(x, patterns = list(include = c("cat"), exclude = c("dog")), value = TRUE) txt_contains(x, "cat") & txt_contains(x, "dog")
If you have annotated your text using udpipe_annotate, your text is tokenised into a sequence of words. Based on this vector of words in sequence, getting n-grams comes down to looking at the previous/next word and the subsequent previous/next word, and so forth. These words can be pasted together to form an n-gram.
txt_context(x, n = c(-1, 0, 1), sep = " ", na.rm = FALSE)
x |
a character vector where each element is just 1 term or word |
n |
an integer vector indicating how many terms to look back and ahead |
sep |
a character element indicating how to |
na.rm |
logical, if set to |
a character vector of the same length as x
with the n-grams
txt_paste, txt_next, txt_previous, shift
x <- c("We", "walked", "anxiously", "to", "the", "doctor", "!") ## Look 1 word before + word itself y <- txt_context(x, n = c(-1, 0), na.rm = FALSE) data.frame(x, y) ## Look 1 word before + word itself + 1 word after y <- txt_context(x, n = c(-1, 0, 1), na.rm = FALSE) data.frame(x, y) y <- txt_context(x, n = c(-1, 0, 1), na.rm = TRUE) data.frame(x, y) ## Look 2 words before + word itself + 1 word after ## even if not all words are there y <- txt_context(x, n = c(-2, -1, 0, 1), na.rm = TRUE, sep = "_") data.frame(x, y) y <- txt_context(x, n = c(-2, -1, 1, 2), na.rm = FALSE, sep = "_") data.frame(x, y) x <- c("We", NA, NA, "to", "the", "doctor", "!") y <- txt_context(x, n = c(-1, 0), na.rm = FALSE) data.frame(x, y) y <- txt_context(x, n = c(-1, 0), na.rm = TRUE) data.frame(x, y) library(data.table) data(brussels_reviews_anno, package = "udpipe") x <- as.data.table(brussels_reviews_anno) x <- subset(x, doc_id %in% txt_sample(unique(x$doc_id), n = 10)) x <- x[, context := txt_context(lemma), by = list(doc_id, sentence_id)] head(x, 20) x$term <- sprintf("%s/%s", x$lemma, x$upos) x <- x[, context := txt_context(term), by = list(doc_id, sentence_id)] head(x, 20)
Count the number of times a pattern occurs in text. Pattern counting is performed by executing a regular expression using gregexpr and checking how many times the regular expression occurs.
txt_count(x, pattern, ...)
x |
a character vector with text |
pattern |
a text pattern which might be contained in |
... |
other arguments, passed on to |
an integer vector of the same length as x
indicating how many times the pattern occurs in x
x <- c("abracadabra", "ababcdab", NA) txt_count(x, pattern = "ab") txt_count(x, pattern = "AB", ignore.case = TRUE) txt_count(x, pattern = "AB", ignore.case = FALSE)
Frequency statistics of elements in a vector
txt_freq(x, exclude = c(NA, NaN), order = TRUE)
x |
a vector |
exclude |
logical indicating to exclude values from the table. Defaults to NA and NaN. |
order |
logical indicating to order the resulting dataset in order of frequency. Defaults to TRUE. |
a data.frame with columns key, freq and freq_pct indicating how many times each value in the vector x occurs
x <- sample(LETTERS, 1000, replace = TRUE) txt_freq(x) x <- factor(x, levels = LETTERS) txt_freq(x, order = FALSE)
A variant of grepl which allows you to specify multiple regular expressions and to combine the results of these into one logical vector. You can specify how to combine the results of the regular expressions by specifying an aggregate function like all, any or sum.
txt_grepl( x, pattern, FUN = all, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE, ... )
x |
a character vector |
pattern |
a character vector containing one or several regular expressions |
FUN |
a function to apply to combine the results ot the different regular expressions for each element of |
ignore.case |
passed on to |
perl |
passed on to |
fixed |
passed on to |
useBytes |
passed on to |
... |
further arguments passed on to |
a logical vector with the same length as x
with the result of the call to FUN
applied elementwise to each result of grepl for each pattern
x <- c("--A--", "--B--", "--ABC--", "--AC--", "Z") txt_grepl(x, pattern = c("A", "C"), FUN = all) txt_grepl(x, pattern = c("A", "C"), FUN = any) txt_grepl(x, pattern = c("A", "C"), FUN = sum) data.frame(x = x, A_and_C = txt_grepl(x, pattern = c("A", "C"), FUN = all), A_or_C = txt_grepl(x, pattern = c("A", "C"), FUN = any), A_C_n = txt_grepl(x, pattern = c("A", "C"), FUN = sum)) txt_grepl(x, pattern = "A|C")
Highlight words in a character vector. The words provided in terms are highlighted in the text by wrapping them in the following character: |. So 'I like milk and sugar in my coffee' would give 'I like |milk| and sugar in my coffee' if you want to highlight the word milk.
txt_highlight(x, terms)
x |
a character vector with text |
terms |
a vector of words to highlight which appear in |
A character vector with the same length as x where the terms provided in terms are put in between | characters to highlight them
x <- "I like milk and sugar in my coffee." txt_highlight(x, terms = "sugar") txt_highlight(x, terms = c("milk", "my"))
Get the n-th next element of a vector
txt_next(x, n = 1)
x |
a character vector where each element is just 1 term or word |
n |
an integer indicating how far to look next. Defaults to 1. |
a character vector of the same length as x
with the next element
x <- sprintf("%s%s", LETTERS, 1:26) txt_next(x, n = 1) data.frame(word = x, word_next1 = txt_next(x, n = 1), word_next2 = txt_next(x, n = 2), stringsAsFactors = FALSE)
If you have annotated your text using udpipe_annotate, your text is tokenised into a sequence of words. Based on this vector of words in sequence, getting n-grams comes down to looking at the next word and the subsequent word, and so forth. These words can be pasted together to form an n-gram containing the current word, the next word, the subsequent word, ...
txt_nextgram(x, n = 2, sep = " ")
x |
a character vector where each element is just 1 term or word |
n |
an integer indicating the ngram. Values of 1 will keep the x, a value of 2 will append the next term to the current term, a value of 3 will append the subsequent term and the term following that term to the current term |
sep |
a character element indicating how to |
a character vector of the same length as x
with the n-grams
x <- sprintf("%s%s", LETTERS, 1:26) txt_nextgram(x, n = 2) data.frame(words = x, bigram = txt_nextgram(x, n = 2), trigram = txt_nextgram(x, n = 3, sep = "-"), quatrogram = txt_nextgram(x, n = 4, sep = ""), stringsAsFactors = FALSE) x <- c("A1", "A2", "A3", NA, "A4", "A5") data.frame(x, bigram = txt_nextgram(x, n = 2, sep = "_"), stringsAsFactors = FALSE)
Get the overlap between 2 vectors
txt_overlap(x, y)
x |
a vector |
y |
a vector |
a vector with elements of x
which are also found in y
x <- c("a", "b", "c") y <- c("b", "c", "e", "z") txt_overlap(x, y) txt_overlap(y, x)
NA-friendly version for concatenating strings
txt_paste(..., collapse = " ", na.rm = FALSE)
... |
character vectors |
collapse |
a character string to be used to paste the vectors together. Defaults to a space: ' '. |
na.rm |
logical, if set to |
a character vector
x <- c(1, 2, 3, NA, NA) y <- c("a", "b", "c", NA, "OK") paste(x, y, sep = "-") txt_paste(x, y, collapse = "-", na.rm = TRUE) txt_paste(x, y, collapse = "-", na.rm = FALSE) x <- c(NA, "a", "b") y <- c("1", "2", NA) z <- c("-", "*", NA) txt_paste(x, y, z, collapse = "", na.rm = TRUE) txt_paste(x, y, z, "_____", collapse = "", na.rm = TRUE) txt_paste(x, y, z, "_____", collapse = "", na.rm = FALSE)
Get the n-th previous element of a vector
txt_previous(x, n = 1)
x |
a character vector where each element is just 1 term or word |
n |
an integer indicating how far to look back. Defaults to 1. |
a character vector of the same length as x
with the previous element
x <- sprintf("%s%s", LETTERS, 1:26) txt_previous(x, n = 1) data.frame(word = x, word_previous1 = txt_previous(x, n = 1), word_previous2 = txt_previous(x, n = 2), stringsAsFactors = FALSE)
If you have annotated your text using udpipe_annotate, your text is tokenised into a sequence of words. Based on this vector of words in sequence, getting n-grams comes down to looking at the previous word and the word before that, and so forth. These words can be pasted together to form an n-gram containing the second previous word, the previous word, the current word, ...
txt_previousgram(x, n = 2, sep = " ")
x |
a character vector where each element is just 1 term or word |
n |
an integer indicating the ngram. Values of 1 will keep the x, a value of 2 will append the previous term to the current term, a value of 3 will append the second previous term and the previous term preceding the current term to the current term |
sep |
a character element indicating how to |
a character vector of the same length as x
with the n-grams
x <- sprintf("%s%s", LETTERS, 1:26) txt_previousgram(x, n = 2) data.frame(words = x, bigram = txt_previousgram(x, n = 2), trigram = txt_previousgram(x, n = 3, sep = "-"), quatrogram = txt_previousgram(x, n = 4, sep = ""), stringsAsFactors = FALSE) x <- c("A1", "A2", "A3", NA, "A4", "A5") data.frame(x, bigram = txt_previousgram(x, n = 2, sep = "_"), stringsAsFactors = FALSE)
Recode text to other categories.
Values of x
which correspond to from[i]
will be recoded to to[i]
txt_recode(x, from = c(), to = c(), na.rm = FALSE)
x |
a character vector |
from |
a character vector with values of |
to |
a character vector with values of you want to use to recode to where you
want to replace values of |
na.rm |
logical, if set to TRUE, will put all values of |
a character vector of the same length as x
where values of x
which are given in from
will be replaced by the corresponding element in to
x <- c("NOUN", "VERB", "NOUN", "ADV") txt_recode(x = x, from = c("VERB", "ADV"), to = c("conjugated verb", "adverb")) txt_recode(x = x, from = c("VERB", "ADV"), to = c("conjugated verb", "adverb"), na.rm = TRUE) txt_recode(x = x, from = c("VERB", "ADV", "NOUN"), to = c("conjugated verb", "adverb", "noun"), na.rm = TRUE)
Replace, in a character vector of tokens, tokens with compound multi-word expressions, so that c("New", "York") will become c("New York", NA).
txt_recode_ngram(x, compound, ngram, sep = " ")
x |
a character vector of words where you want to replace tokens with compound multi-word expressions.
This is generally a character vector as returned by the token column of |
compound |
a character vector of compound words multi-word expressions indicating terms which can be considered as one word.
For example |
ngram |
a integer vector of the same length as |
sep |
separator used when the compounds were constructed by combining the words together into a compound multi-word expression. Defaults to a space: ' '. |
the same character vector x where elements in x are replaced by compound multi-word expressions. It gives preference to replacing with compounds with higher ngrams if these occur. See the examples.
x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".") y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ") data.frame(x, y) keyw <- data.frame(keyword = c("New-York", "New-York-City"), ngram = c(2, 3)) y <- txt_recode_ngram(x, compound = keyw$keyword, ngram = keyw$ngram, sep = "-") data.frame(x, y) ## Example replacing adjectives followed by a noun with the full compound word data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, language == "nl") keyw <- keywords_phrases(x$xpos, term = x$token, pattern = "JJNN", is_regex = TRUE, detailed = FALSE) head(keyw) x$term <- txt_recode_ngram(x$token, compound = keyw$keyword, ngram = keyw$ngram) head(x[, c("token", "term", "xpos")], 12)
x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".") y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ") data.frame(x, y) keyw <- data.frame(keyword = c("New-York", "New-York-City"), ngram = c(2, 3)) y <- txt_recode_ngram(x, compound = keyw$keyword, ngram = keyw$ngram, sep = "-") data.frame(x, y) ## Example replacing adjectives followed by a noun with the full compound word data(brussels_reviews_anno) x <- subset(brussels_reviews_anno, language == "nl") keyw <- keywords_phrases(x$xpos, term = x$token, pattern = "JJNN", is_regex = TRUE, detailed = FALSE) head(keyw) x$term <- txt_recode_ngram(x$token, compound = keyw$keyword, ngram = keyw$ngram) head(x[, c("token", "term", "xpos")], 12)
Boilerplate function to sample one element from a vector.
txt_sample(x, na.exclude = TRUE, n = 1)
x |
a vector |
na.exclude |
logical indicating to remove NA values before taking a sample |
n |
integer indicating the number of items to sample from x |
one element sampled from the vector x
txt_sample(c(NA, "hello", "world", NA))
txt_sample(c(NA, "hello", "world", NA))
This function identifies words which have a positive/negative meaning, with the addition of some basic logic regarding occurrences of amplifiers/deamplifiers and negators in the neighbourhood of the word which has a positive/negative meaning.
If a negator occurs in the neighbourhood, positive becomes negative or vice versa.
If amplifiers/deamplifiers occur in the neighbourhood, the amplifier weight is added to the sentiment polarity score.
This function took inspiration from qdap::polarity but was completely re-engineered to allow calculating similar things on
a udpipe-tokenised dataset. It works at the sentence level and the negator/amplification logic cannot cross a boundary defined
by the PUNCT upos part-of-speech tag.
Note that if you prefer to build a supervised model to perform sentiment scoring you might be interested in looking at the ruimtehol R package https://github.com/bnosac/ruimtehol instead.
txt_sentiment(
  x,
  term = "lemma",
  polarity_terms,
  polarity_negators = character(),
  polarity_amplifiers = character(),
  polarity_deamplifiers = character(),
  amplifier_weight = 0.8,
  n_before = 4,
  n_after = 2,
  constrain = FALSE
)
x |
a data.frame with the columns doc_id, paragraph_id, sentence_id, upos and the column as indicated in term |
term |
a character string with the name of the column of x upon which the sentiment scoring is done (e.g. 'lemma' or 'token'). Defaults to 'lemma' |
polarity_terms |
data.frame containing terms which have positive or negative meaning. This data frame should contain the columns term and polarity where term is of type character and polarity can either be 1 or -1. |
polarity_negators |
a character vector of words which will invert the meaning of the polarity_terms, such that positive becomes negative and vice versa |
polarity_amplifiers |
a character vector of words which amplify the meaning of the polarity_terms |
polarity_deamplifiers |
a character vector of words which deamplify the meaning of the polarity_terms |
amplifier_weight |
weight which is added to the polarity score if an amplifier occurs in the neighbourhood |
n_before |
integer indicating how many words before the polarity_terms word the negators/amplifiers/deamplifiers are looked for |
n_after |
integer indicating how many words after the polarity_terms word the negators/amplifiers/deamplifiers are looked for |
constrain |
logical indicating to make sure the aggregated sentiment score is between -1 and 1 |
a list containing
data: the x data.frame with 2 columns added: polarity and sentiment_polarity.
The column polarity is just the polarity column of the polarity_terms
dataset corresponding to the polarity of the term on which the sentiment scoring is applied.
The column sentiment_polarity is the value after the amplifier/de-amplifier/negator logic has been applied.
overall: a data.frame with one row per doc_id containing the columns doc_id, sentences,
terms, sentiment_polarity, terms_positive, terms_negative, terms_negation and terms_amplification
providing the aggregate sentiment_polarity score of the dataset x
by doc_id as well as
the terminology causing the sentiment, the number of sentences and the number of non punctuation terms in the document.
x <- c("I do not like whatsoever when an R package has soo many dependencies.", "Making other people install java is annoying, as it is a really painful experience in classrooms.") ## Not run: ## Do the annotation to get the data.frame needed as input to txt_sentiment anno <- udpipe(x, "english-gum") ## End(Not run) anno <- data.frame(doc_id = c(rep("doc1", 14), rep("doc2", 18)), paragraph_id = 1, sentence_id = 1, lemma = c("I", "do", "not", "like", "whatsoever", "when", "an", "R", "package", "has", "soo", "many", "dependencies", ".", "Making", "other", "people", "install", "java", "is", "annoying", ",", "as", "it", "is", "a", "really", "painful", "experience", "in", "classrooms", "."), upos = c("PRON", "AUX", "PART", "VERB", "PRON", "SCONJ", "DET", "PROPN", "NOUN", "VERB", "ADV", "ADJ", "NOUN", "PUNCT", "VERB", "ADJ", "NOUN", "ADJ", "NOUN", "AUX", "VERB", "PUNCT", "SCONJ", "PRON", "AUX", "DET", "ADV", "ADJ", "NOUN", "ADP", "NOUN", "PUNCT"), stringsasFactors = FALSE) scores <- txt_sentiment(x = anno, term = "lemma", polarity_terms = data.frame(term = c("annoy", "like", "painful"), polarity = c(-1, 1, -1)), polarity_negators = c("not", "neither"), polarity_amplifiers = c("pretty", "many", "really", "whatsoever"), polarity_deamplifiers = c("slightly", "somewhat")) scores$overall scores$data scores <- txt_sentiment(x = anno, term = "lemma", polarity_terms = data.frame(term = c("annoy", "like", "painful"), polarity = c(-1, 1, -1)), polarity_negators = c("not", "neither"), polarity_amplifiers = c("pretty", "many", "really", "whatsoever"), polarity_deamplifiers = c("slightly", "somewhat"), constrain = TRUE, n_before = 4, n_after = 2, amplifier_weight = .8) scores$overall scores$data
x <- c("I do not like whatsoever when an R package has soo many dependencies.", "Making other people install java is annoying, as it is a really painful experience in classrooms.") ## Not run: ## Do the annotation to get the data.frame needed as input to txt_sentiment anno <- udpipe(x, "english-gum") ## End(Not run) anno <- data.frame(doc_id = c(rep("doc1", 14), rep("doc2", 18)), paragraph_id = 1, sentence_id = 1, lemma = c("I", "do", "not", "like", "whatsoever", "when", "an", "R", "package", "has", "soo", "many", "dependencies", ".", "Making", "other", "people", "install", "java", "is", "annoying", ",", "as", "it", "is", "a", "really", "painful", "experience", "in", "classrooms", "."), upos = c("PRON", "AUX", "PART", "VERB", "PRON", "SCONJ", "DET", "PROPN", "NOUN", "VERB", "ADV", "ADJ", "NOUN", "PUNCT", "VERB", "ADJ", "NOUN", "ADJ", "NOUN", "AUX", "VERB", "PUNCT", "SCONJ", "PRON", "AUX", "DET", "ADV", "ADJ", "NOUN", "ADP", "NOUN", "PUNCT"), stringsasFactors = FALSE) scores <- txt_sentiment(x = anno, term = "lemma", polarity_terms = data.frame(term = c("annoy", "like", "painful"), polarity = c(-1, 1, -1)), polarity_negators = c("not", "neither"), polarity_amplifiers = c("pretty", "many", "really", "whatsoever"), polarity_deamplifiers = c("slightly", "somewhat")) scores$overall scores$data scores <- txt_sentiment(x = anno, term = "lemma", polarity_terms = data.frame(term = c("annoy", "like", "painful"), polarity = c(-1, 1, -1)), polarity_negators = c("not", "neither"), polarity_amplifiers = c("pretty", "many", "really", "whatsoever"), polarity_deamplifiers = c("slightly", "somewhat"), constrain = TRUE, n_before = 4, n_after = 2, amplifier_weight = .8) scores$overall scores$data
Boilerplate function to cat only 1 element of a character vector.
txt_show(x)
x |
a character vector |
invisible
txt_show(c("hello \n\n\n world", "world \n\n\n hello"))
txt_show(c("hello \n\n\n world", "world \n\n\n hello"))
This function allows to identify contiguous sequences of text which have the same label or
which follow the IOB scheme.
Named Entity Recognition or Chunking frequently follows the IOB tagging scheme
where "B" means the token begins an entity, "I" means it is inside an entity,
"E" means it is the end of an entity and "O" means it is not part of an entity.
An example of such an annotation would be 'New', 'York', 'City', 'District' which can be tagged as
'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC'.
The function looks for such sequences which start with 'B-LOC' and combines all subsequent
labels of the same tagging group into 1 category. This sequence of words also gets a unique identifier such
that the terms 'New', 'York', 'City', 'District' would get the same sequence identifier.
txt_tagsequence(x, entities)
x |
a character vector of categories in the sequence of occurring (e.g. B-LOC, I-LOC, I-PER, B-PER, O, O, B-PER) |
entities |
a list of groups, where each list element contains the fields start (the label which starts a sequence, e.g. 'B-LOC') and labels (a character vector with all labels belonging to that group, e.g. c('B-LOC', 'I-LOC', 'E-LOC')).
The list name of the group defines the label that will be assigned to the entity. If entities is not provided, contiguous sequences of identical values of x are considered one group |
a list with elements entity_id
and entity
where
entity is a character vector of the same length as x
containing the entities, constructed by recoding x
to the names of entities
entity_id is an integer vector of the same length as x
containing unique identifiers identifying the compound label sequence such that
e.g. the sequence 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC' (New York City District) would get the same entity_id
identifier.
See the examples.
x <- data.frame( token = c("The", "chairman", "of", "the", "Nakitoma", "Corporation", "Donald", "Duck", "went", "skiing", "in", "the", "Niagara", "Falls"), upos = c("DET", "NOUN", "ADP", "DET", "PROPN", "PROPN", "PROPN", "PROPN", "VERB", "VERB", "ADP", "DET", "PROPN", "PROPN"), label = c("O", "O", "O", "O", "B-ORG", "I-ORG", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOCATION", "I-LOCATION"), stringsAsFactors = FALSE) x[, c("sequence_id", "group")] <- txt_tagsequence(x$upos) x ## ## Define entity groups following the IOB scheme ## and combine B-LOC I-LOC I-LOC sequences as 1 group (e.g. New York City) groups <- list( Location = list(start = "B-LOC", labels = c("B-LOC", "I-LOC", "E-LOC")), Organisation = list(start = "B-ORG", labels = c("B-ORG", "I-ORG", "E-ORG")), Person = list(start = "B-PER", labels = c("B-PER", "I-PER", "E-PER")), Misc = list(start = "B-MISC", labels = c("B-MISC", "I-MISC", "E-MISC"))) x[, c("entity_id", "entity")] <- txt_tagsequence(x$label, groups) x
x <- data.frame( token = c("The", "chairman", "of", "the", "Nakitoma", "Corporation", "Donald", "Duck", "went", "skiing", "in", "the", "Niagara", "Falls"), upos = c("DET", "NOUN", "ADP", "DET", "PROPN", "PROPN", "PROPN", "PROPN", "VERB", "VERB", "ADP", "DET", "PROPN", "PROPN"), label = c("O", "O", "O", "O", "B-ORG", "I-ORG", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOCATION", "I-LOCATION"), stringsAsFactors = FALSE) x[, c("sequence_id", "group")] <- txt_tagsequence(x$upos) x ## ## Define entity groups following the IOB scheme ## and combine B-LOC I-LOC I-LOC sequences as 1 group (e.g. New York City) groups <- list( Location = list(start = "B-LOC", labels = c("B-LOC", "I-LOC", "E-LOC")), Organisation = list(start = "B-ORG", labels = c("B-ORG", "I-ORG", "E-ORG")), Person = list(start = "B-PER", labels = c("B-PER", "I-PER", "E-PER")), Misc = list(start = "B-MISC", labels = c("B-MISC", "I-MISC", "E-MISC"))) x[, c("entity_id", "entity")] <- txt_tagsequence(x$label, groups) x
Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format
udpipe(x, object, parallel.cores = 1L, parallel.chunksize, ...)
x |
either a character vector in TIF format (each element containing the text of one document, optionally named with document identifiers), a data.frame with columns doc_id and text, or a list of tokens where the list names are the document identifiers.
All text data should be in UTF-8 encoding |
object |
either an object of class udpipe_model as returned by udpipe_load_model, the path to a model on disk, or a character string with the language of the model (as in udpipe_download_model) |
parallel.cores |
integer indicating the number of parallel cores to use to speed up the annotation. Defaults to 1 (use only 1 single thread). |
parallel.chunksize |
integer with the size of the chunks of text to be annotated in parallel. If not provided, defaults to the size of x divided by parallel.cores |
... |
other arguments passed on to udpipe_annotate |
a data.frame with one row per doc_id and term_id containing all the tokens in the data, the lemma, the part of speech tags, the morphological features and the dependency relationship along the tokens. The data.frame has the following fields:
doc_id: The document identifier.
paragraph_id: The paragraph identifier which is unique within each document.
sentence_id: The sentence identifier which is unique within each document.
sentence: The text of the sentence of the sentence_id.
start: Integer index indicating in the original text where the token starts. Missing in case of tokens part of multi-word tokens which are not in the text.
end: Integer index indicating in the original text where the token ends. Missing in case of tokens part of multi-word tokens which are not in the text.
term_id: A row identifier which is unique within the doc_id identifier.
token_id: Token index, integer starting at 1 for each new sentence. May be a range for multiword tokens or a decimal number for empty nodes.
token: The token.
lemma: The lemma of the token.
upos: The universal parts of speech tag of the token. See https://universaldependencies.org/format.html
xpos: The treebank-specific parts of speech tag of the token. See https://universaldependencies.org/format.html
feats: The morphological features of the token, separated by |. See https://universaldependencies.org/format.html
head_token_id: Indicating what is the token_id of the head of the token, indicating to which other token in the sentence it is related. See https://universaldependencies.org/format.html
dep_rel: The type of relation the token has with the head_token_id. See https://universaldependencies.org/format.html
deps: Enhanced dependency graph in the form of a list of head-deprel pairs. See https://universaldependencies.org/format.html
misc: SpacesBefore/SpacesAfter/SpacesInToken spaces before/after/inside the token. Used to reconstruct the original text. See https://ufal.mff.cuni.cz/udpipe/1/users-manual
The columns paragraph_id, sentence_id, term_id, start, end are integers, the other fields
are character data in UTF-8 encoding.
https://ufal.mff.cuni.cz/udpipe, https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364, https://universaldependencies.org/format.html
udpipe_load_model, as.data.frame.udpipe_connlu, udpipe_download_model, udpipe_annotate
model <- udpipe_download_model(language = "dutch-lassysmall")
if(!model$download_failed){
  ud_dutch <- udpipe_load_model(model)

  ## Tokenise, Tag and Dependency Parsing Annotation. Output is in CONLL-U format.
  txt <- c("Dus. Godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg en hij had nog gelijk ook. Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont, maar dat hoefden we geenseens niet te zingen. Je kunt zeggen wat je wil van al die gesluierde poezenpas maar d'r kwam wel een vleeswarenwinkel onder te voorschijn van heb je me daar nou. En zo gaat het maar door.",
           "Wat die ransaap van een academici nou weer in z'n botte pan heb gehaald mag Joost in m'n schoen gooien, maar feit staat boven water dat het een gore vieze vuile ransaap is.")
  names(txt) <- c("document_identifier_1", "we-like-ilya-leonard-pfeiffer")

  ##
  ## TIF tagging: tag if x is a character vector, a data frame or a token sequence
  ##
  x <- udpipe(txt, object = ud_dutch)
  x <- udpipe(data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE),
              object = ud_dutch)
  x <- udpipe(strsplit(txt, "[[:space:][:punct:][:digit:]]+"), object = ud_dutch)

  ## You can also directly pass on the language in the call to udpipe
  x <- udpipe("Dit werkt ook.", object = "dutch-lassysmall")
  x <- udpipe(txt, object = "dutch-lassysmall")
  x <- udpipe(data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE),
              object = "dutch-lassysmall")
  x <- udpipe(strsplit(txt, "[[:space:][:punct:][:digit:]]+"), object = "dutch-lassysmall")
}
## cleanup for CRAN only - you probably want to keep your model if you have downloaded it
if(file.exists(model$file_model)) file.remove(model$file_model)
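For a quick check of udpipe without downloading anything, the toy model that ships with the package can be used; its annotations are of poor quality, so this sketch only illustrates the call pattern:

toymodel <- udpipe_load_model(system.file(package = "udpipe", "dummydata", "toymodel.udpipe"))
x <- udpipe("Ik ging deze morgen naar de bakker brood halen.", object = toymodel)
head(x[, c("doc_id", "sentence_id", "token_id", "token", "upos", "xpos")])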
Get precision, recall and F1 measures on finding words / sentences / upos / xpos / features annotation as well as UAS and LAS dependency scores on holdout data in conllu format.
udpipe_accuracy(
  object,
  file_conllu,
  tokenizer = c("default", "none"),
  tagger = c("default", "none"),
  parser = c("default", "none")
)
object |
an object of class udpipe_model as returned by udpipe_load_model |
file_conllu |
the full path to a file on disk containing holdout data in conllu format |
tokenizer |
a character string of length 1, which is either 'default' or 'none' |
tagger |
a character string of length 1, which is either 'default' or 'none' |
parser |
a character string of length 1, which is either 'default' or 'none' |
a list with the following elements
accuracy: A character vector with accuracy metrics.
error: A character string with possible errors when calculating the accuracy metrics
https://ufal.mff.cuni.cz/udpipe, https://universaldependencies.org/format.html
model <- udpipe_download_model(language = "dutch-lassysmall")
if(!model$download_failed){
  ud_dutch <- udpipe_load_model(model$file_model)

  file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
  metrics <- udpipe_accuracy(ud_dutch, file_conllu)
  metrics$accuracy
  metrics <- udpipe_accuracy(ud_dutch, file_conllu,
                             tokenizer = "none", tagger = "default", parser = "default")
  metrics$accuracy
  metrics <- udpipe_accuracy(ud_dutch, file_conllu,
                             tokenizer = "none", tagger = "none", parser = "default")
  metrics$accuracy
  metrics <- udpipe_accuracy(ud_dutch, file_conllu,
                             tokenizer = "default", tagger = "none", parser = "none")
  metrics$accuracy
}
## cleanup for CRAN only - you probably want to keep your model if you have downloaded it
if(file.exists(model$file_model)) file.remove(model$file_model)
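The same call pattern can be tried without a download by evaluating the toy model shipped with the package on its own training file; the resulting numbers are meaningless, this sketch only shows how the metrics come back:

toymodel <- udpipe_load_model(system.file(package = "udpipe", "dummydata", "toymodel.udpipe"))
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
metrics <- udpipe_accuracy(toymodel, file_conllu)
cat(metrics$accuracy, sep = "\n")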
Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of raw text
udpipe_annotate(
  object,
  x,
  doc_id = paste("doc", seq_along(x), sep = ""),
  tokenizer = "tokenizer",
  tagger = c("default", "none"),
  parser = c("default", "none"),
  trace = FALSE,
  ...
)
object |
an object of class udpipe_model as returned by udpipe_load_model |
x |
a character vector in UTF-8 encoding where each element of the character vector contains text which you like to tokenize, tag and perform dependency parsing. |
doc_id |
an identifier of a document with the same length as x. This should be a character vector |
tokenizer |
a character string of length 1, which is either 'tokenizer' (default udpipe tokenisation)
or a character string with more complex tokenisation options
as specified in https://ufal.mff.cuni.cz/udpipe/1/users-manual, in which case the value of tokenizer is passed on to the UDPipe tokenizer as its options |
tagger |
a character string of length 1, which is either 'default' (default udpipe POS tagging and lemmatisation)
or 'none' (no POS tagging and lemmatisation needed) or a character string with more complex tagging options
as specified in https://ufal.mff.cuni.cz/udpipe/1/users-manual, in which case the value of tagger is passed on to the UDPipe tagger as its options |
parser |
a character string of length 1, which is either 'default' (default udpipe dependency parsing) or
'none' (no dependency parsing needed) or a character string with more complex parsing options
as specified in https://ufal.mff.cuni.cz/udpipe/1/users-manual, in which case the value of parser is passed on to the UDPipe parser as its options |
trace |
A non-negative integer indicating to show progress on the annotation.
If positive, it prints out a message before each trace number of elements for which the annotation is executed |
... |
currently not used |
a list with 3 elements
x: The x
character vector with text.
conllu: A character vector of length 1 containing the annotated result of the annotation flow in CONLL-U format. This format is explained at https://universaldependencies.org/format.html
error: A vector with the same length as x
containing possible errors which occurred when annotating x
https://ufal.mff.cuni.cz/udpipe, https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364, https://universaldependencies.org/format.html
udpipe_load_model, as.data.frame.udpipe_connlu
model <- udpipe_download_model(language = "dutch-lassysmall")
if(!model$download_failed){
  ud_dutch <- udpipe_load_model(model$file_model)

  ## Tokenise, Tag and Dependency Parsing Annotation. Output is in CONLL-U format.
  txt <- c("Dus. Godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg en hij had nog gelijk ook. Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont, maar dat hoefden we geenseens niet te zingen. Je kunt zeggen wat je wil van al die gesluierde poezenpas maar d'r kwam wel een vleeswarenwinkel onder te voorschijn van heb je me daar nou. En zo gaat het maar door.",
           "Wat die ransaap van een academici nou weer in z'n botte pan heb gehaald mag Joost in m'n schoen gooien, maar feit staat boven water dat het een gore vieze vuile ransaap is.")
  x <- udpipe_annotate(ud_dutch, x = txt)
  cat(x$conllu)
  as.data.frame(x)

  ## Only tokenisation
  x <- udpipe_annotate(ud_dutch, x = txt, tagger = "none", parser = "none")
  as.data.frame(x)

  ## Only tokenisation and POS tagging + lemmatisation, no dependency parsing
  x <- udpipe_annotate(ud_dutch, x = txt, tagger = "default", parser = "none")
  as.data.frame(x)

  ## Only tokenisation and dependency parsing, no POS tagging nor lemmatisation
  x <- udpipe_annotate(ud_dutch, x = txt, tagger = "none", parser = "default")
  as.data.frame(x)

  ## Provide doc_id for joining and identification purpose
  x <- udpipe_annotate(ud_dutch, x = txt, doc_id = c("id1", "feedbackabc"),
                       tagger = "none", parser = "none", trace = TRUE)
  as.data.frame(x)

  ## Mark on encodings: if your data is not in UTF-8 encoding, make sure you convert it to UTF-8
  ## This can be done using iconv as follows for example
  udpipe_annotate(ud_dutch, x = iconv('Ik drink melk bij mijn koffie.', to = "UTF-8"))
}
## cleanup for CRAN only - you probably want to keep your model if you have downloaded it
if(file.exists(model$file_model)) file.remove(model$file_model)
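If no download is possible, the same udpipe_annotate flow can be tried with the toy model shipped with the package (poor-quality annotations, call pattern only):

ud_toymodel <- udpipe_load_model(system.file(package = "udpipe", "dummydata", "toymodel.udpipe"))
x <- udpipe_annotate(ud_toymodel, x = "Ik ging deze morgen naar de bakker brood halen.")
head(as.data.frame(x))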
In order to show the settings which were used by the UDPipe community when building
the models made available when using udpipe_download_model,
the tokenizer, tagger and parser settings used for the different treebanks are shown below,
so that you can easily use these to retrain your model directly on the corresponding
UD treebank which you can download at http://universaldependencies.org/#ud-treebanks.
More information on how the models provided by the UDPipe community have been built is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364
data(udpipe_annotation_params)
str(udpipe_annotation_params)
## settings of the tokenizer
head(udpipe_annotation_params$tokenizer)
## settings of the tagger
subset(udpipe_annotation_params$tagger, language_treebank == "nl")
## settings of the parser
udpipe_annotation_params$parser
Ready-made models for 65 languages trained on 101 treebanks from https://universaldependencies.org/ are provided to you.
Some of these models were provided by the UDPipe community. Other models were built using this R package.
You can either download these models manually in order to use them for annotation purposes
or use udpipe_download_model
to download these models for a specific language of choice. You have the following options:
udpipe_download_model( language = c("afrikaans-afribooms", "ancient_greek-perseus", "ancient_greek-proiel", "arabic-padt", "armenian-armtdp", "basque-bdt", "belarusian-hse", "bulgarian-btb", "buryat-bdt", "catalan-ancora", "chinese-gsd", "chinese-gsdsimp", "classical_chinese-kyoto", "coptic-scriptorium", "croatian-set", "czech-cac", "czech-cltt", "czech-fictree", "czech-pdt", "danish-ddt", "dutch-alpino", "dutch-lassysmall", "english-ewt", "english-gum", "english-lines", "english-partut", "estonian-edt", "estonian-ewt", "finnish-ftb", "finnish-tdt", "french-gsd", "french-partut", "french-sequoia", "french-spoken", "galician-ctg", "galician-treegal", "german-gsd", "german-hdt", "gothic-proiel", "greek-gdt", "hebrew-htb", "hindi-hdtb", "hungarian-szeged", "indonesian-gsd", "irish-idt", "italian-isdt", "italian-partut", "italian-postwita", "italian-twittiro", "italian-vit", "japanese-gsd", "kazakh-ktb", "korean-gsd", "korean-kaist", "kurmanji-mg", "latin-ittb", "latin-perseus", "latin-proiel", "latvian-lvtb", "lithuanian-alksnis", "lithuanian-hse", "maltese-mudt", "marathi-ufal", "north_sami-giella", "norwegian-bokmaal", "norwegian-nynorsk", "norwegian-nynorsklia", "old_church_slavonic-proiel", "old_french-srcmf", "old_russian-torot", "persian-seraji", "polish-lfg", "polish-pdb", "polish-sz", "portuguese-bosque", "portuguese-br", "portuguese-gsd", "romanian-nonstandard", "romanian-rrt", "russian-gsd", "russian-syntagrus", "russian-taiga", "sanskrit-ufal", "scottish_gaelic-arcosg", "serbian-set", "slovak-snk", "slovenian-ssj", "slovenian-sst", "spanish-ancora", "spanish-gsd", "swedish-lines", "swedish-talbanken", "tamil-ttb", "telugu-mtg", "turkish-imst", "ukrainian-iu", "upper_sorbian-ufal", "urdu-udtb", "uyghur-udt", "vietnamese-vtb", "wolof-wtb"), model_dir = getwd(), udpipe_model_repo = c("jwijffels/udpipe.models.ud.2.5", "jwijffels/udpipe.models.ud.2.4", "jwijffels/udpipe.models.ud.2.3", "jwijffels/udpipe.models.ud.2.0", "jwijffels/udpipe.models.conll18.baseline", "bnosac/udpipe.models.ud"), overwrite = TRUE, ... )
language |
a character string with a Universal Dependencies treebank which was used to build the model. The possible values are listed in the usage section above. Each language should have a treebank extension (e.g. english-ewt, russian-syntagrus, dutch-alpino, ...). If you do not provide a treebank extension (e.g. only english, russian, dutch), the function will use the default treebank of that language as was used in Universal Dependencies up to version 2.1. |
model_dir |
a path where the model will be downloaded to. Defaults to the current working directory |
udpipe_model_repo |
location where the models will be downloaded from.
Either 'jwijffels/udpipe.models.ud.2.5', 'jwijffels/udpipe.models.ud.2.4', 'jwijffels/udpipe.models.ud.2.3', 'jwijffels/udpipe.models.ud.2.0', 'jwijffels/udpipe.models.conll18.baseline' or 'bnosac/udpipe.models.ud'.
See the Details section for further information on which languages are available in each of these repositories. |
overwrite |
logical indicating to overwrite the file if the file was already downloaded. Defaults to TRUE |
... |
currently not used |
The function allows you to download the following language models based on your setting of argument udpipe_model_repo
:
'jwijffels/udpipe.models.ud.2.5': https://github.com/jwijffels/udpipe.models.ud.2.5
UDPipe models constructed on data from Universal Dependencies 2.5
languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, korean-gsd, korean-kaist, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, portuguese-bosque, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb
license: CC-BY-SA-NC
'jwijffels/udpipe.models.ud.2.4': https://github.com/jwijffels/udpipe.models.ud.2.4
UDPipe models constructed on data from Universal Dependencies 2.4
languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, catalan-ancora, chinese-gsd, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-vit, japanese-gsd, korean-gsd, korean-kaist, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, portuguese-bosque, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb
license: CC-BY-SA-NC
'jwijffels/udpipe.models.ud.2.3': https://github.com/jwijffels/udpipe.models.ud.2.3
UDPipe models constructed on data from Universal Dependencies 2.3
languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, korean-gsd, korean-kaist, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, urdu-udtb, uyghur-udt, vietnamese-vtb
license: CC-BY-SA-NC
'jwijffels/udpipe.models.ud.2.0': https://github.com/jwijffels/udpipe.models.ud.2.0
UDPipe models constructed on data from Universal Dependencies 2.0
languages-treebanks: ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese
license: CC-BY-SA-NC
'jwijffels/udpipe.models.conll18.baseline': https://github.com/jwijffels/udpipe.models.conll18.baseline
UDPipe models constructed on data from Universal Dependencies 2.2
languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, croatian-set, czech-cac, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, mixed, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, romanian-rrt, russian-syntagrus, russian-taiga, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, swedish-lines, swedish-talbanken, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb
license: CC-BY-SA-NC
'bnosac/udpipe.models.ud': https://github.com/bnosac/udpipe.models.ud
UDPipe models constructed on data from Universal Dependencies 2.1
This repository contains models built with this R package on open data from Universal Dependencies 2.1 which allows for commercial usage. The license of these models is mostly CC-BY-SA. Visit that github repository for details on the license of the language of your choice. And contact www.bnosac.be if you need support on these models or require models tuned to your needs.
languages-treebanks: afrikaans, croatian, czech-cac, dutch, english, finnish, french-sequoia, irish, norwegian-bokmaal, persian, polish, portuguese, romanian, serbian, slovak, spanish-ancora, swedish
license: license is treebank-specific but mainly CC-BY-SA and GPL-3 and LGPL-LR
If you need to train models yourself for commercial purposes or if you want to improve models, you can easily do this with udpipe_train
which is explained in detail in the package vignette.
Note that by downloading these models, you agree to comply with the license of your specific language model.
A data.frame with 1 row and the following columns:
language: The language as provided by the input parameter language
file_model: The path to the file on disk where the model was downloaded to
url: The URL where the model was downloaded from
download_failed: A logical indicating if the download has failed or not due to internet connectivity issues
download_message: A character string with the error message in case the downloading of the model failed
https://ufal.mff.cuni.cz/udpipe, https://github.com/jwijffels/udpipe.models.ud.2.5, https://github.com/jwijffels/udpipe.models.ud.2.4, https://github.com/jwijffels/udpipe.models.ud.2.3, https://github.com/jwijffels/udpipe.models.conll18.baseline https://github.com/jwijffels/udpipe.models.ud.2.0, https://github.com/bnosac/udpipe.models.ud
## Not run: 
x <- udpipe_download_model(language = "dutch-alpino")
x <- udpipe_download_model(language = "dutch-lassysmall")
x <- udpipe_download_model(language = "russian")
x <- udpipe_download_model(language = "french")
x <- udpipe_download_model(language = "english-partut")
x <- udpipe_download_model(language = "english-ewt")
x <- udpipe_download_model(language = "german-gsd")
x <- udpipe_download_model(language = "spanish-gsd")
x <- udpipe_download_model(language = "spanish-gsd", overwrite = FALSE)

x <- udpipe_download_model(language = "dutch-alpino",
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.5")
x <- udpipe_download_model(language = "dutch-alpino",
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.4")
x <- udpipe_download_model(language = "dutch-alpino",
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.3")
x <- udpipe_download_model(language = "dutch-alpino",
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0")
x <- udpipe_download_model(language = "english", udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "dutch", udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "afrikaans", udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "spanish-ancora",
                           udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "dutch-ud-2.1-20180111.udpipe",
                           udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "english",
                           udpipe_model_repo = "jwijffels/udpipe.models.conll18.baseline")
## End(Not run)

x <- udpipe_download_model(language = "sanskrit",
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0",
                           model_dir = tempdir())
x
## cleanup for CRAN
if(file.exists(x$file_model)) file.remove(x$file_model)
Load a UDPipe model so that it can be used in udpipe_annotate
udpipe_load_model(file)
file |
full path to the model or the value returned by a call to udpipe_download_model |
An object of class udpipe_model
which is a list with 2 elements
file: The path to the model as provided by file
model: An Rcpp-generated pointer to the loaded model which can be used in udpipe_annotate
https://ufal.mff.cuni.cz/udpipe
udpipe_annotate, udpipe_download_model, udpipe_train
## Not run: 
x <- udpipe_download_model(language = "dutch-lassysmall")
x$file_model
ud_dutch <- udpipe_load_model(x$file_model)

x <- udpipe_download_model(language = "english")
x$file_model
ud_english <- udpipe_load_model(x$file_model)

x <- udpipe_download_model(language = "hebrew")
x$file_model
ud_hebrew <- udpipe_load_model(x$file_model)
## End(Not run)

x <- udpipe_download_model(language = "dutch-lassysmall", model_dir = tempdir())
x$file_model
if(!x$download_failed){
  ud_dutch <- udpipe_load_model(x$file_model)
}
## cleanup for CRAN
if(file.exists(x$file_model)) file.remove(x$file_model)
Read in a CONLL-U file as a data.frame
udpipe_read_conllu(file)
file |
a connection object or a character string with the location of the file |
a data.frame with columns doc_id, paragraph_id, sentence_id, sentence, token_id, token, lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
x <- udpipe_read_conllu(file_conllu)
head(x)
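As a small follow-up to udpipe_read_conllu, the columns of the returned data.frame can be used directly for summary statistics, for example counting the universal part-of-speech tags in the toy file shipped with the package:

file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
x <- udpipe_read_conllu(file_conllu)
table(x$upos)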
Train a UDPipe model which allows to do
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing or a combination of those.
This function allows you to build models based on data in CONLL-U format
as described at https://universaldependencies.org/format.html. At the time of writing, open data in CONLL-U
format for more than 50 languages is available at https://universaldependencies.org.
Most of these are distributed under the CC-BY-SA licence or the CC-BY-NC-SA license.
This function allows you to build annotation tagger models based on these data in CONLL-U format, giving you
your own tagger model. This is relevant if you want to tune the tagger to your needs
or if you don't want to use ready-made models provided under the CC-BY-NC-SA license as shown at udpipe_load_model
udpipe_train(
  file = file.path(getwd(), "my_annotator.udpipe"),
  files_conllu_training,
  files_conllu_holdout = character(),
  annotation_tokenizer = "default",
  annotation_tagger = "default",
  annotation_parser = "default"
)
file |
full path where the model will be saved. The model will be stored as a binary file which udpipe_load_model can handle. Defaults to 'my_annotator.udpipe' in the current working directory |
files_conllu_training |
a character vector of files in CONLL-U format used for training the model |
files_conllu_holdout |
a character vector of files in CONLL-U format used for holdout evalution of the model. This argument is optional. |
annotation_tokenizer |
a string containing options for the tokenizer. This can be either 'none' or 'default' or a list
of options as mentioned in the UDPipe manual. See the vignette vignette("udpipe-train", package = "udpipe") |
annotation_tagger |
a string containing options for the pos tagger and lemmatiser. This can be either 'none' or 'default' or a list
of options as mentioned in the UDPipe manual. See the vignette vignette("udpipe-train", package = "udpipe") |
annotation_parser |
a string containing options for the dependency parser. This can be either 'none' or 'default' or a list
of options as mentioned in the UDPipe manual. See the vignette vignette("udpipe-train", package = "udpipe") |
In order to train a model, you need to provide files which are in CONLL-U format in argument files_conllu_training
.
This can be a vector of files or just one file. If you do not have your own CONLL-U files, you can download files for your language of
choice at https://universaldependencies.org.
At the time of writing open data in CONLL-U format for 50 languages are available at https://universaldependencies.org, namely for: ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin, latvian, lithuanian, norwegian, old_church_slavonic, persian, polish, portuguese, romanian, russian, sanskrit, slovak, slovenian, spanish, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese.
A list with elements
file: The path to the model, which can be used in udpipe_load_model
annotation_tokenizer: The input argument annotation_tokenizer
annotation_tagger: The input argument annotation_tagger
annotation_parser: The input argument annotation_parser
errors: Messages from the UDPipe process indicating possible errors for example when passing the wrong arguments to the annotation_tokenizer, annotation_tagger or annotation_parser
https://ufal.mff.cuni.cz/udpipe/1/users-manual
udpipe_annotation_params, udpipe_annotate, udpipe_load_model, udpipe_accuracy
## You need to have a file on disk in CONLL-U format, taking the toy example file put in the package
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
file_conllu
cat(head(readLines(file_conllu), 3), sep = "\n")

## Not run: 
##
## This is a toy example showing how to build a model, it is not a good model whatsoever,
## because model building takes more than 5 seconds this model is saved also in
## the file at system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
##
m <- udpipe_train(file = "toymodel.udpipe", files_conllu_training = file_conllu,
  annotation_tokenizer = list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7),
  annotation_tagger = list(iterations = 1, models = 1,
     provide_xpostag = 1, provide_lemma = 0, provide_feats = 0,
     guesser_suffix_rules = 2, guesser_prefix_min_count = 2),
  annotation_parser = list(iterations = 2,
     embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0,
     embedding_form = 50, embedding_lemma = 0, embedding_deprel = 20,
     learning_rate = 0.01, learning_rate_final = 0.001, l2 = 0.5, hidden_layer = 200,
     batch_size = 10, transition_system = "projective", transition_oracle = "dynamic",
     structured_interval = 10))
## End(Not run)

file_model <- system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
ud_toymodel <- udpipe_load_model(file_model)
x <- udpipe_annotate(object = ud_toymodel, x = "Ik ging deze morgen naar de bakker brood halen.")
x <- as.data.frame(x)

##
## The above was a toy example showing how to build a model, if you want real-life scenario's
## look at the training parameter examples given below and train it on your CONLL-U file
##
## Example training arguments used for the models available at udpipe_download_model
data(udpipe_annotation_params)
head(udpipe_annotation_params$tokenizer)
head(udpipe_annotation_params$tagger)
head(udpipe_annotation_params$parser)

## Not run: 
## More details in the package vignette:
vignette("udpipe-train", package = "udpipe")
## End(Not run)
Create a unique identifier for each combination of fields in a data frame.
This unique identifier is unique for each combination of the elements of the fields.
The generated identifier is like a primary key or a secondary key on a table.
This is just a small wrapper around frank from the data.table package.
unique_identifier(x, fields, start_from = 1L)
x |
a data.frame |
fields |
a character vector of column names from x upon which the unique identifier is built |
start_from |
integer number indicating to start from that number onwards |
an integer vector of the same length as the number of rows in x
containing the unique identifier
data(brussels_reviews_anno)
x <- brussels_reviews_anno
x$doc_sent_id <- unique_identifier(x, fields = c("doc_id", "sentence_id"))
head(x, 15)
range(x$doc_sent_id)
x$doc_sent_id <- unique_identifier(x, fields = c("doc_id", "sentence_id"), start_from = 10)
head(x, 15)
range(x$doc_sent_id)
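A self-contained sketch of unique_identifier with a hypothetical toy data.frame, using the identifier as a key to count tokens per document/sentence combination:

x <- data.frame(doc_id      = c("d1", "d1", "d1", "d2", "d2"),
                sentence_id = c(1, 1, 2, 1, 1),
                token       = c("hello", "world", "bye", "hi", "there"),
                stringsAsFactors = FALSE)
x$key <- unique_identifier(x, fields = c("doc_id", "sentence_id"))
## number of tokens per doc_id/sentence_id combination
table(x$key)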
Create a data.frame from a list of tokens.
unlist_tokens(x)
x |
a list where the list elements are character vectors of tokens |
the data of x
converted to a data.frame.
This data.frame has columns doc_id and token where the doc_id is taken from the list names of x
and token contains the data of x
x <- setNames(c("some text here", "hi there understand this?"), c("a", "b")) x <- strsplit(x, split = " ") x unlist_tokens(x)
x <- setNames(c("some text here", "hi there understand this?"), c("a", "b")) x <- strsplit(x, split = " ") x unlist_tokens(x)