Title: Conditional Random Fields for Labelling Sequential Data in Natural Language Processing
Description: Wraps the 'CRFsuite' library <https://github.com/chokkan/crfsuite> allowing users to fit a Conditional Random Field model and to apply it on existing data. The focus of the implementation is in the area of Natural Language Processing where this R package allows you to easily build and apply models for named entity recognition, text chunking, part of speech tagging, intent recognition or classification of any category you have in mind. Next to training, a small web application is included in the package to allow you to easily construct training data.
Authors: Jan Wijffels [aut, cre, cph] (R wrapper), BNOSAC [cph] (R wrapper), Naoaki Okazaki [aut, ctb, cph] (CRFsuite library (BSD licensed), libLBFGS library (MIT licensed), Constant Quark Database software (BSD licensed)), Bob Jenkins [aut, ctb] (File src/cqdb/src/lookup3.c (Public Domain)), Jorge Nocedal [aut, ctb, cph] (libLBFGS library (MIT licensed)), Jesse Long [aut, ctb, cph] (RumAVL library (MIT licensed))
Maintainer: Jan Wijffels <[email protected]>
License: BSD_3_clause + file LICENSE
Version: 0.4.2
Built: 2024-11-06 04:42:30 UTC
Source: https://github.com/bnosac/crfsuite
The data contains 500 reviews in Dutch of people who visited an AirBnB apartment in Brussels.
The data frame contains the fields:
doc_id: a unique identifier of the review
listing_id: the AirBnB address identifier
text: text with the feedback of a customer on his visit to the AirBnB apartment
Source: http://insideairbnb.com/brussels, information of 2015-10-03
data(airbnb)
str(airbnb)
head(airbnb)
The airbnb dataset was manually annotated with the shiny app inside this R package.
The annotation shows chunks of text which have been flagged with the following categories: PERSON, LOCATION, DISTANCE.
The dataset is an object of class chunkrange and of type data.frame which contains the following fields:
doc_id: a unique identifier of the review, which is also available in airbnb
listing_id: the AirBnB address identifier
text: text with the feedback of a customer on his visit to the AirBnB apartment
chunk_id: a chunk identifier
chunk_entity: a chunk entity label
chunk: the text of the chunk, which is a substring of text
start: the starting position in text where the chunk is found
end: the end position in text where the chunk is found
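As a quick illustration of how start and end relate to chunk, the following is a minimal sketch; it assumes start and end are 1-based inclusive character positions, as used by substr():

data(airbnb_chunks)
## the chunk text should equal the substring of text delimited by start/end
## (assuming 1-based inclusive positions)
with(head(airbnb_chunks), substr(text, start, end))
head(airbnb_chunks$chunk)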
data(airbnb_chunks)
str(airbnb_chunks)
head(airbnb_chunks)
If you have a model built with CRFsuite, either by this R package or by another software library which wraps CRFsuite (e.g. Python/Java), you can convert it to an object of class crf which this package can use to inspect the model and to use it for prediction (provided you can mimic the way the attributes are created). This is for expert use only; a usage sketch is shown below.
as.crf(file, ...)
file: the path to a file on disk containing the CRFsuite model
...: other arguments which can be set except the path to the file, namely method, type, options, attribute_names and log (expert use only)
an object of class crf
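A minimal usage sketch; the model path below is hypothetical and stands in for a model file produced by CRFsuite in R, Python or Java:

library(crfsuite)
## hypothetical path to a model file trained with CRFsuite
model <- as.crf(file = "path/to/model.crfsuite")
model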
Fits a Linear-chain (first-order Markov) CRF on the provided label sequence and saves it on disk in order to do sequence labelling.
crf(
  x,
  y,
  group,
  method = c("lbfgs", "l2sgd", "averaged-perceptron", "passive-aggressive", "arow"),
  options = crf_options(method)$default,
  embeddings,
  file = "annotator.crfsuite",
  trace = FALSE,
  FUN = identity,
  ...
)
x: a character matrix of data containing attributes about the label sequence
y: a character vector with the sequence of labels to model
group: an integer or character vector of the same length as y, indicating the group (e.g. a document or sentence identifier) each element of the sequence belongs to
method: character string with the type of training method. Either one of:
  lbfgs: L-BFGS with L1/L2 regularization
  l2sgd: SGD with L2-regularization
  averaged-perceptron: Averaged Perceptron
  passive-aggressive: Passive Aggressive
  arow: Adaptive Regularization of Weights (AROW)
options: a list of options to provide to the training algorithm. See crf_options for the possible options per method.
embeddings: a matrix with the same number of rows as x and in the same order, containing numeric information used next to the attributes in x
file: a character string with the path to the file on disk where the CRF model will be stored
trace: a logical indicating to show the trace of the training output. Defaults to FALSE.
FUN: a function which can be applied on raw text in order to obtain the attribute matrix used in predict.crf. Currently not used yet.
...: arguments to FUN. Currently not used yet.
an object of class crf which is a list with elements:
method: The training method
type: The type of graphical model, which is always set to crf1d: Linear-chain (first-order Markov) CRF
labels: The training labels
options: A data.frame with the training options provided to the algorithm
file_model: The path where the CRF model is stored
attribute_names: The column names of x
log: The training log of the algorithm
FUN: The argument passed on to FUN
ldots: A list with the arguments passed on to ...
More details about this model are available at http://www.chokkan.org/software/crfsuite/.
## Download modeldata (conll 2002 shared task in Dutch)
x <- ner_download_modeldata("conll2002-nl")
## for CRAN only - work on a subset of the data
x <- ner_download_modeldata("conll2002-nl", docs = 10)
if(is.data.frame(x)){
  ##
  ## Build Named Entity Recognition model on conll2002-nl
  ##
  x$pos   <- txt_sprintf("Parts of Speech: %s", x$pos)
  x$token <- txt_sprintf("Token: %s", x$token)
  crf_train <- subset(x, data == "ned.train")
  crf_test  <- subset(x, data == "testa")
  model <- crf(y = crf_train$label,
               x = crf_train[, c("token", "pos")],
               group = crf_train$doc_id,
               method = "lbfgs",
               options = list(max_iterations = 3, feature.minfreq = 5, c1 = 0, c2 = 1))
  model
  weights <- coefficients(model)
  head(weights$states, n = 20)
  head(weights$transitions, n = 20)
  stats <- summary(model, "modeldetails.txt")
  stats
  plot(stats$iterations$loss)

  ## Use the CRF model to label a sequence
  scores <- predict(model,
                    newdata = crf_test[, c("token", "pos")],
                    group = crf_test$doc_id)
  head(scores)
  crf_test$label <- scores$label

  ## cleanup for CRAN
  if(file.exists(model$file_model)) file.remove(model$file_model)
  if(file.exists("modeldetails.txt")) file.remove("modeldetails.txt")
}

##
## More detailed example where text data was annotated with the webapp in the package.
## This data is joined with a tokenised dataset to construct the training data which
## is further enriched with attributes of upos/lemma in the neighbourhood.
##
library(udpipe)
data(airbnb_chunks, package = "crfsuite")
udmodel <- udpipe_download_model("dutch-lassysmall")
if(!udmodel$download_failed){
  udmodel <- udpipe_load_model(udmodel$file_model)
  airbnb_tokens <- udpipe(x = unique(airbnb_chunks[, c("doc_id", "text")]),
                          object = udmodel)
  x <- merge(airbnb_chunks, airbnb_tokens)
  x <- crf_cbind_attributes(x, terms = c("upos", "lemma"), by = "doc_id")
  model <- crf(y = x$chunk_entity,
               x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
               group = x$doc_id,
               method = "lbfgs",
               options = list(max_iterations = 5))
  stats <- summary(model)
  stats
  plot(stats$iterations$loss, type = "b", xlab = "Iteration", ylab = "Loss")
  scores <- predict(model,
                    newdata = x[, grep("upos|lemma", colnames(x))],
                    group = x$doc_id)
  head(scores)
}
The object crf_caretmethod contains functionality to tune a crf model using caret. Each list element of crf_caretmethod is a list of functions which can be passed on to the method argument of caret::train to tune the hyperparameters of the crfsuite model.
crf_caretmethod
see details
If you want to tune the hyperparameters of a crfsuite model (see crf_options and the options argument of crf), you can use the caret package. In order to facilitate this tuning, an object called crf_caretmethod has been made available. The object crf_caretmethod is a list with 6 elements, where each of these 6 elements can be used to tune the CRF hyperparameters by passing it on to the method argument of the train function of the caret package. The list has elements 'default', 'lbfgs', 'l2sgd', 'averaged_perceptron', 'passive_aggressive' and 'arow'. Each list element corresponds to the arguments that you need to tune for each method as used in crf. A hypothetical usage sketch is given after the element descriptions below.
For crf_caretmethod, the elements are:
lbfgs: Tuning across all hyperparameters for method lbfgs: L-BFGS with L1/L2 regularization
l2sgd: Tuning across all hyperparameters for method l2sgd: SGD with L2-regularization
averaged_perceptron: Tuning across all hyperparameters for method averaged-perceptron: Averaged Perceptron
passive_aggressive: Tuning across all hyperparameters for method passive-aggressive: Passive Aggressive
arow: Tuning across all hyperparameters for method arow: Adaptive Regularization of Weights (AROW)
default: Tune over the hyperparameters feature.minfreq, feature.possible_states, feature.possible_transitions, max_iterations. While tuning these, it uses the default hyperparameters for each method. This tuning allows you to compare the 5 methods.
For details on the hyperparameter definitions: see crf_options
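A hypothetical sketch of such a tuning run follows; it assumes training data prepared as in the crf() examples above. Note that how the sequence identifier (the group argument of crf) is handled during caret's resampling is not documented on this help page, so inspect the fit element of the chosen list element (e.g. crf_caretmethod$default$fit) before relying on this sketch:

library(caret)
library(crfsuite)
x <- ner_download_modeldata("conll2002-nl", docs = 10)
x <- crf_cbind_attributes(x, terms = c("token", "pos"),
                          by = c("doc_id", "sentence_id"))
attributes <- grep("token|pos", colnames(x), value = TRUE)
## compare the 5 methods using their default hyperparameters;
## the resampling setup and tuneLength shown here are illustrative assumptions
tuned <- train(x = x[, attributes], y = factor(x$label),
               method = crf_caretmethod$default,
               trControl = trainControl(method = "cv", number = 2),
               tuneLength = 2)
tuned$bestTune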
The CRF attributes which are implemented in this function are merely the neighbouring information of a certain field. For example the previous word, the next word, the combination of the previous 2 words. This function cbinds these neighbouring attributes as columns to the provided data.frame.
By default it adds the following columns to the data.frame:
the term itself (term[t])
the next term (term[t+1])
the term after that (term[t+2])
the previous term (term[t-1])
the term before the previous term (term[t-2])
as well as all combinations of these terms (bigrams/trigrams/...) where up to ngram_max terms are combined.
See the examples.
crf_cbind_attributes(data, terms, by, from = -2, to = 2, ngram_max = 3, sep = "-")
data: a data.frame which will be coerced to a data.table (cbinding will be done by reference on the existing data.frame)
terms: a character vector of column names which are part of data, for which the neighbouring attributes will be constructed
by: a character vector of column names which are part of data, indicating the fields which define the sequence (e.g. a document or sentence identifier); neighbouring terms are looked up within these groups
from: integer, by default set to -2, indicating to look up to 2 terms before the current term
to: integer, by default set to 2, indicating to look up to 2 terms after the current term
ngram_max: integer indicating the maximum number of terms to combine (2 means bigrams, 3 trigrams, ...)
sep: character indicating how to combine the previous/next/current terms. Defaults to '-'.
x <- data.frame(doc_id = sort(sample.int(n = 10, size = 1000, replace = TRUE)))
x$pos <- sample(c("Art", "N", "Prep", "V", "Adv", "Adj", "Conj",
                  "Punc", "Num", "Pron", "Int", "Misc"),
                size = nrow(x), replace = TRUE)
x <- crf_cbind_attributes(x, terms = "pos", by = "doc_id",
                          from = -1, to = 1, ngram_max = 3)
head(x)

## Example on some real data
x <- ner_download_modeldata("conll2002-nl")
x <- crf_cbind_attributes(x, terms = c("token", "pos"),
                          by = c("doc_id", "sentence_id"),
                          ngram_max = 3, sep = "|")
The accuracy, precision, recall, specificity, F1 measure and support metrics are provided for each label in a one-versus-the-rest setting.
crf_evaluation(
  pred,
  obs,
  labels = na.exclude(unique(c(as.character(pred), as.character(obs)))),
  labels_overall = setdiff(labels, "O")
)
pred: a factor with predictions
obs: a factor with gold labels
labels: a character vector of possible values that pred and obs can take. Defaults to the values found in the data.
labels_overall: a character vector of labels, either the same as labels or a subset of it, which is used to compute the overall metrics. Defaults to all values of labels except the 'O' label.
a list with 2 elements:
bylabel: data.frame with the accuracy, precision, recall, specificity, F1 score and support (number of occurrences) for each label
overall: a vector containing
  the overall accuracy
  the metrics precision, recall, specificity and F1 score which are weighted averages of these metrics from list element bylabel, where the weight is the support
  the metrics precision, recall, specificity and F1 score which are averages of these metrics from list element bylabel, giving equal weight to each label
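To make the difference between the two averages concrete, here is a minimal sketch that recomputes them from the bylabel element; it mirrors the random-label example below, and the column names precision and support are assumed to match the metrics listed above:

pred  <- sample(LETTERS, 1000, replace = TRUE)
gold  <- sample(LETTERS, 1000, replace = TRUE)
stats <- crf_evaluation(pred = pred, obs = gold, labels = LETTERS)
## support-weighted average of the per-label precision
with(stats$bylabel, sum(precision * support, na.rm = TRUE) / sum(support))
## macro average: each label gets equal weight
mean(stats$bylabel$precision, na.rm = TRUE)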
pred <- sample(LETTERS, 1000, replace = TRUE)
gold <- sample(LETTERS, 1000, replace = TRUE)
crf_evaluation(pred = pred, obs = gold, labels = LETTERS)

x <- ner_download_modeldata("conll2002-nl")
x <- crf_cbind_attributes(x, terms = c("token", "pos"),
                          by = c("doc_id", "sentence_id"))
crf_train <- subset(x, data == "ned.train")
crf_test  <- subset(x, data == "testa")
attributes <- grep("token|pos", colnames(x), value = TRUE)
model <- crf(y = crf_train$label,
             x = crf_train[, attributes],
             group = crf_train$doc_id,
             method = "lbfgs")

## Use the model to score on existing tokenised data
scores <- predict(model,
                  newdata = crf_test[, attributes],
                  group = crf_test$doc_id)
crf_evaluation(pred = scores$label, obs = crf_test$label)
crf_evaluation(pred = scores$label, obs = crf_test$label,
               labels = c("O",
                          "B-ORG", "I-ORG", "B-PER", "I-PER",
                          "B-LOC", "I-LOC", "B-MISC", "I-MISC"))

library(udpipe)
pred <- txt_recode(scores$label,
                   from = c("B-ORG", "I-ORG", "B-PER", "I-PER",
                            "B-LOC", "I-LOC", "B-MISC", "I-MISC"),
                   to   = c("ORG", "ORG", "PER", "PER",
                            "LOC", "LOC", "MISC", "MISC"))
obs <- txt_recode(crf_test$label,
                  from = c("B-ORG", "I-ORG", "B-PER", "I-PER",
                           "B-LOC", "I-LOC", "B-MISC", "I-MISC"),
                  to   = c("ORG", "ORG", "PER", "PER",
                           "LOC", "LOC", "MISC", "MISC"))
crf_evaluation(pred = pred, obs = obs,
               labels = c("ORG", "LOC", "PER", "MISC", "O"))
Conditional Random Fields parameters
crf_options(
  method = c("lbfgs", "l2sgd", "averaged-perceptron", "passive-aggressive", "arow")
)
method: character string with the type of training method. Either one of:
  lbfgs: L-BFGS with L1/L2 regularization
  l2sgd: SGD with L2-regularization
  averaged-perceptron: Averaged Perceptron
  passive-aggressive: Passive Aggressive
  arow: Adaptive Regularization of Weights (AROW)
a list with elements:
method: The training method
type: The type of graphical model, which is always set to crf1d: Linear-chain (first-order Markov) CRF
params: A data.frame with fields arg, arg_default and description indicating the possible hyperparameters of the algorithm, the default values and the description
default: A list of default values which can be used to pass on to the options argument of crf
# L-BFGS with L1/L2 regularization
opts <- crf_options("lbfgs")
str(opts)
# SGD with L2-regularization
crf_options("l2sgd")
# Averaged Perceptron
crf_options("averaged-perceptron")
# Passive Aggressive
crf_options("passive-aggressive")
# Adaptive Regularization of Weights (AROW)
crf_options("arow")
Chunks annotated with the shiny app in this R package indicate, for a chunk of text of a document, the entity that it belongs to. As text chunks can contain several words, we need a way to add this chunk category to each word of a tokenised dataset. That is what this function does. If you have a tokenised data.frame with one row per token/document which indicates the start and end position where the token is found in the text of the document, this function allows you to assign the chunk label to each token of the document.
## S3 method for class 'chunkrange'
merge(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)
x: an object of class chunkrange
y: a tokenised data.frame containing one row per doc_id/token. It should have the columns doc_id, start and end, where start and end indicate the position of the token in the text of the document.
by.x: a character string of a column of x which is used to merge with by.y. Defaults to 'doc_id'.
by.y: a character string of a column of y which is used to merge with by.x. Defaults to 'doc_id'.
default_entity: character string with the default chunk entity to assign to tokens which are not part of any chunk. Defaults to 'O'.
...: not used
the data.frame y where 2 columns are added, namely:
chunk_entity: The chunk entity of the token if the token is inside the chunk defined in x. If the token is not part of any chunk, the chunk category will be set to the default value.
chunk_id: The chunk identifier of the chunk for which the token is inside the chunk.
library(udpipe)
udmodel <- udpipe_download_model("dutch-lassysmall")
if(packageVersion("udpipe") >= "0.7"){
  data(airbnb_chunks, package = "crfsuite")
  airbnb_chunks <- head(airbnb_chunks, 20)
  airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
  airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)
  head(airbnb_tokens)
  head(airbnb_chunks)

  ## Add the entity of the chunk to the tokenised dataset
  x <- merge(airbnb_chunks, airbnb_tokens)
  x[, c("doc_id", "token", "chunk_entity")]
  table(x$chunk_entity)
}
## cleanup for CRAN
file.remove(udmodel$file_model)
Download training data for doing Named Entity Recognition (NER)
ner_download_modeldata(
  type = c("conll2002-nl", "conll2002-es", "GermanNER",
           "wikiner-de-wp2", "wikiner-de-wp3", "wikiner-en-wp2", "wikiner-en-wp3",
           "wikiner-es-wp2", "wikiner-es-wp3", "wikiner-fr-wp2", "wikiner-fr-wp3",
           "wikiner-it-wp2", "wikiner-it-wp3", "wikiner-nl-wp2", "wikiner-nl-wp3",
           "wikiner-pl-wp3", "wikiner-pt-wp3", "wikiner-ru-wp2", "wikiner-ru-wp3"),
  docs = -Inf
)
type: a character string with the type of data to download. See the function usage for all possible values. These data will be downloaded from the external repositories hosting them (e.g. the NLTK data repository for the conll2002 data). Please visit the information on these repositories first before you use these data in any commercial product.
docs: integer indicating how many documents to sample from the data (only used for data from the NLTK repository). This is only used to reduce CRAN R CMD check training time in the examples of this R package.
a data.frame with training data for a Named Entity Recognition task, or an object of class try-error in case downloading the data failed
## Not run:
x <- ner_download_modeldata("conll2002-nl")
x <- ner_download_modeldata("conll2002-es")
x <- ner_download_modeldata("GermanNER")
x <- ner_download_modeldata("wikiner-en-wp2")
x <- ner_download_modeldata("wikiner-nl-wp3")
x <- ner_download_modeldata("wikiner-fr-wp3")
## End(Not run)

## reduce number of docs
x <- ner_download_modeldata("conll2002-es", docs = 10)
Predict the label sequence based on the Conditional Random Field
## S3 method for class 'crf'
predict(
  object,
  newdata,
  embeddings,
  group,
  type = c("marginal", "sequence"),
  trace = FALSE,
  ...
)
object: an object of class crf as returned by crf
newdata: a character matrix of data containing attributes about the label sequence
embeddings: a matrix with the same number of rows as newdata and in the same order, containing numeric information used next to the attributes in newdata
group: an integer or character vector of the same length as the number of rows of newdata, indicating the group (e.g. a document or sentence identifier) each row of the sequence belongs to
type: either 'marginal' or 'sequence' to get predictions at the level of the rows of newdata or at the level of the sequence group
trace: a logical indicating to show the trace of the labelling output. Defaults to FALSE.
...: not used
If type is 'marginal': a data.frame with columns label and marginal containing the viterbi decoded predicted label and marginal probability.
If type is 'sequence': a data.frame with columns group and probability containing for each sequence group the probability of the sequence.
library(udpipe)
data(airbnb_chunks, package = "crfsuite")
udmodel <- udpipe_download_model("dutch-lassysmall")
udmodel <- udpipe_load_model(udmodel$file_model)
airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
airbnb_tokens <- udpipe_annotate(udmodel,
                                 x = airbnb_tokens$text,
                                 doc_id = airbnb_tokens$doc_id)
airbnb_tokens <- as.data.frame(airbnb_tokens)
x <- merge(airbnb_chunks, airbnb_tokens)
x <- crf_cbind_attributes(x, terms = c("upos", "lemma"), by = "doc_id")
model <- crf(y = x$chunk_entity,
             x = x[, grep("upos|lemma", colnames(x))],
             group = x$doc_id,
             method = "lbfgs", options = list(max_iterations = 5))
scores <- predict(model,
                  newdata = x[, grep("upos|lemma", colnames(x))],
                  group = x$doc_id, type = "marginal")
head(scores)
scores <- predict(model,
                  newdata = x[, grep("upos|lemma", colnames(x))],
                  group = x$doc_id, type = "sequence")
head(scores)

## cleanup for CRAN
if(file.exists(model$file_model)) file.remove(model$file_model)
if(file.exists("modeldetails.txt")) file.remove("modeldetails.txt")
file.remove(udmodel$file)
Extract basic text features which are useful for entity recognition
txt_feature(
  x,
  type = c("is_capitalised", "is_url", "is_email", "is_number", "prefix", "suffix", "shape"),
  n = 4
)
x: a character vector
type: a character string, which can be one of 'is_capitalised', 'is_url', 'is_email', 'is_number', 'prefix', 'suffix', 'shape'
n: for type 'prefix' or 'suffix', the number of characters of the prefix/suffix
For type 'is_capitalised', 'is_url', 'is_email', 'is_number': a logical vector of the same length as x, indicating if x is capitalised, a url, an email or a number
For type 'prefix', 'suffix': a character vector of the same length as x, containing the first or last n characters of x
For type 'shape': a character vector of the same length as x, where lowercased elements are replaced with x and uppercased elements with X
txt_feature("Red Devils", type = "is_capitalised") txt_feature("red devils", type = "is_capitalised") txt_feature("http://www.bnosac.be", type = "is_url") txt_feature("[email protected]", type = "is_email") txt_feature("hi there", type = "is_email") txt_feature("1230000", type = "is_number") txt_feature("123.15", type = "is_number") txt_feature("123,15", type = "is_number") txt_feature("123abc", type = "is_number") txt_feature("abcdefghijklmnopqrstuvwxyz", type = "prefix", n = 3) txt_feature("abcdefghijklmnopqrstuvwxyz", type = "suffix", n = 3) txt_feature("Red Devils", type = "shape") txt_feature("red devils", type = "shape")
txt_feature("Red Devils", type = "is_capitalised") txt_feature("red devils", type = "is_capitalised") txt_feature("http://www.bnosac.be", type = "is_url") txt_feature("[email protected]", type = "is_email") txt_feature("hi there", type = "is_email") txt_feature("1230000", type = "is_number") txt_feature("123.15", type = "is_number") txt_feature("123,15", type = "is_number") txt_feature("123abc", type = "is_number") txt_feature("abcdefghijklmnopqrstuvwxyz", type = "prefix", n = 3) txt_feature("abcdefghijklmnopqrstuvwxyz", type = "suffix", n = 3) txt_feature("Red Devils", type = "shape") txt_feature("red devils", type = "shape")
NA-friendly version of sprintf. Does the same as the function sprintf, except that if NA values are passed in ..., NA values are returned instead of being replaced by the character string 'NA'.
txt_sprintf(fmt, ...)
fmt: a character vector of format strings, which will be fed on to sprintf
...: values to be passed into fmt, which will be fed on to sprintf
The same as what sprintf returns: a character vector of the length of the longest input in .... Except, in case any of the values passed on to ... are NA, the corresponding returned value will be set to NA for that element of the vector. See the examples to see the difference with sprintf.
sprintf("(w-1):%s", c("xyz", NA, "abc")) txt_sprintf("(w-1):%s", c("xyz", NA, "abc")) sprintf("(w-1):%s_%s", c("xyz", NA, "abc"), c(NA, "xyz", "abc")) txt_sprintf("(w-1):%s_%s", c("xyz", NA, "abc"), c(NA, "xyz", "abc"))
sprintf("(w-1):%s", c("xyz", NA, "abc")) txt_sprintf("(w-1):%s", c("xyz", NA, "abc")) sprintf("(w-1):%s_%s", c("xyz", NA, "abc"), c(NA, "xyz", "abc")) txt_sprintf("(w-1):%s_%s", c("xyz", NA, "abc"), c(NA, "xyz", "abc"))