Package: udpipe 0.8.11

Jan Wijffels

udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for commonly used data manipulations on texts which are enriched with the output of the parser. This includes functions and algorithms for collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.
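A minimal sketch of the annotation workflow described above. It assumes that the pre-trained Dutch model can be downloaded from the default model repository, which requires network access. The example sentence and the use of `tempdir()` as model directory are illustrative choices, not part of the package documentation quoted here.

```r
library(udpipe)

# Download a pre-trained model into a temporary directory (needs network access).
dl <- udpipe_download_model(language = "dutch", model_dir = tempdir())

anno <- NULL
if (!isTRUE(dl$download_failed)) {
  model <- udpipe_load_model(file = dl$file_model)
  # Tokenise, tag, lemmatise and parse one illustrative sentence.
  res <- udpipe_annotate(model, x = "Ik ging op reis en ik nam mee: mijn laptop.")
  # Convert the annotation result to a data.frame with one row per token.
  anno <- as.data.frame(res)
  print(anno[, c("token", "lemma", "upos", "dep_rel")])
}
```

If the download fails, `anno` stays `NULL` and the annotation step is skipped.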

Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph]

udpipe_0.8.11.tar.gz
udpipe_0.8.11.zip (r-4.5), udpipe_0.8.11.zip (r-4.4), udpipe_0.8.11.zip (r-4.3)
udpipe_0.8.11.tgz (r-4.5-x86_64), udpipe_0.8.11.tgz (r-4.5-arm64), udpipe_0.8.11.tgz (r-4.4-x86_64), udpipe_0.8.11.tgz (r-4.4-arm64), udpipe_0.8.11.tgz (r-4.3-x86_64), udpipe_0.8.11.tgz (r-4.3-arm64)
udpipe_0.8.11.tar.gz (r-4.5-noble), udpipe_0.8.11.tar.gz (r-4.4-noble)
udpipe_0.8.11.tgz (r-4.4-emscripten), udpipe_0.8.11.tgz (r-4.3-emscripten)
udpipe.pdf | udpipe.html
udpipe/json (API)
NEWS

# Install 'udpipe' in R:

install.packages('udpipe', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/bnosac/udpipe/issues

Uses libs:

c++ – GNU Standard C++ Library v3

Datasets:

brussels_listings - Brussels AirBnB address locations available at www.insideairbnb.com
brussels_reviews - Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com
brussels_reviews_anno - Reviews of the AirBnB customers which are tokenised, POS tagged and lemmatised
brussels_reviews_w2v_embeddings_lemma_nl - An example matrix of word embeddings
udpipe_annotation_params - List with training options set by the UDPipe community when building models based on the Universal Dependencies data
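As a small illustration of the bundled data, the sketch below loads the pre-annotated reviews and counts universal parts of speech tags with `txt_freq`. The choice of the French subset is arbitrary and only serves as an example.

```r
library(udpipe)

# Load the annotated AirBnB reviews shipped with the package.
data(brussels_reviews_anno, package = "udpipe")

# Keep the French reviews only (illustrative choice).
x <- subset(brussels_reviews_anno, language == "fr")

# Frequency table of universal parts of speech tags.
upos_freq <- txt_freq(x$upos)
print(head(upos_freq))
```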

On CRAN:

conll dependency-parser lemmatization natural-language-processing nlp pos-tagging r-pkg rcpp text-mining tokenizer udpipe cpp

11.83 score 215 stars 9 packages 1.2k scripts 5.6k downloads 5 mentions 62 exports 4 dependencies

Last updated 2 years ago from: 6a974c52fe. Checks: 1 OK, 11 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 06 2025
R-4.5-win-x86_64	NOTE	Mar 06 2025
R-4.5-mac-x86_64	NOTE	Mar 06 2025
R-4.5-mac-aarch64	NOTE	Mar 06 2025
R-4.5-linux-x86_64	NOTE	Mar 06 2025
R-4.4-win-x86_64	NOTE	Mar 06 2025
R-4.4-mac-x86_64	NOTE	Mar 06 2025
R-4.4-mac-aarch64	NOTE	Mar 06 2025
R-4.4-linux-x86_64	NOTE	Mar 06 2025
R-4.3-win-x86_64	NOTE	Mar 06 2025
R-4.3-mac-x86_64	NOTE	Mar 06 2025
R-4.3-mac-aarch64	NOTE	Mar 06 2025

Exports: as_conllu as_cooccurrence as_fasttext as_phrasemachine as_word2vec cbind_dependencies cbind_morphological collocation cooccurrence document_term_frequencies document_term_frequencies_statistics document_term_matrix dtm_align dtm_cbind dtm_chisq dtm_colsums dtm_conform dtm_cor dtm_rbind dtm_remove_lowfreq dtm_remove_sparseterms dtm_remove_terms dtm_remove_tfidf dtm_reverse dtm_rowsums dtm_sample dtm_svd_similarity dtm_tfidf keywords_collocation keywords_phrases keywords_rake paste.data.frame phrases strsplit.data.frame txt_collapse txt_contains txt_context txt_count txt_freq txt_grepl txt_highlight txt_next txt_nextgram txt_overlap txt_paste txt_previous txt_previousgram txt_recode txt_recode_ngram txt_sample txt_sentiment txt_show txt_tagsequence udpipe udpipe_accuracy udpipe_annotate udpipe_download_model udpipe_load_model udpipe_read_conllu udpipe_train unique_identifier unlist_tokens
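Two of the exported helpers, `keywords_rake` and `cooccurrence`, can be tried on the bundled annotated reviews without downloading a model. The sketch below is one possible usage; the language subset, the selection of nouns and adjectives as relevant terms and the value of `ngram_max` are illustrative assumptions.

```r
library(udpipe)

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")

# RAKE keywords built from lemmas, treating nouns and adjectives as relevant terms.
kw <- keywords_rake(
  x = x, term = "lemma", group = "doc_id",
  relevant = x$upos %in% c("NOUN", "ADJ"),
  ngram_max = 3
)
print(head(kw))

# How often pairs of noun/adjective lemmas occur together in the same review.
cooc <- cooccurrence(
  subset(x, upos %in% c("NOUN", "ADJ")),
  group = "doc_id", term = "lemma"
)
print(head(cooc))
```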

Dependencies: data.table lattice Matrix Rcpp

UDPipe Natural Language Processing - Text Annotation

Jan Wijffels

Rendered from udpipe-annotation.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2021-06-01
Started: 2017-08-30

UDPipe Natural Language Processing - Basic Analytical Use Cases

Jan Wijffels

Rendered from udpipe-usecase-postagging-lemmatisation.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2021-06-01
Started: 2018-02-06

UDPipe Natural Language Processing - Model Building

Jan Wijffels

Rendered from udpipe-train.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2021-06-01
Started: 2017-08-31

UDPipe Natural Language Processing - Parallel

Jan Wijffels

Rendered from udpipe-parallel.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2021-06-01
Started: 2019-05-17

UDPipe Natural Language Processing - Topic Modelling Use Cases

Jan Wijffels

Rendered from udpipe-usecase-topicmodelling.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2021-06-01
Started: 2018-03-06

UDPipe Natural Language Processing - Try it out

Jan Wijffels

Rendered from udpipe-tryitout.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2020-10-09
Started: 2018-01-15

UDPipe Natural Language Processing - Universe

Jan Wijffels

Rendered from udpipe-universe.Rmd using knitr::rmarkdown on Mar 06 2025.

Last update: 2021-12-02
Started: 2020-10-09

Help page	Topics
Convert a data.frame to CONLL-U format	as_conllu
Convert a matrix to a co-occurrence data.frame	as_cooccurrence
Combine labels and text as used in fasttext	as_fasttext
Convert Parts of Speech tags to one-letter tags which can be used to identify phrases based on regular expressions	as_phrasemachine
Convert a matrix of word vectors to word2vec format	as_word2vec
Convert the result of udpipe_annotate to a tidy data frame	as.data.frame.udpipe_connlu
Convert the result of cooccurrence to a sparse matrix	as.matrix.cooccurrence
Brussels AirBnB address locations available at www.insideairbnb.com	brussels_listings
Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com	brussels_reviews
Reviews of the AirBnB customers which are tokenised, POS tagged and lemmatised	brussels_reviews_anno
An example matrix of word embeddings	brussels_reviews_w2v_embeddings_lemma_nl
Add the dependency parsing information to an annotated dataset	cbind_dependencies
Add morphological features to an annotated dataset	cbind_morphological
Create a cooccurrence data.frame	cooccurrence cooccurrence.character cooccurrence.cooccurrence cooccurrence.data.frame
Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document	document_term_frequencies document_term_frequencies.character document_term_frequencies.data.frame
Add Term Frequency, Inverse Document Frequency and Okapi BM25 statistics to the output of document_term_frequencies	document_term_frequencies_statistics
Create a document/term matrix	document_term_matrix document_term_matrix.data.frame document_term_matrix.default document_term_matrix.DocumentTermMatrix document_term_matrix.integer document_term_matrix.matrix document_term_matrix.numeric document_term_matrix.simple_triplet_matrix document_term_matrix.TermDocumentMatrix
Reorder a Document-Term-Matrix alongside a vector or data.frame	dtm_align
Combine 2 document term matrices either by rows or by columns	dtm_bind dtm_cbind dtm_rbind
Compare term usage across 2 document groups using the Chi-square Test for Count Data	dtm_chisq
Column sums and Row sums for document term matrices	dtm_colsums dtm_rowsums
Make sure a document term matrix has exactly the specified rows and columns	dtm_conform
Pearson Correlation for Sparse Matrices	dtm_cor
Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms	dtm_remove_lowfreq
Remove terms with high sparsity from a Document-Term-Matrix	dtm_remove_sparseterms
Remove terms from a Document-Term-Matrix and keep only documents which have at least some terms	dtm_remove_terms
Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency	dtm_remove_tfidf
Inverse operation of the document_term_matrix function	dtm_reverse
Random samples and permutations from a Document-Term-Matrix	dtm_sample
Semantic Similarity to a Singular Value Decomposition	dtm_svd_similarity
Term Frequency - Inverse Document Frequency calculation	dtm_tfidf
Extract collocations - a sequence of terms which follow each other	collocation keywords_collocation
Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags	keywords_phrases phrases
Keyword identification using Rapid Automatic Keyword Extraction (RAKE)	keywords_rake
Concatenate text of each group of data together	paste.data.frame
Predict method for an object of class LDA_VEM or class LDA_Gibbs	predict.LDA predict.LDA_Gibbs predict.LDA_VEM
Obtain a tokenised data frame by splitting text alongside a regular expression	strsplit.data.frame
Experimental and undocumented querying of syntax patterns	syntaxpatterns syntaxpatterns-class
Experimental and undocumented querying of syntax relationships	&,logical,syntaxrelation-method &,syntaxrelation,logical-method syntaxrelation syntaxrelation-class \|,logical,syntaxrelation-method \|,syntaxrelation,logical-method
Collapse a character vector while removing missing data.	txt_collapse
Check if text contains a certain pattern	txt_contains
Based on a vector with a word sequence, get n-grams (looking forward + backward)	txt_context
Count the number of times a pattern is occurring in text	txt_count
Frequency statistics of elements in a vector	txt_freq
Look up multiple patterns and indicate their presence in text	txt_grepl
Highlight words in a character vector	txt_highlight
Get the n-th next element of a vector	txt_next
Based on a vector with a word sequence, get n-grams (looking forward)	txt_nextgram
Get the overlap between 2 vectors	txt_overlap
Concatenate strings with options how to handle missing data	txt_paste
Get the n-th previous element of a vector	txt_previous
Based on a vector with a word sequence, get n-grams (looking backward)	txt_previousgram
Recode text to other categories	txt_recode
Recode words with compound multi-word expressions	txt_recode_ngram
Boilerplate function to sample one element from a vector.	txt_sample
Perform dictionary-based sentiment analysis on a tokenised data frame	txt_sentiment
Boilerplate function to cat only 1 element of a character vector.	txt_show
Identify a contiguous sequence of tags as being 1 entity	txt_tagsequence
Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format	udpipe
Evaluate the accuracy of your UDPipe model on holdout data	udpipe_accuracy
Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of raw text	udpipe_annotate
List with training options set by the UDPipe community when building models based on the Universal Dependencies data	udpipe_annotation_params
Download a UDPipe model provided by the UDPipe community for a specific language of choice	udpipe_download_model
Load an UDPipe model	udpipe_load_model
Read in a CONLL-U file as a data.frame	udpipe_read_conllu
Train a UDPipe model	udpipe_train
Create a unique identifier for each combination of fields in a data frame	unique_identifier
Create a data.frame from a list of tokens	unlist_tokens
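The document term matrix helpers listed above are typically chained. The following is a minimal sketch on the bundled annotated reviews, assuming nouns are the terms of interest; the `minfreq` threshold is an arbitrary illustrative value.

```r
library(udpipe)

data(brussels_reviews_anno, package = "udpipe")

# Keep noun lemmas only (illustrative choice of terms).
x <- subset(brussels_reviews_anno, upos == "NOUN")

# Count how many times each lemma occurs per document.
dtf <- document_term_frequencies(x[, c("doc_id", "lemma")])

# Build a sparse document/term matrix and drop rarely used terms.
dtm <- document_term_matrix(dtf)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)

# One tf-idf value per remaining term.
tfidf <- dtm_tfidf(dtm)
print(head(sort(tfidf, decreasing = TRUE)))
```

The tf-idf vector can then be used to decide which terms to keep, for instance before fitting a topic model.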
