Package: udpipe 0.8.11

Jan Wijffels

udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Besides text parsing, the package also allows you to train annotation models on 'treebank' data in 'CoNLL-U' format, as described at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functions for commonly used data manipulations on texts which are enriched with the output of the parser. These include functions and algorithms for collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.

Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph]

udpipe_0.8.11.tar.gz
udpipe_0.8.11.zip (r-4.5), udpipe_0.8.11.zip (r-4.4), udpipe_0.8.11.zip (r-4.3)
udpipe_0.8.11.tgz (r-4.4-x86_64), udpipe_0.8.11.tgz (r-4.4-arm64), udpipe_0.8.11.tgz (r-4.3-x86_64), udpipe_0.8.11.tgz (r-4.3-arm64)
udpipe_0.8.11.tar.gz (r-4.5-noble), udpipe_0.8.11.tar.gz (r-4.4-noble)
udpipe_0.8.11.tgz (r-4.4-emscripten), udpipe_0.8.11.tgz (r-4.3-emscripten)
udpipe.pdf | udpipe.html
udpipe/json (API)
NEWS

# Install 'udpipe' in R:
install.packages('udpipe', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))
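
Once the package is installed, the document/term helpers can be tried without downloading any annotation model. The following is a minimal sketch rather than an excerpt from the package documentation: the `tokens` data frame is made-up illustration data standing in for the output of an annotation step, and its column names are assumptions chosen for this example.

```r
library(udpipe)

# Hypothetical toy data: one row per token, as an annotation step might produce.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  lemma  = c("cat", "sit", "cat", "dog", "sit"),
  stringsAsFactors = FALSE
)

# Count how often each lemma occurs in each document.
dtf <- document_term_frequencies(tokens, document = "doc_id", term = "lemma")

# Turn the counts into a sparse document/term matrix (documents in rows).
dtm <- document_term_matrix(dtf)

# Overall frequency table of the lemmas across all documents.
freq <- txt_freq(tokens$lemma)

print(dtm)
print(freq)
```

With real text, the `tokens` data frame would typically come from annotating raw text with a downloaded or self-trained model, which requires a model file and is therefore left out of this sketch.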

Bug tracker: https://github.com/bnosac/udpipe/issues

Uses libs:
  • c++ – GNU Standard C++ Library v3

conll, dependency-parser, lemmatization, natural-language-processing, nlp, pos-tagging, r-pkg, rcpp, text-mining, tokenizer, udpipe

11.78 score 214 stars 8 packages 1.1k scripts 6.2k downloads 5 mentions 62 exports 4 dependencies

Last updated 2 years ago from: 6a974c52fe. Checks: OK: 1, NOTE: 8. Indexed: yes.

Target | Result | Date
Doc / Vignettes | OK | Nov 06 2024
R-4.5-win-x86_64 | NOTE | Nov 06 2024
R-4.5-linux-x86_64 | NOTE | Nov 06 2024
R-4.4-win-x86_64 | NOTE | Nov 06 2024
R-4.4-mac-x86_64 | NOTE | Nov 06 2024
R-4.4-mac-aarch64 | NOTE | Nov 06 2024
R-4.3-win-x86_64 | NOTE | Nov 06 2024
R-4.3-mac-x86_64 | NOTE | Nov 06 2024
R-4.3-mac-aarch64 | NOTE | Nov 06 2024

Exports: as_conllu, as_cooccurrence, as_fasttext, as_phrasemachine, as_word2vec, cbind_dependencies, cbind_morphological, collocation, cooccurrence, document_term_frequencies, document_term_frequencies_statistics, document_term_matrix, dtm_align, dtm_cbind, dtm_chisq, dtm_colsums, dtm_conform, dtm_cor, dtm_rbind, dtm_remove_lowfreq, dtm_remove_sparseterms, dtm_remove_terms, dtm_remove_tfidf, dtm_reverse, dtm_rowsums, dtm_sample, dtm_svd_similarity, dtm_tfidf, keywords_collocation, keywords_phrases, keywords_rake, paste.data.frame, phrases, strsplit.data.frame, txt_collapse, txt_contains, txt_context, txt_count, txt_freq, txt_grepl, txt_highlight, txt_next, txt_nextgram, txt_overlap, txt_paste, txt_previous, txt_previousgram, txt_recode, txt_recode_ngram, txt_sample, txt_sentiment, txt_show, txt_tagsequence, udpipe, udpipe_accuracy, udpipe_annotate, udpipe_download_model, udpipe_load_model, udpipe_read_conllu, udpipe_train, unique_identifier, unlist_tokens

Dependencies: data.table, lattice, Matrix, Rcpp

UDPipe Natural Language Processing - Text Annotation

Rendered from udpipe-annotation.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2021-06-01
Started: 2017-08-30

UDPipe Natural Language Processing - Basic Analytical Use Cases

Rendered from udpipe-usecase-postagging-lemmatisation.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2021-06-01
Started: 2018-02-06

UDPipe Natural Language Processing - Model Building

Rendered from udpipe-train.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2021-06-01
Started: 2017-08-31

UDPipe Natural Language Processing - Parallel

Rendered from udpipe-parallel.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2021-06-01
Started: 2019-05-17

UDPipe Natural Language Processing - Topic Modelling Use Cases

Rendered from udpipe-usecase-topicmodelling.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2021-06-01
Started: 2018-03-06

UDPipe Natural Language Processing - Try it out

Rendered from udpipe-tryitout.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2020-10-09
Started: 2018-01-15

UDPipe Natural Language Processing - Universe

Rendered from udpipe-universe.Rmd using knitr::rmarkdown on Nov 06 2024.

Last update: 2021-12-02
Started: 2020-10-09

Readme and manuals

Help Manual

Help page | Topics
Convert a data.frame to CONLL-U format | as_conllu
Convert a matrix to a co-occurrence data.frame | as_cooccurrence
Combine labels and text as used in fasttext | as_fasttext
Convert Parts of Speech tags to one-letter tags which can be used to identify phrases based on regular expressions | as_phrasemachine
Convert a matrix of word vectors to word2vec format | as_word2vec
Convert the result of udpipe_annotate to a tidy data frame | as.data.frame.udpipe_connlu
Convert the result of cooccurrence to a sparse matrix | as.matrix.cooccurrence
Brussels AirBnB address locations available at www.insideairbnb.com | brussels_listings
Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com | brussels_reviews
Reviews of the AirBnB customers which are tokenised, POS tagged and lemmatised | brussels_reviews_anno
An example matrix of word embeddings | brussels_reviews_w2v_embeddings_lemma_nl
Add the dependency parsing information to an annotated dataset | cbind_dependencies
Add morphological features to an annotated dataset | cbind_morphological
Create a cooccurrence data.frame | cooccurrence cooccurrence.character cooccurrence.cooccurrence cooccurrence.data.frame
Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document | document_term_frequencies document_term_frequencies.character document_term_frequencies.data.frame
Add Term Frequency, Inverse Document Frequency and Okapi BM25 statistics to the output of document_term_frequencies | document_term_frequencies_statistics
Create a document/term matrix | document_term_matrix document_term_matrix.data.frame document_term_matrix.default document_term_matrix.DocumentTermMatrix document_term_matrix.integer document_term_matrix.matrix document_term_matrix.numeric document_term_matrix.simple_triplet_matrix document_term_matrix.TermDocumentMatrix
Reorder a Document-Term-Matrix alongside a vector or data.frame | dtm_align
Combine 2 document term matrices either by rows or by columns | dtm_bind dtm_cbind dtm_rbind
Compare term usage across 2 document groups using the Chi-square Test for Count Data | dtm_chisq
Column sums and Row sums for document term matrices | dtm_colsums dtm_rowsums
Make sure a document term matrix has exactly the specified rows and columns | dtm_conform
Pearson Correlation for Sparse Matrices | dtm_cor
Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms | dtm_remove_lowfreq
Remove terms with high sparsity from a Document-Term-Matrix | dtm_remove_sparseterms
Remove terms from a Document-Term-Matrix and keep only documents which have at least some terms | dtm_remove_terms
Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency | dtm_remove_tfidf
Inverse operation of the document_term_matrix function | dtm_reverse
Random samples and permutations from a Document-Term-Matrix | dtm_sample
Semantic Similarity to a Singular Value Decomposition | dtm_svd_similarity
Term Frequency - Inverse Document Frequency calculation | dtm_tfidf
Extract collocations - a sequence of terms which follow each other | collocation keywords_collocation
Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags | keywords_phrases phrases
Keyword identification using Rapid Automatic Keyword Extraction (RAKE) | keywords_rake
Concatenate text of each group of data together | paste.data.frame
Predict method for an object of class LDA_VEM or class LDA_Gibbs | predict.LDA predict.LDA_Gibbs predict.LDA_VEM
Obtain a tokenised data frame by splitting text alongside a regular expression | strsplit.data.frame
Experimental and undocumented querying of syntax patterns | syntaxpatterns syntaxpatterns-class
Experimental and undocumented querying of syntax relationships | &,logical,syntaxrelation-method &,syntaxrelation,logical-method syntaxrelation syntaxrelation-class |,logical,syntaxrelation-method |,syntaxrelation,logical-method
Collapse a character vector while removing missing data. | txt_collapse
Check if text contains a certain pattern | txt_contains
Based on a vector with a word sequence, get n-grams (looking forward + backward) | txt_context
Count the number of times a pattern is occurring in text | txt_count
Frequency statistics of elements in a vector | txt_freq
Look up multiple patterns and indicate their presence in text | txt_grepl
Highlight words in a character vector | txt_highlight
Get the n-th next element of a vector | txt_next
Based on a vector with a word sequence, get n-grams (looking forward) | txt_nextgram
Get the overlap between 2 vectors | txt_overlap
Concatenate strings with options how to handle missing data | txt_paste
Get the n-th previous element of a vector | txt_previous
Based on a vector with a word sequence, get n-grams (looking backward) | txt_previousgram
Recode text to other categories | txt_recode
Recode words with compound multi-word expressions | txt_recode_ngram
Boilerplate function to sample one element from a vector. | txt_sample
Perform dictionary-based sentiment analysis on a tokenised data frame | txt_sentiment
Boilerplate function to cat only 1 element of a character vector. | txt_show
Identify a contiguous sequence of tags as being 1 entity | txt_tagsequence
Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format | udpipe
Evaluate the accuracy of your UDPipe model on holdout data | udpipe_accuracy
Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of raw text | udpipe_annotate
List with training options set by the UDPipe community when building models based on the Universal Dependencies data | udpipe_annotation_params
Download a UDPipe model provided by the UDPipe community for a specific language of choice | udpipe_download_model
Load a UDPipe model | udpipe_load_model
Read in a CONLL-U file as a data.frame | udpipe_read_conllu
Train a UDPipe model | udpipe_train
Create a unique identifier for each combination of fields in a data frame | unique_identifier
Create a data.frame from a list of tokens | unlist_tokens
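
To illustrate how the bundled datasets and the keyword functions listed above can be combined, here is a minimal sketch that runs `keywords_rake` on the pre-annotated `brussels_reviews_anno` data, so no model download is needed. Restricting to the Dutch reviews and treating the `NN` and `JJ` tags as the relevant ones are assumptions made for illustration, as is the expectation that the dataset carries `language`, `doc_id`, `lemma` and `xpos` columns.

```r
library(udpipe)

# Load the annotated AirBnB reviews that ship with the package.
data(brussels_reviews_anno)

# Assumption for this sketch: keep only the Dutch-language reviews.
x <- subset(brussels_reviews_anno, language == "nl")

# Rapid Automatic Keyword Extraction: candidate keywords are runs of
# "relevant" tokens (here nouns and adjectives) within each document.
keywords <- keywords_rake(
  x        = x,
  term     = "lemma",
  group    = "doc_id",
  relevant = x$xpos %in% c("NN", "JJ")
)

# Inspect the highest-ranked candidates.
head(keywords)
```

Which tags count as relevant depends on the tagset of the annotation, so the `relevant` argument is the part most likely to need adapting for other data.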