---
title: "UDPipe Natural Language Processing - Try it out"
author: "Jan Wijffels"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: false
    toc: false
vignette: >
  %\VignetteIndexEntry{UDPipe Natural Language Processing - Try it out}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE, cache=FALSE}
options(width = 1000)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = TRUE)
```

Install the R package.

```{r, eval=FALSE}
install.packages("udpipe")
```

## Example

Get your language model and start annotating.

```{r, results='hide'}
library(udpipe)
# Download the Dutch model; the result records where the model file was saved
udmodel <- udpipe_download_model(language = "dutch")
```

```{r, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = !udmodel$download_failed)
```

```{r, results='hide'}
# Load the downloaded model, annotate a sentence and convert the result to a data.frame
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel, 
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x, detailed = TRUE)
```

Or just do as follows.

```{r, results='hide', eval=FALSE}
library(udpipe)
x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.", 
            object = "dutch")
```

The annotation returns paragraphs, sentences, tokens, the location of each token in the original text, morphology elements like the lemma, the universal part of speech tag and the treebank-specific part of speech tag, morphosyntactic features, and the dependency relationship. More information at https://universaldependencies.org/guidelines.html

```{r}
str(x)
```

## A small note on encodings

Note that the `x` argument to `udpipe_annotate` must be in UTF-8 encoding.

You can check the encoding of your text with `Encoding('your text')`.
You can convert your text to UTF-8 with standard R utilities such as `iconv('your text', from = 'latin1', to = 'UTF-8')`, where you replace the `from` part with whichever encoding your text is in, possibly your computer's default as reported by `localeToCharset()`.
So if your text is not already in UTF-8 encoding, annotation would look something like this:

- `udpipe_annotate(udmodel, x = iconv('your text', to = 'UTF-8'))` if your text is in the encoding of the current locale of your computer.
- `udpipe_annotate(udmodel, x = iconv('your text', from = 'latin1', to = 'UTF-8'))` if your text is in latin1 encoding.
- `udpipe_annotate(udmodel, x = iconv('your text', from = 'CP949', to = 'UTF-8'))` if your text is in CP949 encoding.

```{r, results='hide', echo=FALSE}
invisible(if(file.exists(udmodel$file)) file.remove(udmodel$file))
```
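
The check-then-convert steps described in this section can be sketched with base R only. This is a minimal illustration, not part of the udpipe package: the variable names `latin1_text` and `utf8_text` and the sample word are made up for the example, and the final annotation call is left as a comment because it assumes a model loaded as `udmodel` as in the example above.

```r
# Illustrative input (an assumption for this sketch): the bytes of "café"
# as they would be stored in latin1, where 0xE9 is e-acute
latin1_text <- "caf\xe9"
Encoding(latin1_text) <- "latin1"   # declare what the bytes mean

# Check: the string is marked latin1 and its bytes are not valid UTF-8
Encoding(latin1_text)
validUTF8(latin1_text)

# Convert: name the source encoding explicitly in `from`
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# Check again: the result is marked UTF-8 and its bytes are valid UTF-8
Encoding(utf8_text)
validUTF8(utf8_text)

# The converted text could then be annotated, assuming `udmodel` is loaded:
# x <- udpipe_annotate(udmodel, x = utf8_text)
```

Passing `from` explicitly avoids relying on the locale of the machine that happens to run the code, which matters when the text was produced on a different system.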