Install the R package from CRAN with install.packages("udpipe").
Get your language model and start annotating.
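library(udpipe)
# Download the Dutch model once, then load it from the downloaded file
udmodel <- udpipe_download_model(language = "dutch")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel,
                     x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
# Convert the annotation to a data.frame with one row per token
x <- as.data.frame(x, detailed = TRUE)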
Or just do as follows.
library(udpipe)
x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
object = "dutch")
The annotation returns paragraphs, sentences, tokens, the location of each token in the original text, morphological elements such as the lemma, the universal part-of-speech tag and the treebank-specific part-of-speech tag, morphosyntactic features, and the dependency relationships. More information at https://universaldependencies.org/guidelines.html
'data.frame': 18 obs. of 17 variables:
$ doc_id : chr "doc1" "doc1" "doc1" "doc1" ...
$ paragraph_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence_id : int 1 1 1 1 1 1 1 1 1 2 ...
$ sentence : chr "Ik ging op reis en ik nam mee:" "Ik ging op reis en ik nam mee:" "Ik ging op reis en ik nam mee:" "Ik ging op reis en ik nam mee:" ...
$ start : int 1 4 9 12 17 20 23 27 30 32 ...
$ end : int 2 7 10 15 18 21 25 29 30 35 ...
$ term_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ token_id : chr "1" "2" "3" "4" ...
$ token : chr "Ik" "ging" "op" "reis" ...
$ lemma : chr "ik" "gaan" "op" "reis" ...
$ upos : chr "PRON" "VERB" "ADP" "NOUN" ...
$ xpos : chr "VNW|pers|pron|nomin|vol|1|ev" "WW|pv|verl|ev" "VZ|init" "N|soort|ev|basis|zijd|stan" ...
$ feats : chr "Case=Nom|Person=1|PronType=Prs" "Number=Sing|Tense=Past|VerbForm=Fin" NA "Gender=Com|Number=Sing" ...
$ head_token_id: chr "2" "0" "4" "2" ...
$ dep_rel : chr "nsubj" "root" "case" "obl" ...
$ deps : chr NA NA NA NA ...
$ misc : chr NA NA NA NA ...
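Because the result is an ordinary data.frame, it can be summarised with base R. The sketch below is only an illustration: it builds a small stand-in data.frame by hand from the first four tokens listed above (so it runs without a downloaded model) and assumes that the upos and lemma columns are the ones of interest.

```r
# Hand-built stand-in for the first four rows of the annotation shown above;
# in practice x would come from udpipe() or as.data.frame(udpipe_annotate(...)).
x <- data.frame(
  doc_id = "doc1",
  token  = c("Ik", "ging", "op", "reis"),
  lemma  = c("ik", "gaan", "op", "reis"),
  upos   = c("PRON", "VERB", "ADP", "NOUN"),
  stringsAsFactors = FALSE
)

# Frequency of each universal part-of-speech tag
upos_counts <- table(x$upos)

# Lemmas of the nouns and verbs only
content_lemmas <- x$lemma[x$upos %in% c("NOUN", "VERB")]

print(upos_counts)
print(content_lemmas)
```

The same two lines should work unchanged on the full 18-row annotation, since they rely only on the upos and lemma columns.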
Note that it is important that the x argument to udpipe_annotate is in UTF-8 encoding. You can check the encoding of your text with Encoding('your text'). You can convert your text to UTF-8 using standard R utilities, as in iconv('your text', from = 'latin1', to = 'UTF-8'), where you replace the from part with whichever encoding your text is in, possibly your computer's default as defined in localeToCharset(). So annotation would look something like this if your text is not already in UTF-8 encoding:
udpipe_annotate(udmodel, x = iconv('your text', to = 'UTF-8'))
if your text is in the encoding of the current locale of your computer.
udpipe_annotate(udmodel, x = iconv('your text', from = 'latin1', to = 'UTF-8'))
if your text is in latin1 encoding.
udpipe_annotate(udmodel, x = iconv('your text', from = 'CP949', to = 'UTF-8'))
if your text is in CP949 encoding.
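The conversions above can be wrapped in a small helper so that a failed conversion is noticed before annotating. This is a minimal sketch using only base R: to_utf8 is a hypothetical name, not a udpipe function, and the latin1 test string is built by hand for illustration.

```r
# Hypothetical helper (not part of udpipe): convert a character vector to UTF-8.
# `from` is the encoding the text is currently in; "" means the current locale.
to_utf8 <- function(text, from = "") {
  out <- iconv(text, from = from, to = "UTF-8")
  # iconv() returns NA for elements it cannot convert
  if (anyNA(out[!is.na(text)])) {
    stop("some elements could not be converted from '", from, "' to UTF-8")
  }
  out
}

# "café" as latin1 bytes: the final byte 0xe9 is not valid UTF-8 on its own
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
utf8_text <- to_utf8(latin1_text, from = "latin1")

validUTF8(latin1_text)  # FALSE
validUTF8(utf8_text)    # TRUE
```

The converted text can then be passed as the x argument, for example udpipe_annotate(udmodel, x = to_utf8(your_text, from = 'latin1')), where your_text stands for your own character vector.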