Package: doc2vec 0.2.1

Jan Wijffels

doc2vec: Distributed Representations of Sentences, Documents and Topics

Learn vector representations of sentences, paragraphs or documents using the 'Paragraph Vector' algorithms, namely the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') models. The techniques in the package are detailed in the paper "Distributed Representations of Sentences and Documents" by Le and Mikolov (2014), available at <arxiv:1405.4053>.

The package also provides an implementation to cluster documents based on these embeddings using a technique called top2vec. Top2vec finds clusters in text documents by combining document and word embeddings with density-based clustering. It first embeds documents in the semantic space defined by the 'doc2vec' algorithm. Next, it maps these document embeddings to a lower-dimensional space using the 'Uniform Manifold Approximation and Projection' (UMAP) dimensionality reduction algorithm and finds dense areas in that space using 'Hierarchical Density-Based Spatial Clustering of Applications with Noise' (HDBSCAN). These dense areas are the topic clusters. Each cluster can be represented by a topic vector, which is an aggregate of the embeddings of the documents in that cluster. Because words are embedded in the same semantic space, words close to a topic vector can be found and used to describe the topic. More details can be found in the paper "Top2Vec: Distributed Representations of Topics" by D. Angelov, available at <arxiv:2008.09470>.
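The final step of the description, representing each topic by an aggregate of its documents' embeddings and then looking for nearby words, can be sketched in base R. This is a toy illustration and not the package's implementation: the embeddings, the words and the cluster labels are made up, the aggregate is assumed to be a plain mean, and cosine similarity is assumed as the measure of closeness.

```r
# Toy 2-dimensional embeddings (made-up numbers, for illustration only).
doc_emb <- rbind(
  doc_a = c(1.0, 0.1),
  doc_b = c(0.9, 0.2),
  doc_c = c(0.1, 1.0),
  doc_d = c(0.2, 0.8)
)
word_emb <- rbind(
  budget = c(1.0, 0.0),
  tax    = c(0.8, 0.3),
  health = c(0.0, 1.0)
)

# Hypothetical cluster labels, standing in for the output of a
# density-based clustering step on the reduced embeddings.
cluster <- c(doc_a = 1, doc_b = 1, doc_c = 2, doc_d = 2)

# Topic vector = mean of the document embeddings in each cluster
# (the mean is an assumed choice of aggregate).
topic_emb <- rowsum(doc_emb, group = cluster) / as.vector(table(cluster))

# Cosine similarity between every row of a and every row of b.
cosine <- function(a, b) {
  a_unit <- a / sqrt(rowSums(a^2))
  b_unit <- b / sqrt(rowSums(b^2))
  a_unit %*% t(b_unit)
}

# For each topic, pick the word whose embedding is most similar.
sim <- cosine(topic_emb, word_emb)
closest_word <- colnames(sim)[apply(sim, 1, which.max)]
```

In practice more than one word per topic would be kept; taking the single best match keeps the sketch short.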

Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], hiyijian [ctb, cph]

doc2vec_0.2.1.tar.gz
doc2vec_0.2.1.zip (r-4.5), doc2vec_0.2.1.zip (r-4.4), doc2vec_0.2.1.zip (r-4.3)
doc2vec_0.2.1.tgz (r-4.5-x86_64), doc2vec_0.2.1.tgz (r-4.5-arm64), doc2vec_0.2.1.tgz (r-4.4-x86_64), doc2vec_0.2.1.tgz (r-4.4-arm64), doc2vec_0.2.1.tgz (r-4.3-x86_64), doc2vec_0.2.1.tgz (r-4.3-arm64)
doc2vec_0.2.1.tar.gz (r-4.5-noble), doc2vec_0.2.1.tar.gz (r-4.4-noble)
doc2vec_0.2.1.tgz (r-4.4-emscripten), doc2vec_0.2.1.tgz (r-4.3-emscripten)
doc2vec.pdf | doc2vec.html
doc2vec/json (API)
NEWS

# Install 'doc2vec' in R:

install.packages('doc2vec', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/bnosac/doc2vec/issues

Uses libs:

c++: GNU Standard C++ Library v3

Datasets:

be_parliament_2020 - Corpus with Questions asked in the Belgium Federal Parliament in 2020

On CRAN:

Topics: doc2vec, embeddings, natural-language-processing, paragraph2vec, word2vec, cpp

5.74 score, 48 stars, 23 scripts, 923 downloads, 27 mentions, 6 exports, 1 dependency

Last updated 3 years ago from: 9b40740efc. Checks: 1 OK, 11 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 06 2025
R-4.5-win-x86_64	NOTE	Mar 06 2025
R-4.5-mac-x86_64	NOTE	Mar 06 2025
R-4.5-mac-aarch64	NOTE	Mar 06 2025
R-4.5-linux-x86_64	NOTE	Mar 06 2025
R-4.4-win-x86_64	NOTE	Mar 06 2025
R-4.4-mac-x86_64	NOTE	Mar 06 2025
R-4.4-mac-aarch64	NOTE	Mar 06 2025
R-4.4-linux-x86_64	NOTE	Mar 06 2025
R-4.3-win-x86_64	NOTE	Mar 06 2025
R-4.3-mac-x86_64	NOTE	Mar 06 2025
R-4.3-mac-aarch64	NOTE	Mar 06 2025

Exports: paragraph2vec, paragraph2vec_similarity, read.paragraph2vec, top2vec, txt_count_words, write.paragraph2vec

Dependencies: Rcpp

Help page	Topics
Get the document or word vectors of a paragraph2vec model	as.matrix.paragraph2vec
Corpus with Questions asked in the Belgium Federal Parliament in 2020	be_parliament_2020
Train a paragraph2vec also known as doc2vec model on text	paragraph2vec
Similarity between document / word vectors as used in paragraph2vec	paragraph2vec_similarity
Predict functionalities for a paragraph2vec model	predict.paragraph2vec
Read a binary paragraph2vec model from disk	read.paragraph2vec
Get summary information of a top2vec model	summary.top2vec
Distributed Representations of Topics	top2vec
Count the number of spaces occurring in text	txt_count_words
Update a Top2vec model	update.top2vec
Save a paragraph2vec model to disk	write.paragraph2vec
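As a rough illustration of how the functions listed above fit together, the following sketch trains a small model on the bundled corpus, extracts the embeddings, compares them and saves the model. It assumes the package is installed and that the Dutch text is in the `text_nl` column of `be_parliament_2020`. The cleaning steps, the 1000-word cut-off and the hyperparameter values (`dim`, `iter`, `min_count`, `lr`, `threads`) are illustrative choices, not recommendations.

```r
library(doc2vec)

# Dutch-language questions from the bundled corpus, with light text cleaning.
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text   = be_parliament_2020$text_nl,
                stringsAsFactors = FALSE)
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- trimws(gsub("[[:space:]]+", " ", x$text))

# Keep non-empty documents below 1000 words (illustrative threshold).
x <- x[nchar(x$text) > 0 & txt_count_words(x$text) < 1000, ]

# Train a PV-DBOW model; the values below are kept small so it runs quickly.
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 50, iter = 5,
                       min_count = 5, lr = 0.05, threads = 1)

# Extract the document and word embedding matrices.
doc_emb  <- as.matrix(model, which = "docs")
word_emb <- as.matrix(model, which = "words")

# Words most similar to the first two documents.
sims <- paragraph2vec_similarity(doc_emb[1:2, , drop = FALSE], word_emb, top_n = 5)

# Save the model to a temporary file and read it back.
path <- file.path(tempdir(), "doc2vec_model.bin")
write.paragraph2vec(model, path)
model2 <- read.paragraph2vec(path)
```

A model trained this way is the kind of input `top2vec` is described as working on; that step is left out here because it depends on additional clustering and dimensionality reduction tooling.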
