Package: doc2vec 0.2.1
doc2vec: Distributed Representations of Sentences, Documents and Topics
Learn vector representations of sentences, paragraphs or documents by using the 'Paragraph Vector' algorithms, namely the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') models. The techniques in the package are detailed in the paper "Distributed Representations of Sentences and Documents" by Le and Mikolov (2014), available at <arxiv:1405.4053>. The package also provides an implementation to cluster documents based on these embeddings using a technique called top2vec. Top2vec finds clusters in text documents by combining document and word embeddings with density-based clustering. It first embeds documents in the semantic space defined by the 'doc2vec' algorithm. Next it maps these document embeddings to a lower-dimensional space using the 'Uniform Manifold Approximation and Projection' (UMAP) dimensionality reduction algorithm and finds dense areas in that space using a 'Hierarchical Density-Based Clustering' technique (HDBSCAN). Each dense area is a topic cluster, represented by a topic vector that aggregates the embeddings of the documents belonging to that cluster. Words that lie close to the topic vector in the same semantic space are representative of the topic. More details can be found in the paper 'Top2Vec: Distributed Representations of Topics' by D. Angelov, available at <arxiv:2008.09470>.
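As a rough illustration of the description above, the following sketch trains a small PV-DBOW model on the bundled be_parliament_2020 corpus and extracts the resulting vectors. It assumes the corpus has the columns `doc_id` and `text_nl`, and the cleaning steps and hyperparameter values (`dim`, `iter`, `min_count`) are illustrative choices, not recommendations from the package authors.

```r
library(doc2vec)

# Bundled corpus listed on this page; the column names `doc_id` and
# `text_nl` (Dutch text) are assumed here.
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text   = be_parliament_2020$text_nl,
                stringsAsFactors = FALSE)

# Light normalisation (illustrative): lower-case, keep letters only,
# collapse whitespace so that a single space separates words.
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)

# Keep non-empty documents with fewer than 1000 words.
x$nwords <- txt_count_words(x$text)
x <- subset(x, !is.na(text) & nchar(text) > 0 & nwords < 1000)

# Train a distributed bag of words model; use type = "PV-DM" for the
# distributed memory variant.
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 50, iter = 5,
                       min_count = 5, lr = 0.05, threads = 1)

# Document and word embeddings live in the same semantic space.
doc_vectors  <- as.matrix(model, which = "docs")
word_vectors <- as.matrix(model, which = "words")
dim(doc_vectors)

# Documents closest to the first document known to the model.
nearest <- predict(model, newdata = rownames(doc_vectors)[1],
                   type = "nearest", which = "doc2doc", top_n = 5)
nearest
```

A model trained this way is the kind of object that `top2vec` takes as input for the UMAP and HDBSCAN steps described above.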
doc2vec_0.2.1.tar.gz
doc2vec_0.2.1.zip (r-4.5), doc2vec_0.2.1.zip (r-4.4), doc2vec_0.2.1.zip (r-4.3)
doc2vec_0.2.1.tgz (r-4.4-x86_64), doc2vec_0.2.1.tgz (r-4.4-arm64), doc2vec_0.2.1.tgz (r-4.3-x86_64), doc2vec_0.2.1.tgz (r-4.3-arm64)
doc2vec_0.2.1.tar.gz (r-4.5-noble), doc2vec_0.2.1.tar.gz (r-4.4-noble)
doc2vec_0.2.1.tgz (r-4.4-emscripten), doc2vec_0.2.1.tgz (r-4.3-emscripten)
doc2vec.pdf | doc2vec.html
doc2vec/json (API)
NEWS
# Install 'doc2vec' in R:
install.packages('doc2vec', repos = c('https://bnosac.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/bnosac/doc2vec/issues
- be_parliament_2020 - Corpus with Questions asked in the Belgium Federal Parliament in 2020
doc2vec, embeddings, natural-language-processing, paragraph2vec, word2vec
Last updated 3 years ago from: 9b40740efc. Checks: OK: 1, NOTE: 7, ERROR: 1. Indexed: yes.
Target | Result | Date |
---|---|---|
Doc / Vignettes | OK | Nov 06 2024 |
R-4.5-win-x86_64 | NOTE | Nov 06 2024 |
R-4.5-linux-x86_64 | NOTE | Nov 06 2024 |
R-4.4-win-x86_64 | NOTE | Nov 06 2024 |
R-4.4-mac-x86_64 | NOTE | Nov 06 2024 |
R-4.4-mac-aarch64 | NOTE | Nov 06 2024 |
R-4.3-win-x86_64 | NOTE | Nov 06 2024 |
R-4.3-mac-x86_64 | NOTE | Nov 06 2024 |
R-4.3-mac-aarch64 | ERROR | Nov 06 2024 |
Exports: paragraph2vec, paragraph2vec_similarity, read.paragraph2vec, top2vec, txt_count_words, write.paragraph2vec
Dependencies: Rcpp
Readme and manuals
Help Manual
Help page | Topics |
---|---|
Get the document or word vectors of a paragraph2vec model | as.matrix.paragraph2vec |
Corpus with Questions asked in the Belgium Federal Parliament in 2020 | be_parliament_2020 |
Train a paragraph2vec also known as doc2vec model on text | paragraph2vec |
Similarity between document / word vectors as used in paragraph2vec | paragraph2vec_similarity |
Predict functionalities for a paragraph2vec model | predict.paragraph2vec |
Read a binary paragraph2vec model from disk | read.paragraph2vec |
Get summary information of a top2vec model | summary.top2vec |
Distributed Representations of Topics | top2vec |
Count the number of spaces occurring in text | txt_count_words |
Update a Top2vec model | update.top2vec |
Save a paragraph2vec model to disk | write.paragraph2vec |
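To show how several of the help topics above fit together, here is a minimal sketch that trains a small model, saves it to disk, reads it back, and compares document vectors. The column names `doc_id` and `text_nl`, the temporary file path, and the hyperparameter values are assumptions made for illustration.

```r
library(doc2vec)

# Prepare the bundled corpus (columns `doc_id` and `text_nl` assumed).
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text   = tolower(be_parliament_2020$text_nl),
                stringsAsFactors = FALSE)
x$text <- trimws(gsub("[^[:alpha:]]+", " ", x$text))
x <- x[!is.na(x$text) & nchar(x$text) > 0 & txt_count_words(x$text) < 1000, ]

# Small model, kept deliberately cheap to train.
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 10, iter = 3,
                       min_count = 5, threads = 1)

# Save the model in binary form and read it back.
path <- tempfile(fileext = ".bin")
write.paragraph2vec(model, file = path)
reloaded <- read.paragraph2vec(path)

# Compare the first three documents against all documents and keep
# the three most similar ones for each.
emb <- as.matrix(reloaded, which = "docs")
top <- paragraph2vec_similarity(emb[1:3, , drop = FALSE], emb, top_n = 3)
top
```

Writing the model to a temporary file keeps the example free of side effects; in practice the path would point at a location you want to keep.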