udpipe - Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit
This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Besides text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for commonly used data manipulations on texts which are enriched with the output of the parser, namely functions and algorithms for collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.
conll, dependency-parser, lemmatization, natural-language-processing, nlp, pos-tagging, r-pkg, rcpp, text-mining, tokenizer, udpipe, cpp
12.50 score, 221 stars, 8 dependents, 1.3k scripts, 6.5k downloads

word2vec - Distributed Representations of Words
Learn vector representations of words by continuous bag of words and skip-gram implementations of the 'word2vec' algorithm. The techniques are detailed in the paper "Distributed Representations of Words and Phrases and their Compositionality" by Mikolov et al. (2013), available at <doi:10.48550/arXiv.1310.4546>.
embeddings, natural-language-processing, word2vec, cpp
8.54 score, 73 stars, 7 dependents, 311 scripts, 1.5k downloads

textrank - Summarize Text by Ranking Sentences and Finding Keywords
The 'textrank' algorithm is an extension of the 'Pagerank' algorithm for text. The algorithm allows you to summarize text by calculating how sentences are related to one another. This is done by looking at overlapping terminology used in sentences in order to set up links between sentences. The resulting sentence network is then plugged into the 'Pagerank' algorithm, which identifies the most important sentences in your text and ranks them. In a similar way, 'textrank' can also be used to extract keywords. A word network is constructed by checking whether words follow one another. The 'Pagerank' algorithm is then applied to that network to extract relevant words, after which relevant words that follow one another are combined into keywords. More information can be found in the paper by Mihalcea, Rada & Tarau, Paul (2004) <https://www.aclweb.org/anthology/W04-3252/>.
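The sentence-ranking idea described above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python and not the R interface of the 'textrank' package: the function names (`tokenize`, `overlap`, `pagerank`, `rank_sentences`) are hypothetical, and the similarity measure is a simplified version of the word-overlap score from the Mihalcea and Tarau paper, computed on sets of lowercase words.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def overlap(a, b):
    """Shared words, normalised by the log of both sentence sizes."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on a square weight matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def rank_sentences(sentences):
    """Return (score, sentence) pairs, most central sentence first."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [overlap(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return sorted(zip(pagerank(weights), sentences), reverse=True)
```

Calling `rank_sentences` on a list of sentences puts those that share vocabulary with many others at the top, while a sentence with no overlapping words only receives the baseline score.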
natural-language-processing, nlp, textrank, textrank-algorithm
7.45 score, 77 stars, 2 dependents, 122 scripts, 549 downloads

cronR - Schedule R Scripts and Processes with the 'cron' Job Scheduler
Create, edit, and remove 'cron' jobs on your unix-alike system. The package provides a set of easy-to-use wrappers to 'crontab'. It also provides an RStudio add-in to easily launch and schedule your scripts.
cron, rstudio, scheduler
7.30 score, 290 stars, 228 scripts, 414 downloads

BTM - Biterm Topic Models for Short Text
Biterm Topic Models find topics in collections of short texts. It is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns, which are called biterms. This is in contrast to traditional topic models like Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis, which are word-document co-occurrence topic models. A biterm consists of two words co-occurring in the same short text window. This context window can for example be a twitter message, a short answer on a survey, a sentence of a text or a document identifier. The techniques are explained in detail in the paper 'A Biterm Topic Model For Short Text' by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng (2013) <https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf>.
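To make the notion of a biterm concrete, the following sketch extracts the unordered word pairs that co-occur within a context window. It is an illustration in Python under stated assumptions, not the 'BTM' package's R interface: `extract_biterms`, `count_biterms` and the `window` parameter are hypothetical names, and the topic model that is fitted on top of the biterms is not shown.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens, window=None):
    """Return unordered word pairs that co-occur in one short text.

    If `window` is None, the whole text is the context window; otherwise
    two words form a biterm only when fewer than `window` positions apart.
    """
    biterms = []
    for i, j in combinations(range(len(tokens)), 2):
        if window is None or j - i < window:
            # Sort so that ("flight", "cheap") and ("cheap", "flight") match.
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


def count_biterms(documents, window=None):
    """Count biterms over a collection of tokenised short texts."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_biterms(tokens, window))
    return counts
```

For example, `count_biterms([["cheap", "flight"], ["flight", "cheap", "deal"]])` counts the pair `("cheap", "flight")` twice, once per short text.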
biterm-topic-modelling, natural-language-processing, topic-modeling, cpp
6.33 score, 96 stars, 89 scripts, 447 downloads

spark.sas7bdat - Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark'
Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark' from R. 'Apache Spark' is an open source cluster computing framework available at <http://spark.apache.org>. This R package uses the 'spark-sas7bdat' 'Spark' package (<https://spark-packages.org/package/saurfang/spark-sas7bdat>) to import and process 'SAS' data in parallel using 'Spark'. This allows you to execute 'dplyr' statements in parallel on top of 'SAS' data.
sas7bdat, spark, sparklyr
5.73 score, 28 stars, 23 scripts, 1.7k downloads

topicmodels.etm - Topic Modelling in Embedding Spaces
Find topics in texts which are semantically embedded using techniques like word2vec or GloVe. This topic modelling technique models each word with a categorical distribution whose natural parameter is the inner product between a word embedding and an embedding of its assigned topic. The techniques are explained in detail in the paper 'Topic Modeling in Embedding Spaces' by Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei (2019), available at <doi:10.48550/arXiv.1907.04907>.
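The word distribution described above can be illustrated as follows: the inner product of every word embedding with a topic embedding gives one logit per word, and a softmax turns those logits into a categorical distribution over the vocabulary. This is a minimal Python sketch of that single step, not the 'topicmodels.etm' package's R interface; `topic_word_distribution` is a hypothetical name and the embeddings in the usage note are made-up toy values.

```python
import math


def topic_word_distribution(word_embeddings, topic_embedding):
    """Softmax over inner products of word embeddings with one topic embedding.

    word_embeddings: list of equal-length numeric vectors, one per word.
    topic_embedding: numeric vector of the same length.
    """
    logits = [
        sum(r * a for r, a in zip(row, topic_embedding))
        for row in word_embeddings
    ]
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With toy embeddings `[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]` and topic embedding `[2.0, 0.0]`, the first word points in the same direction as the topic and therefore receives the highest probability.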
embeddings, lda, topic-modeling, word-embeddings, word2vec
5.38 score, 51 stars, 19 scripts, 234 downloads

image.libfacedetection - Convolutional Neural Network for Face Detection
An open source library for face detection in images. Provides a pretrained convolutional neural network based on <https://github.com/ShiqiYu/libfacedetection> which can be used to detect faces larger than 10x10 pixels.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp, openmp
5.33 score, 285 stars, 15 scripts, 246 downloads

image.CornerDetectionHarris - Implementation of the Harris Corner Detection for Images
An implementation of the Harris Corner Detection as described in the paper "An Analysis and Implementation of the Harris Corner Detector" by Sánchez J. et al (2018) available at <doi:10.5201/ipol.2018.229>. The package allows you to detect relevant points in images which are characteristic of the digital image.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp, openmp
5.15 score, 285 stars, 2 scripts, 172 downloads

image.CannyEdges - Implementation of the Canny Edge Detector for Images
An implementation of the Canny Edge Detector for detecting edges in images. The package provides an interface to the algorithm available at <https://github.com/Neseb/canny>.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, fftw3, cpp
5.15 score, 285 stars, 6 scripts, 201 downloads

image.Otsu - Otsu's Image Segmentation Method
An implementation of Otsu's Image Segmentation Method described in the paper: "A C++ Implementation of Otsu's Image Segmentation Method". The algorithm is explained at <doi:10.5201/ipol.2016.158>.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp
5.15 score, 285 stars, 225 downloads

image.LineSegmentDetector - Detect Line Segments in Images
An implementation of the Line Segment Detector on digital images described in the paper: "LSD: A Fast Line Segment Detector with a False Detection Control" by Rafael Grompone von Gioi et al (2012). The algorithm is explained at <doi:10.5201/ipol.2012.gjmr-lsd>.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp
5.15 score, 285 stars, 7 scripts, 250 downloads

image.ContourDetector - Implementation of the Unsupervised Smooth Contour Line Detection for Images
An implementation of the Unsupervised Smooth Contour Detection algorithm for digital images as described in the paper: "Unsupervised Smooth Contour Detection" by Rafael Grompone von Gioi and Gregory Randall (2016). The algorithm is explained at <doi:10.5201/ipol.2016.175>.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp
5.15 score, 285 stars, 7 scripts, 251 downloads

image.CornerDetectionF9 - Find Corners in Digital Images with FAST-9
An implementation of the "FAST-9" corner detection algorithm explained in the paper 'FASTER and better: A machine learning approach to corner detection' by Rosten E., Porter R. and Drummond T. (2008), available at <doi:10.48550/arXiv.0810.2434>. The package allows you to detect corners in digital images.
canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp
5.15 score, 285 stars, 7 scripts, 213 downloads

tokenizers.bpe - Byte Pair Encoding Text Tokenization
Unsupervised text tokenizer focused on computational efficiency. Wraps the 'YouTokenToMe' library <https://github.com/VKCOM/YouTokenToMe> which is an implementation of fast Byte Pair Encoding (BPE) <https://aclanthology.org/P16-1162/>.
bpe, byte-pair-encoding, text-mining, tokenization, cpp
4.99 score, 16 stars, 61 scripts, 745 downloads

sentencepiece - Text Tokenization using Byte Pair Encoding and Unigram Modelling
Unsupervised text tokenizer which performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece> which provides a language independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
byte, natural-language-processing, sentencepiece, word-segmentation, cpp
4.50 score, 29 stars, 11 scripts, 318 downloads

nametagger - Named Entity Recognition in Texts using 'NameTag'
Wraps the 'nametag' library <https://github.com/ufal/nametag>, allowing you to find and extract entities (names, persons, locations, addresses, ...) in raw text and build your own entity recognition models. Based on a maximum entropy Markov model which is described in Strakova J., Straka M. and Hajic J. (2013) <https://ufal.mff.cuni.cz/~straka/papers/2013-tsd_ner.pdf>.
ner, cpp
4.26 score, 12 stars, 10 scripts, 227 downloads