Notes:
A document-term matrix records how often each term (a word or phrase) occurs in each document of a collection: each row corresponds to a document, each column to a term, and each entry holds the frequency of that term in that document. It is commonly used in natural language processing and information retrieval to compare and analyze the content of a set of documents.
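As a minimal sketch of the idea, the following builds a small document-term matrix with plain Python. The function name `document_term_matrix` and the whitespace tokenization are assumptions made for illustration; real tools tokenize more carefully.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term vocab[j] in docs[i]."""
    # Naive tokenization: lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives a stable column order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, dtm = document_term_matrix(docs)
```

Here each row of `dtm` describes one document and each column one term from `vocab`.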
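Doc2mat is a utility that performs lexical analysis and transformation on a collection of documents and outputs the result as a document-term matrix. It preserves word-oriented tokens and document segmentation, so each document remains a separate row and its content can be analyzed term by term. According to the sources listed under "See also", it is a Perl script distributed with CLUTO that converts non-alphanumeric characters to whitespace, lowercases words, applies a built-in stop list, and stems words with the Porter algorithm.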
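The kind of preprocessing described above can be sketched in a few lines of Python. This is not doc2mat's actual code: the stop list is a tiny hypothetical one, the names `tokenize` and `term_frequencies` are invented for illustration, and stemming is left out entirely.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text, stop_words=STOP_WORDS):
    # Replace every run of non-alphanumeric (ASCII) characters with a space.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # Lowercase, split on whitespace, and drop stop words.
    return [tok for tok in cleaned.lower().split() if tok not in stop_words]

def term_frequencies(docs):
    """Return one {term: count} dict per document, keeping documents separate."""
    result = []
    for doc in docs:
        counts = {}
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
        result.append(counts)
    return result

docs = ["Clustering of documents: a case-study.",
        "Documents, documents and MORE documents!"]
tf = term_frequencies(docs)
```

Keeping one dictionary per document is what preserves document segmentation; a stemmer, if used, would run on each token before counting.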
CLUTO is a software package for clustering large, high-dimensional datasets, and can be accessed from Matlab using the readCluto function. It is often paired with a document-term matrix to cluster documents by content, which helps reveal relationships between documents.
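To make the hand-off from a document-term matrix to a clustering tool concrete, here is a sketch that serializes sparse rows to text. The layout is an assumption based on the sparse matrix format as recalled from the CLUTO manual (a header line with the number of rows, columns, and non-zero entries, then one line per row of 1-based column/value pairs); check it against the manual before relying on it. The function name `to_cluto_sparse` is hypothetical.

```python
def to_cluto_sparse(rows, n_cols):
    """Serialize rows given as {0-based column index: value} dicts.

    Assumed layout (verify against the CLUTO manual):
      line 1:      <n_rows> <n_cols> <n_nonzeros>
      other lines: <col> <value> pairs for one row, columns numbered from 1
    """
    nnz = sum(len(row) for row in rows)
    lines = [f"{len(rows)} {n_cols} {nnz}"]
    for row in rows:
        # Write pairs in ascending column order, shifting to 1-based indices.
        pairs = (f"{col + 1} {val}" for col, val in sorted(row.items()))
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"

rows = [{0: 2, 3: 1}, {1: 1, 2: 5}]
text = to_cluto_sparse(rows, n_cols=4)
```

A sparse layout suits document-term matrices because most documents contain only a small fraction of the vocabulary.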
A high-dimensional dataset is one in which each data point is described by a large number of dimensions, where a dimension is a feature or attribute (variable) of the data point. A document-term matrix is typically high-dimensional because every distinct term in the collection is its own dimension.
For example, a dataset that contains information about customer demographics and purchase history might have dimensions such as age, gender, income, location, and product categories. A dataset that contains information about the weather might have dimensions such as temperature, humidity, wind speed, and precipitation.
High-dimensional datasets can be hard to analyze because they contain many interrelated variables. Specialized techniques, such as dimensionality reduction algorithms and visualization methods, are often used to make them tractable.
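One very simple way to reduce the number of dimensions in a document-term matrix is to drop terms that appear in too few documents. The sketch below does this; note that it is feature selection by document frequency, not a projection method such as SVD or PCA, and the name `prune_rare_terms` and the threshold `min_df` are assumptions for illustration.

```python
def prune_rare_terms(vocab, matrix, min_df=2):
    """Keep only the columns whose term occurs in at least min_df documents."""
    keep = [
        j for j in range(len(vocab))
        if sum(1 for row in matrix if row[j] > 0) >= min_df
    ]
    new_vocab = [vocab[j] for j in keep]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    return new_vocab, new_matrix

vocab = ["cat", "dog", "sat", "saw", "the"]
matrix = [[1, 0, 1, 0, 1],
          [1, 1, 0, 1, 2],
          [0, 1, 0, 0, 1]]
small_vocab, small_matrix = prune_rare_terms(vocab, matrix)
```

Terms that occur in a single document cannot link documents together, so removing them shrinks the matrix with little loss for clustering.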
Wikipedia:
- Document-term matrix
- tf-idf (term frequency–inverse document frequency)
See also:
Improving document clustering using Okapi BM25 feature weighting JS Whissell, CLA Clarke – Information Retrieval, 2011 – Springer … The vectors for each dataset could be created from base documents using a simple script called doc2mat.2 Conceptually, doc2mat performs the following operations to generate the vectors for the datasets: (1) all non-alphanumeric characters are converted to whitespace; (2 … Cited by 16 Related articles All 7 versions
Phenotype mining for functional genomics and gene discovery P Groth, U Leser, B Weiss – In Silico Tools for Gene Discovery, 2011 – Springer … 2. To further prepare textual phenotype descriptions for clustering (see Sections 3.2 and 3.3), the doc2mat program was downloaded from http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/doc2mat-1.0.tar.gz. … Cited by 5 Related articles All 5 versions
Semantic Features from Web-Traffic Streams S Hutchinson – Network Science and Cybersecurity, 2014 – Springer … DOC2MAT—performs lexical analysis, transformation, and output representation in (T:F) vectors, preserving word-oriented tokens and document segmentation. DOC2MAT is a re-implementation of the doc2mat.pl utility provided … Related articles All 2 versions
Visualizing Research Patterns in the Field of e-Learning MS Khan, M Ebner, B Taraghi – researchgate.net … found in [10]. In this study, instead of using keywords or classification assigned to papers by Ed-Media, we are extracting dictionary of terms ourselves using an open source utility named doc2mat [11] from papers titles. We did … Related articles
News analysis through text mining: a case study IC Mogotsi – VINE, 2007 – emeraldinsight.com … 2003). CLUTO comes with a useful utility, doc2mat, which converts raw documents into the data matrices that can be readily used for clustering. … approach. The doc2mat utility comes with an in-built stop list that can be used directly. … Cited by 6 Related articles All 3 versions
Improving word coverage using unsupervised morphological analyser KVN Sunitha, N Kalyani – Sadhana, 2009 – Springer … To use CLUTO tool the data should be converted to matrix form which is done by using doc2mat tool is used. 2.2a Algorithm for step two: … Note: DOC2MAT tool was modified to retain the cases of suffixes unchanged (usually DOC2MAT changes all words in to small case). … Cited by 1 Related articles All 15 versions
Automating the Analysis of Freetext Answers to Openended Questions Y Lin, P Halfpenny, J Gibson, F Tekiner, P Bennett… – personalpages.manchester.ac.uk … Once the responses had been divided up into phrases, they were converted to the vectorspace format needed by the clustering software that was to be used at the textmining stage by means of a utility called doc2mat (http://glaros.dtc.umn.edu/gkhome/files/fs/sw/cluto/doc2mat … Related articles
Research trends in the field of e-learning from 2003 to 2008: A scientometric and content analysis for selected journals and conferences using visualization H Maurer, M Salman Khan – Interactive Technology and Smart …, 2010 – emeraldinsight.com … 2.3 Identification of topics based on concepts clusters In order to identify the research topics in the field of e-learning, we used paper titles and abstracts and employed open-source doc2mat (2008) utility to convert the documents in a vector space format. … Cited by 7 Related articles All 3 versions
A New Approach for Clustering Variable Length Documents N Kumar, K Srinathan – Advance Computing Conference, 2009. …, 2009 – ieeexplore.ieee.org … doc2mat Perl script (available with CLUTO [15]) and, (2) … All documents are converted to ASCII text form and then author specified keyphrases are removed. … creating vector space model and (2) By using document vector model generated by CLUTO’s doc2mat script approach. …
Evaluation of Partitional Algorithms for Clustering Medical Documents OIE Mohamed, FH Saad, EA Mohamed – Citeseer … The words considered being same words if they share the same stem. The class labels of all different data sets are generated by Doc2Mat [10]. … 27, pp. 379-423. 1948. [10] George Karypis. Doc2Mat: Converting Documents into vector- space format. Program. [11] MF Porter. … Related articles All 3 versions
TMG: A MATLAB toolbox for generating term-document matrices from text collections D Zeimpekis, E Gallopoulos – Grouping multidimensional data, 2006 – Springer … Other tools that one can find in the open literature are Doc2mat [3], written in perl and developed in the context of the cluto IR package [257]; mc [8], written in c++ [129,136]; and the Unix shell script utility countallwords included in the pddp package [77]. … Cited by 110 Related articles All 9 versions
Design of a MATLAB toolbox for term-document matrix generation D Zeimpekis, E Gallopoulos – Proceedings of the Workshop on Clustering …, 2005 – Citeseer … The Telcordia LSI Engine, the General Text Parser (GTP) ([2, 18]), the PGTP (an MPI-based parallel version of GTP), the DOC2MAT [1] developed in the context of the CLUTO IR package [21] and the scripts mc [5, 16] and countallwords (included in the PDDP package [13]), are … Cited by 31 Related articles All 6 versions
Identification of Bilingual Segments for Translation Generation KK Mahesh, L Gomes, JGP Lopes – Advances in Intelligent Data Analysis …, 2014 – Springer … 19]. In the experiments presented here, partition approach was adapted for clustering. To prepare the data for clustering, the doc2mat11 tool is used, which provides the necessary conversion of data into matrix form. We experimented … Related articles All 3 versions
Estimating missing features to improve multimedia retrieval A Bagherjeiran, NS Love… – Image Processing, 2007. …, 2007 – ieeexplore.ieee.org … Text Features Given a caption, we extract the text features using the script, doc2mat [2], which removes common words and finds the root words or tokens in the caption. For example, the caption “President George W. Bush…” becomes the set of tokens w = {presi, georg, bush}. … Cited by 1 Related articles
Predicting Gene Function using an Integrated Similarity Graph BM Malone, AD Perkins – cs.helsinki.fi … PhenomicDB contains phenotypes from many sources [5] Searchable by Entrez gene symbol Text available for all mapped genes turned into tf-idf vector doc2mat utility from CLUTO [7] 6541 distinct terms Similarity function: Absolute value of cosine distance GO Annotations … Related articles
Identification of Bilingual Suffix Classes for Classification and Translation Generation KM Kavitha, L Gomes, JGP Lopes – Advances in Artificial Intelligence– …, 2014 – Springer … and analysed in [19]. In the experiments presented here, partition approach was adopted for clustering. The doc2mat6 tool provides the necessary conversion of data to be clustered into matrix form. We applied the clustering … Related articles All 3 versions
Mining Opinion-Clusters from Very Large Unstructured Real-World Textual Data J Zizka, K Burda, F Darena – AIMSA, 2012 – akela.mendelu.cz … We have used the scripts below to transform and select desired data: doc2mat.pl Script that transforms raw data into vector representation and matrices. It is intended for usage with Cluto.1 binaryRepre.pl Script that uses Cluto’s TF representation as an input. … Cited by 2 Related articles
The fudan-uiuc participation in the bioasq challenge task 2a: The antinomyra system K Liu, J Wu, S Peng, C Zhai, S Zhu – Risk, 2014 – ceur-ws.org … Candidates 6 See http://sifaka.cs.uiuc.edu/jiang4/software/BioTokenizer.pl 7 See http://nlp.stanford.edu/software/corenlp.shtml 8 See http://glaros.dtc.umn.edu/gkhome/files/fs/sw/cluto/doc2mat.html 9 See http://www.ncbi.nlm.nih.gov/books/NBK25499/ 1314 Page 5. … Cited by 1 Related articles
Biomedical ontology mesh improves document clustering qualify on medline articles: A comparison study I Yoo, X Hu – Computer-Based Medical Systems, 2006. CBMS …, 2006 – ieeexplore.ieee.org … of the seven document clustering approaches, as shown in Figure 2. We provide all the clustering algorithms except Suffix Tree Clustering (STC) with both word*document matrixes and concept*document matrixes as inputs that are generated by doc2mat Perl script1. … Cited by 12 Related articles All 12 versions
A Comparison of Two Document Clustering Approaches for Clustering Medical Documents. FH Saad, B de la Iglesia, DG Bell – DMIN, 2006 – researchgate.net … The information contained in this report is extremely valuable for clinical purposes but difficult to handle with standard data mining techniques due to the lack of structure. The class labels of all different document sets are generated by Doc2Mat [25]. … Cited by 11 Related articles All 6 versions
A novel approach to improve rule based Telugu morphological analyzer KVN Sunitha, N Kalyani – Nature & Biologically Inspired …, 2009 – ieeexplore.ieee.org … divides the set into two groups and repeats until the number of clusters is equal to the specified number. To use CLUTO tool the data should be converted to matrix form which is done by using doc2mat tool which converts the document data to matrix form. … Cited by 3 Related articles
Tuple–Information Visualization Publications Browser A Gukov – cs.ubc.ca … Page 8. input we applied a Perl script doc2mat [15] to produce a word occurrence matrix compatible with CLUTO. As a preprocessing step the script removed stop words (general use words which do not contribute to the meaning). … Related articles All 7 versions
Comparison Of Hierarchical Agglomerative Algorithms For Clustering Medical Documents FH Saad, OIE Mohamed… – … Journal of Software …, 2012 – publications.rafa-elayyan.net … The words considered being same words if they share the same stem. The class labels of all different data sets are generated by Doc2Mat [13]. The largest data set contains 3151 documents and the smallest data set contains 2105 documents. Table 1. The data sets summery. … Related articles All 8 versions
Estimating Missing Features to Improve Multimedia Information Retrieval A Bagherjeiran, NS Love, C Kamath – 2006 – e-reports-ext.llnl.gov … Page 5. reducing the word to its root form. This is done using Porter’s stemming algorithm [25] via the doc2mat script described by Karypis et al. in [15]. This script removes common words and finds the root words or tokens in the caption. … Related articles All 8 versions
On the Evaluation of Data Clustering JS Herbach – 2008 – cs.princeton.edu Page 1. On the Evaluation of Data Clustering Joshua S. Herbach Department of Computer Science Princeton University herbach@cs.princeton.edu May 2008 Page 2. A Computer Science BSE Thesis Prof. Andrea S. LaPaugh, Advisor … Related articles
Integration of semantic-based bipartite graph representation and mutual refinement strategy for biomedical literature clustering I Yoo, X Hu, IY Song – Proceedings of the 12th ACM SIGKDD …, 2006 – dl.acm.org … clustering quality and scalability. We provided all the clustering algorithms except STC and COBRA with (word*document) matrixes (ie vector representation) as input that are generated by doc2mat Perl script [31]. For STC, we … Cited by 36 Related articles All 9 versions
The Use of Latent Semantic Indexing to Cluster Documents into Their Subject Areas R Antai, C Fox, U Kruschwitz – 2011 – repository.essex.ac.uk … only, without first performing LSA. The document collection was converted straight into a matrix, using the doc2mat feature of CLUTO, preprocessing was carried out where stop words were removed. This was done using the … Cited by 2 Related articles All 3 versions
Biomedical ontology improves biomedical literature clustering performance: a comparison study I Yoo, X Hu, IY Song – International journal of bioinformatics research …, 2007 – Inderscience … using four different clustering evaluation metrics (as mentioned above), as shown in Table 4. We provide all the clustering algorithms except STC with both (word*document) matrixes and (ontology concept*document) matrixes as inputs that are generated by doc2mat Perl script … Cited by 13 Related articles All 7 versions
What did they cover?: a cluster analysis of news stories published in the Botswana Daily News, January–December 2004 IC Mogotsi – 2005 – scholar.sun.ac.za Page 1. Copyright © 2005 University of Stellenbosch All rights reserved. Page 2. Declaration I, the undersigned, hereby declare that the work contained in this assignment is my own original work and that I have not previously … Related articles All 2 versions
A concept based indexing approach for document clustering S Barresi, S Nefti, Y Rezgui – Semantic Computing, 2008 IEEE …, 2008 – ieeexplore.ieee.org … The document indexing approaches used are the traditional bag of words – TF based indexing (TF_I), which was obtained through the use of CLUTO’s doc2mat utility, and the novel common concept based indexing approach (CConc_I). … Cited by 2 Related articles All 5 versions
Detecting weak signals by internet-based environmental scanning N Tabatabaei – 2011 – uwspace.uwaterloo.ca … 31 4.5 Pre-processing Phase ….. 32 4.6 Doc2mat File ….. 32 4.7 CLUTO ….. … Cited by 4 Related articles All 4 versions
Integrating phenotype and gene expression data for predicting gene function BM Malone, AD Perkins, SM Bridges – BMC bioinformatics, 2009 – biomedcentral.com … The document associated with each symbol was transformed into a tf-idf array. The doc2mat utility from the CLUTO package [19] applies a stop word list and the Porter stemming algorithm to produce a term frequency description of each document [13]. … Cited by 5 Related articles All 22 versions
A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE I Yoo, X Hu – Proceedings of the 6th ACM/IEEE-CS joint conference …, 2006 – dl.acm.org … In addition, we provide the clustering approaches with as input both word*document matrixes (ie vector representation) that are generated by doc2mat Perl script 3 and concept*document matrixes. For STC, we input both a word string and a concept … Cited by 44 Related articles All 12 versions
Unsupervised Stemmer to Improve Rule Based Morph Analyzer KVN Sunitha, N Kalyani – mirlabs.org … To use CLUTO tool the data should be converted to matrix form which is done by using doc2mat tool which converts the document data to matrix form. 4.3 Clustering Procedure 1. Drop all the stems that occur below a frequency count of 2 in the entire corpus. … Related articles
Interactive web caching for slow or intermittent networks J Chen, L Subramanian – Proceedings of the 4th Annual Symposium on …, 2013 – dl.acm.org … The content topic extraction was implemented by first collecting all files with text/html MIME type and removing HTML tags using the C# HTMLAgilityPack package [28]. Then, the doc2mat [22] utility was used to convert tokens into the cluto format. … Cited by 2 Related articles All 8 versions
Sparse linear methods with side information for top-n recommendations X Ning, G Karypis – Proceedings of the sixth ACM conference on …, 2012 – dl.acm.org … TFIDF representation. Then the highest weighted words were selected until cumulatively they contribute to 90% of the vector length. 1 http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/doc2mat-1.0.tar.gz Normalized TFIDF Binary … Cited by 13 Related articles All 8 versions
Scoring and summarising gene product clusters using the Gene Ontology SC Denaxas, C Tjortjis – International journal of data mining and …, 2008 – Inderscience … terms length. Automatic indexing of the profiles as well as stop word elimination was performed by using the doc2mat script; a part of the CLUTO toolkit (Karypis et al., 2004). 3.3 Quantifying biological similarity Similarity between … Cited by 5 Related articles All 14 versions
Knowledge management and discovery for genotype/phenotype data MSBP Groth – 2009 – Citeseer Page 1. Knowledge Management and Discovery for Genotype/Phenotype Data Dissertation zur Erlangung des akademischen Grades doctor rerum naturalium (Dr. rer. nat.) im Fach Informatik eingereicht an der Mathematisch-Naturwissenschaftlichen Fakultät II von M.Sc. Bioinf. … Related articles All 4 versions
Mining phenotypes for gene function prediction P Groth, B Weiss, HD Pohlenz, U Leser – BMC bioinformatics, 2008 – biomedcentral.com … results in textual comparison and clustering. We stemmed all words using the stemming algorithm from the doc2mat package supplied with the clustering toolkit CLUTO v2.1.1 by Zhao and Karypis [44]. We also removed so-called … Cited by 40 Related articles All 14 versions
Analysis of collaborative writing processes using revision maps and probabilistic topic models V Southavilay, K Yacef, P Reimann… – Proceedings of the Third …, 2013 – dl.acm.org Page 1. Analysis of Collaborative Writing Processes Using Revision Maps and Probabilistic Topic Models Vilaythong Southavilay*, Kalina Yacef*, Peter Reimann+, Rafael A. Calvo^ *School of Information Technologies University … Cited by 14 Related articles All 3 versions
Effective document clustering for large heterogeneous law firm collections JG Conrad, K Al-Kofahi, Y Zhao, G Karypis – Proceedings of the 10th …, 2005 – dl.acm.org Page 1. Effective Document Clustering for Large Heterogeneous Law Firm Collections Jack G. Conrad and Khalid Al-Kofahi Research & Development Department Thomson Legal & Regulatory St. Paul, Minnesota 55123 USA {Jack.G.Conrad, Khalid.Al-Kofahi}@Thomson.com … Cited by 36 Related articles All 13 versions
Evaluating Clusterings by Estimating Clarity JS Whissell – 2012 – uwspace.uwaterloo.ca Page 1. Evaluating Clusterings by Estimating Clarity by John Samuel Whissell A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science Waterloo, Ontario, Canada, 2012 … Cited by 1 Related articles All 4 versions
Semantic text mining and its application in biomedical domain I Yoo – 2006 – idea.library.drexel.edu Page 1. Semantic Text Mining and its Application in Biomedical Domain A Thesis Submitted to the Faculty of Drexel University by Illhoi Yoo in partial fulfillment of the requirements for the degree of Doctor of Philosophy June 2006 Page 2. © Copyright 2006 Illhoi Yoo. … Cited by 8 Related articles All 4 versions
Machine Learning and Data Mining Methods for Recommender Systems and Chemical Informatics X Ning – 2012 – conservancy.umn.edu Page 1. Machine Learning and Data Mining Methods for Recommender Systems and Chemical Informatics A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Xia Ning … Related articles All 4 versions