Notes:
N-grams are sequences of n consecutive words, whereas concgrams may include any n co-occurring words regardless of position. Concgrams therefore allow both gaps between the words and variation in their order. They are closely related to, and sometimes described as, gappy or skipping n-grams, also known as skip-grams (s-grams). Latent Semantic Indexing seems very similar to concgramming.
The average number of words in an English sentence is around 16, and the average number of words in a tweet (140 characters) is 13.
When parsing sentences from books, the sentences may be numbered consecutively so that proximity can be calculated. When multiple books are used, the sentence number may be prefixed with the book's ISBN to differentiate between volumes. This in effect assigns each sentence its own GUID (Globally Unique IDentifier).
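To make the n-gram versus concgram distinction concrete, here is a minimal Python sketch. The function names, the `window` parameter, and the sample sentence are illustrative assumptions, not taken from ConcGram or any other tool listed here; the sketch treats a concgram as an unordered set of distinct words found within a fixed-size window.

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-word sequences, kept in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concgrams(tokens, n=2, window=4):
    """Unordered sets of n distinct words co-occurring within `window` tokens.

    `window` is a hypothetical parameter for this sketch.
    """
    found = set()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        for combo in combinations(span, n):
            if len(set(combo)) == n:  # ignore a word combined with itself
                found.add(frozenset(combo))
    return found

tokens = "hard work pays and work is hard".split()
bigrams = ngrams(tokens, 2)
pairs = concgrams(tokens, n=2, window=3)

# "hard ... pays" has a gap, so it is a concgram but not a bigram.
print(("hard", "pays") in bigrams)
print(frozenset({"hard", "pays"}) in pairs)
```

Because each concgram is stored as a `frozenset`, "hard work" and "work is hard" contribute the same pair, which is the order variation described above.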
Resources (Corpus):
- Adobe FrameMaker (converts books/ePub into XML)
- Altova XMLSpy (XML editor & XSLT processor)
- ConcApp Concordancer
- ConcGram List Builder
- Google Books Ngram Viewer
- Microsoft Web N-gram Service (unigram, bigram, trigram, N-gram with N=4)
- TopBraid Composer (converts XML into RDF)
- WSConcGram (program for finding concgrams, essentially related pairs, triplets, quadruplets)
Resources (Sentence):
- GATE ANNIE Sentence Splitter
- GENIA Sentence Splitter
- JTextPro (Sentence boundary detection)
- JULIE Lab Sentence Boundary Detector (JSBD)
- Lingua-Sentence-1.03
- MorphAdorner Sentence Splitter
- Mozart SentenceSplitter
- NLTK Punkt Sentence Tokenizer
- OpenNLP Sentence Detector
- RASP4UIMA SentenceSplitter
- Sentence and paragraph breaker
- TextPro Sentence Splitter
References:
- Keyness in Texts (2010)
- ConcGram 1.0: A Phraseological Search Engine (2009)
- Corpus linguistics & Concgramming in Verbots and Pandorabots (2008)
- From n-gram to skipgram to concgram (2006)
- Statistical parsing of English sentences (2006)
See also:
Automatic Book Generation | ConcGrams | N-gram Dialog Systems | N-gram Transducers | Sentence Extractor | Sentence Grammaticality | Sentence Parsers & Dialog Systems | Sentence Recognizer
VagabotKB (circa 2013)
Step 1) Stock personality: Julia
1) global name change
Step 2) Original corpus: Vagabond Globetrotting 3 (2004)
1) concgram corpus
2) parse sentences
3) key sentences to concgrams
4) XSLT transform to knowledgebase
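Sub-steps 2 and 3 above (parsing sentences and keying them to concgrams) could look roughly like the following Python sketch. It is not the original workflow, which used the tools listed under Resources. The regex splitter, the ID format `<ISBN>-<n>`, the choice of the whole sentence as the co-occurrence window, and the placeholder ISBN are all assumptions made for illustration; the XSLT transform is not covered.

```python
import re
from collections import defaultdict
from itertools import combinations

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def number_sentences(isbn, text):
    """Map '<isbn>-<n>' IDs to sentences, numbered consecutively from 1."""
    return {f"{isbn}-{i}": s for i, s in enumerate(split_sentences(text), 1)}

def key_to_concgrams(sentences):
    """Index sentence IDs by every unordered word pair within a sentence."""
    index = defaultdict(set)
    for sid, sentence in sentences.items():
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        # Sorting makes each pair canonical, so word order is ignored.
        for pair in combinations(sorted(words), 2):
            index[pair].add(sid)
    return index

# Placeholder ISBN and text, for illustration only.
sentences = number_sentences("0000000000", "Pack light. Travel light and pack less!")
index = key_to_concgrams(sentences)

print(sorted(index[("light", "pack")]))
```

Since the IDs carry the ISBN prefix, indexes built from several books can be merged without sentence numbers colliding.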
Step 3) 2011 Questionbase: 12 months of questions
1) aggregate questions
2) filter duplicates
3) delete rows with too few words (<1)
4) delete rows with too many words (>32)
5) delete rows beginning with spaces, characters, or numerals
6) Google Refine: case normalization & clustering
7) Google Refine export (CSV), remove duplicates
8) limit character length to 140
9) remove foreign languages
10) remove miscellaneous characters (TextPad)
11) concordance (ConcGram1)
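The row-level filters in Step 3 can be approximated in a few lines of Python. This is a hedged sketch rather than the original procedure, which relied on Google Refine and TextPad. The function name `clean_questions` and its parameters are hypothetical. It assumes that "limit character length to 140" means dropping longer rows, and it treats rows that differ only in case or spacing as duplicates. Language filtering, clustering, and concordancing are not covered.

```python
import re

def clean_questions(rows, min_words=1, max_words=32, max_chars=140):
    """Apply the leading-character, length, and duplicate filters in order."""
    seen = set()
    kept = []
    for row in rows:
        # Drop empty rows and rows that start with a space, symbol, or digit.
        if not row or not row[0].isalpha():
            continue
        n_words = len(row.split())
        if n_words < min_words or n_words > max_words:
            continue
        # Assumption: over-long rows are dropped rather than truncated.
        if len(row) > max_chars:
            continue
        # Case- and whitespace-insensitive duplicate check.
        key = re.sub(r"\s+", " ", row).strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

questions = [
    "Where is the best hostel?",
    "where is the  best hostel?",   # duplicate apart from case and spacing
    " leading space",
    "123 numbers first",
    "?symbol first",
    "",
    "x " * 40,                      # 40 words, over the 32-word limit
    "How do I get a visa?",
]
cleaned = clean_questions(questions)
print(cleaned)
```

Keeping the first occurrence of each duplicate preserves the original capitalization, so the surviving rows can still be passed to a concordancer unchanged.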