Meta Guide Book Bots (VagabotKB)

Corpus linguistics & Concgramming in Verbots and Pandorabots


N-grams are sequences of n consecutive words, whereas concgrams are sets of n co-occurring words regardless of position: they allow gaps between the words as well as variation in word order. Concgrams are sometimes also described as gappy or skipping n-grams, aka skip-grams (s-grams). Latent Semantic Indexing appears similar to concgramming in that both rely on word co-occurrence. The average number of words in an English sentence is around 16, and the average number of words in a tweet (140 characters) is 13.

When parsing sentences from books, sentences may be numbered consecutively so that proximity can be calculated from the sentence numbers. When multiple books are used, the sentence number may be prepended with the ISBN in order to differentiate between volumes. This in effect assigns each sentence its own GUID (Globally Unique IDentifier).
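The difference between n-grams and concgrams can be illustrated with a minimal Python sketch. The function names and the `window` parameter are assumptions made for this illustration; they are not taken from ConcGram, WSConcGram, or any other tool listed here, which may define co-occurrence differently.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return contiguous n-word sequences, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concgrams(tokens, n, window):
    """Return sets of n distinct words that co-occur within `window` tokens.

    Assumption: each combination is stored as a sorted tuple, so
    ("cheap", "flights") and ("flights", "cheap") count as the same concgram.
    """
    found = set()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        for combo in combinations(span, n):
            key = tuple(sorted(set(combo)))
            if len(key) == n:  # skip combinations that repeat a word
                found.add(key)
    return found
```

For the tokens of "cheap flights to asia", the bigrams cover only adjacent pairs, while `concgrams(tokens, 2, 3)` also contains the gapped pair ("cheap", "to").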

Resources (Corpus):

Adobe FrameMaker (converts books/ePub into XML) | Altova XMLSpy (XML editor & XSLT processor) | ConcApp Concordancer | ConcGram List Builder | Google Books Ngram Viewer (About) | Microsoft Web N-gram Service (unigram, bigram, trigram, N-gram with N=4) | TopBraid Composer (converts XML into RDF) | WSConcGram (program for finding concgrams, essentially related pairs, triplets, quadruplets)

Resources (Sentence):

GATE ANNIE Sentence Splitter | GENIA Sentence Splitter | JTextPro (Sentence boundary detection) | JULIE Lab Sentence Boundary Detector (JSBD) | Lingua-Sentence-1.03 | MorphAdorner Sentence Splitter | Mozart SentenceSplitter | NLTK Punkt Sentence Tokenizer | OpenNLP Sentence Detector | RASP4UIMA SentenceSplitter | Sentence and paragraph breaker | TextPro Sentence Splitter

Resources (Chatlogs):

Google Refine | TextPad | ConcGram1



See also:

ConcGrams | N-gram Dialog Systems | N-gram Transducers | Sentence Extractor | Sentence Grammaticality | Sentence Parsers & Dialog Systems | Sentence Recognizer | Sentence Splitter 2011


Step 1) Stock personality: Julia

1) global name change


Step 2) Original corpus: Vagabond Globetrotting 3 (2004)

1) concgram corpus

2) parse sentences

3) key sentences to concgrams

4) XSLT transform to knowledgebase
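Sub-steps 2 and 3 above might be sketched in Python as follows. This is only an illustration under several assumptions: the regex splitter is a naive stand-in for a real sentence splitter, the ID format (ISBN, hyphen, six-digit sentence number) is hypothetical, and the whole sentence is treated as the co-occurrence window. The XSLT transform of sub-step 4 is not covered.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def number_sentences(text, isbn):
    """Return {sentence_id: sentence}; IDs look like '<isbn>-000001' (assumed format)."""
    return {
        f"{isbn}-{i:06d}": sentence
        for i, sentence in enumerate(split_sentences(text), start=1)
    }


def sentence_distance(id_a, id_b):
    """Proximity in sentences, or None when the IDs belong to different books."""
    isbn_a, num_a = id_a.rsplit("-", 1)
    isbn_b, num_b = id_b.rsplit("-", 1)
    if isbn_a != isbn_b:
        return None
    return abs(int(num_a) - int(num_b))


def index_by_concgram(sentences, n=2):
    """Map each order-independent n-word combination to the IDs of sentences containing it."""
    index = defaultdict(set)
    for sentence_id, sentence in sentences.items():
        words = sorted(set(re.findall(r"[a-z']+", sentence.lower())))
        for combo in combinations(words, n):
            index[combo].add(sentence_id)
    return index
```

Looking up a word pair in the resulting index returns the IDs of every sentence that contains both words, which is one plausible reading of "keying" sentences to concgrams.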


Step 3) 2011 Questionbase: 12 months of questions

1) aggregate questions

2) filter duplicates

3) delete rows with too few words (<1)

4) delete rows with too many words (>32)

5) delete rows beginning with spaces, special characters, or numerals

6) Google Refine: case normalization & clustering

7) Google Refine export (CSV), remove duplicates

8) limit character length to 140

9) remove foreign languages

10) remove miscellaneous characters (TextPad)

11) concordance (ConcGram1)
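Several of the filtering sub-steps above (duplicates, word counts, leading characters, character length) could be approximated in a few lines of Python. This is a rough sketch, not the Google Refine/TextPad workflow itself, and it rests on assumptions: "beginning with spaces, characters, numericals" is read as "not beginning with a letter", over-long rows are dropped rather than truncated, and duplicates are compared case-insensitively after collapsing whitespace. Clustering, language filtering, and concordancing are not covered.

```python
MIN_WORDS = 1    # rows with fewer words are dropped
MAX_WORDS = 32   # rows with more words are dropped
MAX_CHARS = 140  # rows longer than this are dropped (assumption: not truncated)


def clean_questions(rows):
    """Return the rows that survive the filters, in their original order."""
    seen = set()
    kept = []
    for row in rows:
        # Drop empty rows and rows that do not begin with a letter.
        if not row or not row[0].isalpha():
            continue
        words = row.split()
        if len(words) < MIN_WORDS or len(words) > MAX_WORDS:
            continue
        if len(row) > MAX_CHARS:
            continue
        # Case-insensitive, whitespace-normalized duplicate check.
        normalized = " ".join(words)
        key = normalized.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(normalized)
    return kept
```

Note that `str.isalpha()` also accepts non-English letters, so foreign-language rows would still need a separate pass.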