Meta Guide Book Bots


Notes:

N-grams are sequences of n consecutive words, whereas concgrams may include any n co-occurring words regardless of position. Concgrams therefore allow for gaps between the words as well as variation in their order. Concgrams may also be known as gappy or skipping n-grams, a.k.a. skip-grams (s-grams). Latent Semantic Indexing seems very similar to concgramming. The average number of words in an English sentence is around 16, and the average number of words in a tweet (140 characters) is 13. When parsing sentences from books, sentences may be numbered consecutively so that proximity can be calculated. When multiple books are used, the sentence number may be prefixed with the ISBN to differentiate between volumes. This in effect assigns each sentence its own GUID (Globally Unique IDentifier).
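
The difference between n-grams and concgrams can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `window` parameter (how far apart co-occurring words may be) are hypothetical and are not taken from any concgram tool.

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous runs of n tokens, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concgrams(tokens, n, window):
    """Sets of n distinct words that co-occur within `window` tokens.

    Using frozenset ignores word order, and taking combinations from a
    sliding window allows gaps between the words (assumed definition).
    """
    found = set()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        for combo in combinations(span, n):
            if len(set(combo)) == n:
                found.add(frozenset(combo))
    return found

tokens = "the bot answers travel questions".split()
print(ngrams(tokens, 2))
# "bot" and "travel" are not adjacent, so they never form a bigram,
# but they do fall inside the same 3-token window.
print(frozenset({"travel", "bot"}) in concgrams(tokens, 2, window=3))
```

A wider window catches more distant co-occurrences at the cost of many more candidate sets, so the window size is a tuning choice rather than a fixed rule.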

See also:

Automatic Book Generation | ConcGrams | N-gram Dialog Systems | N-gram Transducers | Sentence Extractor | Sentence Grammaticality | Sentence Parsers & Dialog Systems | Sentence Recognizer


VagabotKB (circa 2013)

Step 1) Stock personality: Julia

1) global name change

 

Step 2) Original corpus: Vagabond Globetrotting 3 (2004)

1) concgram corpus

2) parse sentences

3) key sentences to concgrams

4) XSLT transform to knowledgebase
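
A rough sketch of sub-steps 2 and 3 (parsing sentences and keying them to concgrams) follows. It assumes a naive punctuation-based sentence splitter, treats a concgram as an unordered word pair within one sentence, and uses a placeholder ISBN; all function names are hypothetical. The XSLT transform in sub-step 4 is not covered.

```python
import re
from collections import defaultdict
from itertools import combinations

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_ids(isbn, sentences):
    """Number sentences consecutively and prefix each number with the ISBN."""
    return {f"{isbn}-{i}": s for i, s in enumerate(sentences, start=1)}

def key_to_concgrams(id_to_sentence):
    """Map each unordered word pair to the IDs of sentences containing both words."""
    index = defaultdict(list)
    for sid, sentence in id_to_sentence.items():
        words = sorted(set(re.findall(r"[a-z']+", sentence.lower())))
        for pair in combinations(words, 2):
            index[pair].append(sid)
    return index

text = "Pack light. Carry a light pack!"
ids = sentence_ids("0000000000", split_sentences(text))  # placeholder ISBN
index = key_to_concgrams(ids)
print(index[("light", "pack")])
```

Because the sentence numbers are consecutive within a book, the numeric suffix of two IDs with the same ISBN prefix gives a simple measure of how close the sentences are.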

 

Step 3) 2011 Questionbase: 12 months of questions

1) aggregate questions

2) filter duplicates

3) delete too few words (<1)

4) delete too many words (>32)

5) delete rows beginning with spaces, special characters, or numerals

6) Google Refine: case normalization & clustering

7) Google Refine export (CSV), remove duplicates

8) limit character length to 140

9) remove foreign languages

10) remove miscellaneous characters (TextPad)

11) concordance (ConcGram1)
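
The mechanical filters in the list above (duplicates, word counts, leading characters, the 140-character limit) could be approximated in Python roughly as follows. This is a sketch under assumptions: the function name is hypothetical, "limit" is read as dropping over-long rows rather than truncating them, and a case-insensitive comparison stands in for the case and clustering work done in Google Refine. Language filtering and the concordance step are not attempted.

```python
MIN_WORDS, MAX_WORDS, MAX_CHARS = 1, 32, 140

def clean_questions(rows):
    """Apply the filtering rules, keeping the first copy of each question."""
    seen = set()
    kept = []
    for row in rows:
        # Drop empty rows and rows starting with a space, symbol or digit.
        if not row or not row[0].isalpha():
            continue
        # Drop rows with too few or too many words.
        n_words = len(row.split())
        if n_words < MIN_WORDS or n_words > MAX_WORDS:
            continue
        # Drop rows longer than a tweet (assumed reading of "limit").
        if len(row) > MAX_CHARS:
            continue
        # Treat rows differing only in case or spacing as duplicates.
        key = " ".join(row.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

rows = [
    "Where is the hostel?",
    "where is the  hostel?",   # duplicate after normalization
    " leading space",
    "42 is a number",
    "?what",
    "word " * 33,              # more than 32 words
    "a" * 141,                 # more than 140 characters
]
print(clean_questions(rows))
```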