Natural language text classification and filtering with trigrams and evolutionary nearest neighbour classifiers
N-grams offer fast, language-independent, multi-class text categorization. Text is reduced in a single pass to n-gram vectors. These are assigned to one of several classes by (a) a nearest neighbour (KNN) classifier and (b) a genetic algorithm operating on the weights in a nearest neighbour classifier. An accuracy of 91% is found for binary classification of short multi-author technical English documents. Accuracy falls if more categories are used, but 69% is obtained with 8 classes. Zipf's law is found not to apply to trigrams.
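As a rough illustration of the pipeline the abstract describes, the following Python sketch reduces text to a trigram count vector in a single pass and labels it by its nearest neighbour. Several details are assumptions, since the abstract does not specify them: trigrams are taken to be overlapping character trigrams, similarity is measured by cosine similarity, and only the single nearest neighbour is used. The function names and the toy training examples are hypothetical, and the genetic algorithm that tunes the classifier weights is not shown.

```python
from collections import Counter
import math


def trigram_vector(text):
    """Count overlapping character trigrams in one pass (assumed representation)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than three characters has no trigrams
    return dot / (norm_a * norm_b)


def nearest_neighbour(text, labelled_examples):
    """Return the label of the training example most similar to the text."""
    query = trigram_vector(text)
    best_label, _ = max(
        ((label, cosine_similarity(query, trigram_vector(example)))
         for example, label in labelled_examples),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical toy training set, for illustration only.
examples = [
    ("the genetic algorithm evolves a population of weights", "evolution"),
    ("the parser reads tokens and builds a syntax tree", "parsing"),
]
print(nearest_neighbour("evolving weights with a genetic algorithm", examples))
```

Because the vectors are built from raw characters, nothing in the sketch depends on a tokenizer or dictionary, which is one way to read the abstract's claim of language independence.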
|Optimization (acm G.1.6), Combinatorics (acm G.2.1), Learning (acm I.2.6), Problem Solving, Control Methods, and Search (acm I.2.8)|
|Learning and adaptive systems (msc 68T05), Problem solving (heuristics, search strategies, etc.) (msc 68T20)|
|Software Engineering [SEN]|
Langdon, W.B. (2000). Natural language text classification and filtering with trigrams and evolutionary nearest neighbour classifiers. Software Engineering [SEN]. CWI.