N-grams offer fast, language-independent, multi-class text categorization. Text is reduced in a single pass to n-gram vectors. These are assigned to one of several classes by a) a nearest neighbour (KNN) classifier and b) a genetic algorithm operating on the weights in a nearest neighbour classifier. 91% accuracy is obtained for binary classification of short, multi-author, technical English documents. Accuracy falls as more categories are used, but 69% is still obtained with 8 classes. Zipf's law is found not to apply to trigrams.
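
The pipeline summarised above (a single pass to trigram vectors, then nearest neighbour assignment) can be sketched roughly as follows. This is an illustrative sketch only, not the report's implementation: it assumes character trigrams, a cosine distance, and a single nearest neighbour, none of which the abstract specifies. The names `trigram_vector`, `weighted_cosine_distance`, and `classify` are hypothetical. The optional `weights` argument stands in for the per-trigram weights that a genetic algorithm could tune; the evolutionary search itself is not shown.

```python
from collections import Counter
import math


def trigram_vector(text):
    """Count overlapping character trigrams in one pass over the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def weighted_cosine_distance(a, b, weights=None):
    """Cosine distance between two trigram count vectors.

    `weights` optionally maps a trigram to a non-negative weight
    (default 1.0). This is an assumed stand-in for weights that an
    evolutionary search might adjust.
    """
    def w(gram):
        return 1.0 if weights is None else weights.get(gram, 1.0)

    dot = sum(w(g) * count * b[g] for g, count in a.items())
    norm_a = math.sqrt(sum(w(g) * c * c for g, c in a.items()))
    norm_b = math.sqrt(sum(w(g) * c * c for g, c in b.items()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def classify(text, labelled_examples, weights=None):
    """Return the label of the nearest labelled example (1-NN)."""
    query = trigram_vector(text)
    _, best_label = min(
        (weighted_cosine_distance(query, trigram_vector(doc), weights), label)
        for doc, label in labelled_examples
    )
    return best_label


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    training = [
        ("the cat sat on the mat", "animals"),
        ("stock prices fell sharply", "finance"),
    ]
    print(classify("the cat on the mat", training))
```

Because the vectors are built from raw character sequences, nothing in the sketch depends on tokenisation or a dictionary, which is what makes the approach language independent.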

Software Engineering [SEN]

Langdon, W.B. (2000). Natural language text classification and filtering with trigrams and evolutionary nearest neighbour classifiers. Software Engineering [SEN]. CWI.