2014
Strategies to Increase Accuracy in Text Classification
Publication
Text classification via supervised learning involves several
steps, from processing raw data and extracting features to
training and validating classifiers. Within these steps,
implementation decisions are critical to the resulting classifier
accuracy.
This paper reports on a study performed to determine the
optimum parameter setup for reaching the highest possible
accuracy when classifying multilingual (Dutch and English) user
profiles, collected from social media, by job title. The goal is
to improve the matching between job vacancies and user profiles
in an HR recruitment case. The study includes experiments with
eleven labels (job titles), a shifting pivot between the test and
training datasets, the use of combined n-grams, three feature
extraction methods (bag of words (BOW), word frequency or count
(WC), and word importance via TF-IDF), the use of word corpora
tagged with POS tags, and the use of seven well-known
classification algorithms: two Support Vector Machine (SVM)
systems, two Naive Bayes (NB) approaches, two Maximum Entropy
classifiers, and one Decision Tree (DT). Seven experiments were
performed, with a combined total of about 1900 training and test
runs. The dataset contains 95,000 profiles that were annotated
with eleven job title labels, using a tool specially developed for
this purpose.
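
To make the three feature extraction methods concrete, the following is a minimal sketch of how combined (1,2,3)-grams could be turned into BOW, WC, and TF-IDF features. It is not the implementation used in the study: the function names `combined_ngrams` and `extract_features`, the unsmoothed TF-IDF formula, and the two toy profiles are assumptions made purely for illustration.

```python
import math
from collections import Counter


def combined_ngrams(tokens, sizes=(1, 2, 3)):
    """Return all n-grams of the given sizes as space-joined strings."""
    grams = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def extract_features(documents, sizes=(1, 2, 3)):
    """Compute BOW, WC and TF-IDF feature dicts for each tokenised document."""
    counts = [Counter(combined_ngrams(doc, sizes)) for doc in documents]

    # Document frequency: the number of documents in which each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    n_docs = len(documents)
    features = []
    for c in counts:
        total = sum(c.values())
        features.append({
            # BOW: presence only, every observed n-gram gets the value 1.
            "bow": {g: 1 for g in c},
            # WC: raw occurrence count of each n-gram.
            "wc": dict(c),
            # TF-IDF (assumed variant): relative frequency times log(N / df).
            "tfidf": {
                g: (k / total) * math.log(n_docs / doc_freq[g])
                for g, k in c.items()
            },
        })
    return features


# Hypothetical toy profiles, already tokenised.
profiles = [
    "senior java developer".split(),
    "java developer and java trainer".split(),
]
features = extract_features(profiles)
```

In this sketch an n-gram that occurs in every profile, such as "java developer" here, receives a TF-IDF weight of zero, whereas BOW and WC still count it. Any of the three dictionaries could then serve as input to a classifier.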
We conclude that classifiers based on the Support Vector
Machine (SVM) achieved the highest classification accuracy (up
to 93% with seven labels). Feature extraction using combined
(1,2,3)-grams and word frequency or importance showed the
highest accuracy gain across all classifiers. The largest
accuracy gain was achieved by excluding labels whose features
were too generic. The SVM classifiers had already reached their
accuracy ceiling in 2/3 of the experiments. We believe that
further study into annotating and removing non-specific
information can increase this accuracy even more.
Additional Metadata | |
---|---|
T. van der Storm (Tijs) | |
Organisation | Software Analysis and Transformation |
Blommesteijn, D. (2014, January). Strategies to Increase Accuracy in Text Classification. |