2015
Normalized Compression Distance of Multisets with Applications
Publication
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 37, Issue 8, p. 1602–1614
Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity
measure between a pair of finite objects based on compression. However, it is not sufficient for all
applications. We propose an NCD of finite nonempty multisets (a.k.a. multiples) of finite objects that is
also a metric. Previously, attempts to obtain such an NCD failed. We cover the entire trajectory from
theoretical underpinning to feasible practice. The new NCD for multisets is applied to retinal progenitor
cell classification questions and to related synthetically generated data that were earlier treated with the
pairwise NCD. With the new method we achieved significantly better results, and likewise for questions about
axonal organelle transport. We also applied the new NCD to handwritten digit recognition and improved
classification accuracy significantly over that of pairwise NCD by combining the pairwise NCD with the
NCD for multisets. In the analysis we use the incomputable Kolmogorov complexity, which for practical
purposes is approximated from above by the length of the compressed version of the file involved, using
a real-world compression program.
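The approximation described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes `zlib` as the real-world compressor and uses the standard pairwise NCD formula, (C(xy) − min(C(x), C(y))) / max(C(x), C(y)). The function `ncd1_multiset` is a hypothetical name for one reading of the first-stage multiset quantity, (G(X) − min G(x)) / max G(X \ {x}), where G(X) is assumed to be the compressed length of the concatenated members. The full multiset NCD additionally maximizes over sub-multisets, which is omitted here, so the paper itself should be consulted for the exact definition.

```python
import zlib
from typing import Sequence


def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed data: an upper-bound stand-in
    for the (incomputable) Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd_pair(x: bytes, y: bytes) -> float:
    """Standard pairwise NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd1_multiset(items: Sequence[bytes]) -> float:
    """Assumed first-stage multiset quantity (hypothetical helper):
    (G(X) - min_x G(x)) / max_x G(X minus x),
    with G(X) taken as the compressed length of the concatenation of X
    in the given order. The maximization over sub-multisets is omitted.
    """
    items = list(items)
    if len(items) < 2:
        raise ValueError("need at least two objects")
    g_all = compressed_len(b"".join(items))
    g_min = min(compressed_len(x) for x in items)
    # Compressed length of the multiset with one member left out, maximized
    # over the choice of the omitted member.
    g_max_rest = max(
        compressed_len(b"".join(items[:i] + items[i + 1:]))
        for i in range(len(items))
    )
    return (g_all - g_min) / g_max_rest
```

Under these assumptions, a two-element multiset reduces to the pairwise case: `ncd1_multiset([x, y])` returns the same value as `ncd_pair(x, y)`. Because a real compressor is order-sensitive, the result for larger multisets can depend on the order in which the members are concatenated.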
Index Terms— Normalized compression distance, multisets or multiples, pattern recognition, data
mining, similarity, classification, Kolmogorov complexity, retinal progenitor cells, synthetic data, organelle
transport, handwritten character recognition
Additional Metadata | |
---|---|
Publisher | I.E.E.E. Computer Society Press |
Journal | IEEE Transactions on Pattern Analysis and Machine Intelligence |
Organisation | Directie |
Citation | Cohen, A., & Vitányi, P. (2015). Normalized Compression Distance of Multisets with Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8), 1602–1614. |