Clustering demographics and sequences of diagnosis codes

Zhong, Haodi; Loukides, Grigorios; Pissis, Solon

doi:10.1109/JBHI.2021.3129461

2021-11-19

IEEE Journal of Biomedical and Health Informatics, Volume 26, Issue 5, p. 2351-2359

A Relational-Sequential dataset (or RS-dataset for short) contains records, each comprising a patient's values in demographic attributes and their sequence of diagnosis codes. Clustering an RS-dataset is helpful for analyses ranging from pattern mining to classification. However, existing methods are not appropriate for this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. We formalize the task of clustering an RS-dataset as an optimization problem. At the heart of the problem is a distance measure we design to quantify the pairwise similarity between records of an RS-dataset. Our measure uses a tree structure that encodes hierarchical relationships between records, based on their demographics, as well as an edit-distance-like measure that captures both the sequentiality and the semantic similarity of diagnosis codes. We also develop an algorithm which first identifies k representative records (centers), for a given k, and then constructs clusters, each containing one center and the records that are closer to that center than to any other center. Experiments using two Electronic Health Record datasets demonstrate that our algorithm constructs compact and well-separated clusters, which preserve meaningful relationships between demographics and sequences of diagnosis codes, while being efficient and scalable.
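The center-based scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or distance measure: it assumes a plain attribute-mismatch rate in place of the tree-based demographic component, a unit-cost Levenshtein distance in place of the semantic edit-distance-like measure, and a farthest-first heuristic for choosing centers. The names `record_distance`, `pick_centers`, `assign`, the `weight` parameter, and the toy records are all hypothetical.

```python
from typing import List, Sequence, Tuple

# A record is (demographic values, sequence of diagnosis codes).
Record = Tuple[Tuple[str, ...], Tuple[str, ...]]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost Levenshtein distance between two code sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def record_distance(r: Record, s: Record, weight: float = 0.5) -> float:
    """Assumed measure: weighted sum of demographic mismatch rate
    and length-normalized edit distance (both in [0, 1])."""
    demo_r, seq_r = r
    demo_s, seq_s = s
    demo = sum(x != y for x, y in zip(demo_r, demo_s)) / max(len(demo_r), 1)
    seq = edit_distance(seq_r, seq_s) / max(len(seq_r), len(seq_s), 1)
    return weight * demo + (1 - weight) * seq


def pick_centers(records: List[Record], k: int) -> List[int]:
    """Farthest-first choice of k center indices (a stand-in heuristic)."""
    centers = [0]
    while len(centers) < min(k, len(records)):
        far = max(
            (i for i in range(len(records)) if i not in centers),
            key=lambda i: min(record_distance(records[i], records[c]) for c in centers),
        )
        centers.append(far)
    return centers


def assign(records: List[Record], centers: List[int]) -> List[int]:
    """Label each record with the index of its closest center."""
    return [min(centers, key=lambda c: record_distance(r, records[c])) for r in records]


# Toy records, invented for illustration only.
records: List[Record] = [
    (("F", "30-39"), ("E11", "I10")),
    (("F", "30-39"), ("E11", "I10", "N18")),
    (("M", "70-79"), ("J44", "J18")),
    (("M", "70-79"), ("J44",)),
]
centers = pick_centers(records, k=2)
labels = assign(records, centers)
```

On these toy records, the two records sharing demographics and a common code prefix end up with the same center, which is the kind of grouping the abstract describes; the paper's own measure and center selection should be consulted for the actual method.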

Additional Metadata
Keywords	Atomic measurements, Task analysis, Semantics, Market research, Diagnosis codes, Demographics, Data mining, Codes, Clustering algorithms, Clustering
Persistent URL	doi.org/10.1109/JBHI.2021.3129461
Journal	IEEE Journal of Biomedical and Health Informatics
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Zhong, H., Loukides, G., & Pissis, S. (2021). Clustering demographics and sequences of diagnosis codes. IEEE Journal of Biomedical and Health Informatics, 26(5), 2351–2359. doi:10.1109/JBHI.2021.3129461
