Clustering sequence graphs

Zhong, Haodi; Loukides, Grigorios; Pissis, Solon

doi:10.1016/j.datak.2022.101981

2022-03-01

Data & Knowledge Engineering, Volume 138, p. 101981:1-101981:21

In application domains ranging from social networks to e-commerce, it is important to cluster users with respect to both their relationships (e.g., friendship or trust) and their actions (e.g., visited locations or rated products). Motivated by these applications, we introduce the task of clustering the nodes of a sequence graph, i.e., a graph whose nodes are labeled with strings (e.g., sequences of users’ visited locations or rated products). Neither string clustering algorithms nor graph clustering algorithms are appropriate for this task, as they do not consider the structure of the strings and of the graph simultaneously. Moreover, attributed graph clustering algorithms generally construct poor solutions because they need to represent a string as a vector of attributes, which inevitably loses information and may harm clustering quality. To address the task, we first propose two pairwise distance measures for sequence graphs, one based on edit distance and shortest path distance and another one based on SimRank. We then formalize the clustering problem under each measure and show that it is NP-hard. In addition, we design a polynomial-time 2-approximation algorithm, as well as a heuristic for the problem. Experiments using real datasets and a case study demonstrate the effectiveness and efficiency of our methods.
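To make the setting concrete, the following is a minimal illustrative sketch, not the paper's method. It assumes a node-to-node distance formed as a weighted sum of normalized edit distance between string labels and normalized hop distance in an undirected graph, and it clusters with the classic farthest-first traversal, a well-known 2-approximation for metric k-center. The abstract states neither the exact distance definition nor the clustering objective, so the weighting parameter alpha, the normalizations, the k-center objective, and all function names here are hypothetical.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hop_distances(adj, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def combined_distance(adj, labels, alpha=0.5):
    """Hypothetical distance: alpha * string part + (1 - alpha) * graph part.

    Both parts are scaled to [0, 1]; unreachable pairs get the maximum
    graph distance. This weighting is an assumption, not the paper's measure.
    """
    nodes = list(labels)
    max_len = max(len(s) for s in labels.values()) or 1
    max_hops = max(len(nodes) - 1, 1)
    d = {}
    for u in nodes:
        hops = hop_distances(adj, u)
        for v in nodes:
            str_part = edit_distance(labels[u], labels[v]) / max_len
            graph_part = hops.get(v, max_hops) / max_hops
            d[u, v] = alpha * str_part + (1 - alpha) * graph_part
    return d


def farthest_first(nodes, d, k):
    """Farthest-first traversal: repeatedly pick the node farthest from
    the current centers, then assign each node to its nearest center."""
    centers = [nodes[0]]
    nearest = {v: d[centers[0], v] for v in nodes}
    while len(centers) < k:
        nxt = max(nodes, key=lambda v: nearest[v])
        centers.append(nxt)
        for v in nodes:
            nearest[v] = min(nearest[v], d[nxt, v])
    assignment = {v: min(centers, key=lambda c: d[c, v]) for v in nodes}
    return centers, assignment


# Toy sequence graph: a path 0-1-2-3 whose nodes carry short strings.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: "abc", 1: "abd", 2: "xyz", 3: "xyw"}
d = combined_distance(adj, labels)
centers, assignment = farthest_first(list(labels), d, k=2)
print(centers, assignment)
```

In this toy example, nodes that are both adjacent and carry similar strings end up in the same cluster. The 2-approximation guarantee of farthest-first traversal relies on the distance satisfying the triangle inequality, which holds for a weighted sum of edit distance and hop distance on an undirected graph.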

Additional Metadata
Keywords	Sequence clustering, Graph clustering, Sequential data
Persistent URL	doi.org/10.1016/j.datak.2022.101981
Journal	Data & Knowledge Engineering
Project	Algorithms for PAngenome Computational Analysis, Pan-genome Graph Algorithms and Data Integration
Grant	This work was funded by the European Commission 7th Framework Programme; grant id h2020/956229 - Algorithms for PAngenome Computational Analysis (ALPACA), This work was funded by the European Commission 7th Framework Programme; grant id h2020/872539 - Pan-genome Graph Algorithms and Data Integration (PANGAIA)
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Zhong, H., Loukides, G., & Pissis, S. (2022). Clustering sequence graphs. Data & Knowledge Engineering, 138, 101981:1–101981:21. https://doi.org/10.1016/j.datak.2022.101981
