We study the problem of making a de Bruijn graph (dBG), constructed from a collection of strings, weakly connected while minimizing the total cost of edge additions. The input graph is a dBG that can be made weakly connected by adding edges (along with extra nodes if needed) from the underlying complete dBG. The problem arises from genome reconstruction, where the dBG is constructed from a set of sequences generated from a genome sample by a sequencing experiment. Due to sequencing errors, the dBG is never Eulerian in practice and is often not even weakly connected. We show the following results for a dBG G(V, E) of order k consisting of d weakly connected components: 1. Making G weakly connected by adding a set of edges of minimal total cost is NP-hard. 2. No PTAS exists for making G weakly connected by adding a set of edges of minimal total cost (unless the unique games conjecture fails). We complement this result by showing that there does exist a polynomial-time (2 − 2/d)-approximation algorithm for the problem. 3. We consider a restricted version of the above problem, where we are asked to make G weakly connected by only adding directed paths between pairs of components. We show that making G weakly connected by adding d− 1 such paths of minimal total cost can be done in O(k|V |α(|V |) + |E|) time, where α(·) is the inverse Ackermann function. This improves on the O(k|V | log(|V |) + |E|)-time algorithm proposed by Bernardini et al. [CPM 2022] for the same restricted problem. 4. An ILP formulation of polynomial size for making G Eulerian with minimal total cost.

, , ,
doi.org/10.4230/LIPIcs.CPM.2024.6
Leibniz International Proceedings in Informatics (LIPIcs)
35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Bernardini, G., Chen, H., Gørtz, I. L., Krogh, C., Loukides, G., Pissis, S., … Sweering, M. (2024). Connecting de Bruijn graphs. In Proceedings of the 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024 (pp. 6:1–6:16). doi:10.4230/LIPIcs.CPM.2024.6