2022-06-22
Making de Bruijn graphs Eulerian
Publication
Publication
A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler's theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph GS,k(V, E), where V is the set of length-(k − 1) substrings of the strings in S, and GS,k contains an edge (u, v) with multiplicity mu,v, if and only if the string u[0] · v is equal to the string u · v[k − 2] and this string occurs exactly mu,v times in total in strings in S. Let GΣ,k(VΣ,k, EΣ,k) be the complete dBG of Σk. The Eulerian Extension (EE) problem on GS,k asks to extend GS,k with a set B of nodes from VΣ,k and a smallest multiset A of edges from EΣ,k to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of GS,k but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on GS,k is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1. When GS,k is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying GΣ,k and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in O(|V |k log d + |E|) time, which is nearly optimal, since the size of GS,k is Θ(|V |k + |E|). 2. When GS,k is not balanced, we are asked to extend GS,k to HS,k(V ∪ B, E ∪ A) such that every node of HS,k is balanced and the total number |A| of added edges is minimized. We show that this problem can be solved in the optimal O(k|V | + |E| + |A|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.
Additional Metadata | |
---|---|
, , , | |
doi.org/10.4230/LIPIcs.CPM.2022.12 | |
Leibniz International Proceedings in Informatics, LIPIcs | |
Networks | |
33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 | |
Organisation | Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands |
Bernardini, G., Chen, H., Loukides, G., Pissis, S., Stougie, L., & Sweering, M. (2022). Making de Bruijn graphs Eulerian. In Annual Symposium on Combinatorial Pattern Matching (pp. 12:1–12:18). doi:10.4230/LIPIcs.CPM.2022.12 |