2022-06-01

# On Strings Having the Same Length-k Substrings

## Publication

*Presented at the Annual Symposium on Combinatorial Pattern Matching (June 2022), Prague, Czech Republic*

Let $\mathrm{Substr}_k(X)$ denote the set of length-$k$ substrings of a given string $X$ for a given integer $k > 0$. We study the following basic string problem, called $z$-Shortest $S_k$-Equivalent Strings: Given a set $S_k$ of $n$ length-$k$ strings and an integer $z > 0$, list $z$ shortest distinct strings $T_1, \ldots, T_z$ such that $\mathrm{Substr}_k(T_i) = S_k$, for all $i \in [1, z]$. The $z$-Shortest $S_k$-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest $S_k$-Equivalent Strings, referred to as Shortest $S_k$-Equivalent String, asks for a shortest string $X$ such that $\mathrm{Substr}_k(X) = S_k$. Our main contributions are summarized below:

Given a directed graph $G(V, E)$, the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of $G$ at least once. DCP can be solved in $\tilde{O}(|E||V|)$ time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest $S_k$-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP.

We show that the length of a shortest string output by Shortest $S_k$-Equivalent String is in $O(k + n^2)$. We generalize this bound by showing that the total length of $z$ shortest strings is in $O(zk + zn^2 + z^2 n)$. We derive these upper bounds by showing (asymptotically tight) bounds on the total length of $z$ shortest Eulerian walks in general directed graphs.

We present an algorithm for solving $z$-Shortest $S_k$-Equivalent Strings in $O(nk + n^2 \log^2 n + zn^2 \log n + |\text{output}|)$ time. If $z = 1$, the time becomes $O(nk + n^2 \log^2 n)$ by the fact that the size of the input is $\Theta(nk)$ and the size of the output is $O(k + n^2)$.
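To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it simply enumerates candidate strings level by level in order of increasing length and keeps those whose set of length-k substrings equals the input set. The function names and the `max_len` cap are assumptions introduced for illustration, the input strings are assumed to all have the same length k, and the running time is exponential in general, so it is only suitable for tiny inputs.

```python
from collections import defaultdict


def substr_k(x, k):
    """Return Substr_k(x): the set of length-k substrings of x."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}


def shortest_equivalent_strings(s_k, z, max_len):
    """Naive search for up to z shortest distinct strings T with
    Substr_k(T) == s_k. `max_len` is a hypothetical cap on the string
    length that guarantees termination (not a bound from the paper)."""
    if not s_k or z <= 0:
        return []
    k = len(next(iter(s_k)))  # assumes every string in s_k has length k

    # Map each length-(k-1) prefix to the letters that complete it
    # to a string of s_k.
    extensions = defaultdict(list)
    for s in sorted(s_k):
        extensions[s[:k - 1]].append(s[k - 1])

    results = []
    frontier = sorted(s_k)  # any valid T must start with a string of s_k
    length = k
    while frontier and length <= max_len:
        # All strings in the frontier have the same length, so results
        # are collected in order of non-decreasing length.
        for t in frontier:
            if substr_k(t, k) == s_k:
                results.append(t)
                if len(results) == z:
                    return results
        # Extend each candidate by one letter, keeping only extensions
        # whose newest length-k substring belongs to s_k.
        frontier = [
            t + c
            for t in frontier
            for c in extensions[t[len(t) - k + 1:]]
        ]
        length += 1
    return results
```

For example, `shortest_equivalent_strings({"ab", "ba"}, 2, 10)` returns the two length-3 strings `"aba"` and `"bab"`, while an input such as `{"ab", "cd"}` yields an empty list because no single string has exactly those two length-2 substrings.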

Additional Metadata | |
---|---|
DOI | doi.org/10.4230/LIPIcs.CPM.2022.16 |
Series | Leibniz International Proceedings in Informatics |
Project | Pan-genome Graph Algorithms and Data Integration, Algorithms for PAngenome Computational Analysis, Optimization for and with Machine Learning, Networks |
Conference | Annual Symposium on Combinatorial Pattern Matching |
Organisation | Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands |

Bernardini, G., Conte, A., Gabory, E., Grossi, R., Loukides, G., Pissis, S., … Sweering, M. (2022). On Strings Having the Same Length-k Substrings. In 33rd Annual Symposium on Combinatorial Pattern Matching (pp. 16:1–16:17). doi:10.4230/LIPIcs.CPM.2022.16