2021-08-31

# Faster algorithms for longest common substring

## Publication

*Presented at the 29th Annual European Symposium on Algorithms (ESA 2021) (September 2021), Online, Lisbon, Portugal*

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ / log n) space and read in O(n log σ / log n) time. We show that, in this model, we can compute an LCS in time O(n log σ / √(log n)), which is sublinear in n if σ = 2^{o(√(log n))} (in particular, if σ = O(1)), using optimal space O(n log σ / log n).
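To make the problem statement concrete, here is a minimal Python sketch of the textbook quadratic-time dynamic program for LCS. It is a baseline for illustration only, not the suffix-tree algorithms cited above and not the sublinear-time algorithm of the paper; the function name `longest_common_substring` is an assumption introduced here.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Return one longest string occurring as a fragment of both s and t.

    Illustrative baseline running in O(|s| * |t|) time; it is NOT the
    algorithm from the paper.
    """
    best_len = 0
    best_end = 0  # exclusive end index in s of the best match found so far
    # prev[j] holds the length of the longest common suffix of the
    # previously processed prefix of s and the prefix t[:j].
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common suffix ending at s[i-2], t[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return s[best_end - best_len:best_end]
```

For example, `longest_common_substring("banana", "ananas")` returns `"anana"`. The sketch compares characters one at a time, whereas the word RAM bounds above depend on packing many characters into a single machine word.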

We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log^k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using O(n) space, as do the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
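The k-mismatch variant can be illustrated with a simple quadratic-time sketch: for every alignment of S against T, slide a window that contains at most k mismatching positions. This is only a reference implementation of the problem definition, assuming a hypothetical function name `k_mismatch_lcs`; it does not use heavy-path decomposition and is not the O(n log^{k-1/2} n)-time algorithm of the paper.

```python
def k_mismatch_lcs(s: str, t: str, k: int) -> tuple[int, int, int]:
    """Return (length, start_in_s, start_in_t) of a longest pair of
    equal-length fragments of s and t that differ in at most k positions.

    Illustrative O(|s| * |t|) baseline; NOT the algorithm from the paper.
    """
    best = (0, 0, 0)
    n, m = len(s), len(t)
    # Each shift fixes one alignment (diagonal): s[i] faces t[i - shift].
    for shift in range(-(m - 1), n):
        i0 = max(shift, 0)   # first aligned position in s
        j0 = i0 - shift      # first aligned position in t
        length = min(n - i0, m - j0)
        left = 0
        mismatches = 0
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it is valid again.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            window = right - left + 1
            if window > best[0]:
                best = (window, i0 + left, j0 + left)
    return best
```

With `s = "abcde"` and `t = "abxde"`, the call `k_mismatch_lcs(s, t, 1)` reports length 5, since the two strings differ in a single position, while `k_mismatch_lcs(s, t, 0)` reports length 2, which matches the exact LCS length.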

Additional Metadata | |
---|---|
Samsung R&D Institute, Warsaw, Poland | |
doi.org/10.4230/LIPIcs.ESA.2021.30 | |
Leibniz International Proceedings in Informatics | |
29th Annual European Symposium on Algorithms (ESA 2021) | |
Organisation | Life Sciences and Health |
Charalampopoulos, P, Kociumaka, T, Pissis, S, & Radoszewski, J. (2021). Faster algorithms for longest common substring. In Annual European Symposium on Algorithms (pp. 30:1–30:17). doi:10.4230/LIPIcs.ESA.2021.30 | |