Faster algorithms for longest common substring

Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon; Radoszewski, Jakub

doi:10.4230/LIPIcs.ESA.2021.30

2021-08-31

Presented at the 29th Annual European Symposium on Algorithms (ESA 2021) (September 2021), Online, Lisbon, Portugal

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ/log n) space and read in O(n log σ/log n) time. We show that, in this model, we can compute an LCS in time O(n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = O(1)), using optimal space O(n log σ/log n).
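
To make the problem statement concrete, here is a minimal Python sketch of a classical linear-time-style baseline. It is not the sublinear word-RAM algorithm of the paper, and it swaps in a suffix automaton of S where Weiner and Farach use a suffix tree. The function name `longest_common_substring` and all variable names are illustrative assumptions; transitions are stored in dictionaries, so the running time is expected-linear rather than worst-case.

```python
def longest_common_substring(s, t):
    """Return one longest string that occurs in both s and t.

    Baseline sketch only: builds a suffix automaton of s, then scans t,
    tracking the longest suffix of the scanned prefix that occurs in s.
    """
    # State 0 is the initial state; nxt[v] maps a character to a state.
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        length.append(length[last] + 1)
        link.append(-1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's transitions with a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                length.append(length[p] + 1)
                link.append(link[q])
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    state, cur_len = 0, 0
    best_len, best_end = 0, 0
    for i, ch in enumerate(t):
        # Follow suffix links until the current match can be extended by ch.
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0  # ch does not occur in s at all
        if cur_len > best_len:
            best_len, best_end = cur_len, i + 1
    return t[best_end - best_len:best_end]
```

For example, `longest_common_substring("banana", "ananas")` finds the common fragment `"anana"`. When several longest common substrings exist, this sketch reports the one that ends first in `t`.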

We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log^k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
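
The k-mismatch variant can be illustrated with a simple quadratic-time reference implementation. The sketch below is not the O(n log^{k-1/2} n) algorithm of the paper, nor the heavy-path approach of the earlier work; it tries every alignment of S against T and slides a window that tolerates at most k mismatching positions. The function name `k_mismatch_lcs` and its return convention are assumptions made for illustration.

```python
def k_mismatch_lcs(s, t, k):
    """Return (length, start_in_s, start_in_t) of a longest pair of
    equal-length fragments of s and t that differ in at most k positions.

    Brute-force baseline: O(len(s) * len(t)) time, O(1) extra space.
    """
    n, m = len(s), len(t)
    best = (0, 0, 0)
    # shift = i - j fixes one alignment (diagonal) between s[i] and t[j].
    for shift in range(-(m - 1), n):
        i0 = max(0, shift)      # first aligned position in s
        j0 = i0 - shift         # matching first position in t
        span = min(n - i0, m - j0)
        left = 0                # window start, relative to the diagonal
        mismatches = 0
        for right in range(span):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window until it holds at most k mismatches again.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            window = right - left + 1
            if window > best[0]:
                best = (window, i0 + left, j0 + left)
    return best
```

With `s = "abcde"` and `t = "abxde"`, one mismatch is enough to align the two strings completely, while with `k = 0` the function reduces to the exact LCS length, which is 2 here. A baseline of this kind is mainly useful for checking faster implementations on small inputs.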

Additional Metadata
Keywords	Longest common substring, k mismatches, Wavelet tree
Stakeholder	Samsung R&D Institute, Warsaw, Poland
Persistent URL	doi.org/10.4230/LIPIcs.ESA.2021.30
Series	Leibniz International Proceedings in Informatics (LIPIcs)
Conference	29th Annual European Symposium on Algorithms (ESA 2021)
Organisation	Evolutionary Intelligence
Citation	Charalampopoulos, P., Kociumaka, T., Pissis, S., & Radoszewski, J. (2021). Faster algorithms for longest common substring. In Annual European Symposium on Algorithms (pp. 30:1–30:17). doi:10.4230/LIPIcs.ESA.2021.30
