Substring complexity in sublinear space

Bernardini, Giulia; Fici, Gabriele; Gawrychowski, Paweł; Pissis, Solon

doi:10.4230/LIPIcs.ISAAC.2023.12

2023-11-28

Presented at the 34th International Symposium on Algorithms and Computation, ISAAC 2023 (December 2023), Kyoto, Japan

Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows–Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard.

Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k : k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T.

Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2−o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:

- O((n^3 log b)/b^2) time using O(b) space, for any b ∈ [1,n], in the comparison model.
- Õ(n^2/b) time using Õ(b) space, for any b ∈ [√n,n], in the word RAM model.

This gives an Õ(n^{1+ϵ})-time and Õ(n^{1−ϵ})-space algorithm to compute δ, for any 0 < ϵ ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
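
To make the definitions of S_T(k) and δ in the abstract concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm of Kociumaka et al., nor either of the sublinear-space algorithms of the paper: it simply stores every length-k substring in a set, so it may use quadratic time and space. The function names substring_complexity and delta are illustrative choices, not taken from the paper.

```python
from fractions import Fraction


def substring_complexity(text: str, k: int) -> int:
    """Return S_T(k): the number of distinct length-k substrings of text."""
    if k < 1 or k > len(text):
        return 0
    # Naive approach: materialise every length-k window in a set.
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta(text: str) -> Fraction:
    """Return delta = sup over k >= 1 of S_T(k) / k, as an exact fraction.

    S_T(k) is 0 for k > n, so the supremum is a maximum over k in [1, n].
    """
    n = len(text)
    if n == 0:
        return Fraction(0)
    return max(Fraction(substring_complexity(text, k), k) for k in range(1, n + 1))


if __name__ == "__main__":
    for t in ["aaaa", "abab", "aaababbbaa"]:
        print(t, delta(t))
```

For the string aaababbbaa, all eight binary strings of length 3 occur as substrings, so the maximum is attained at k = 3 rather than k = 1. This shows why δ has to be taken over all lengths k and not just over the alphabet size.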

Additional Metadata
Keywords	String algorithm, Sublinear-space algorithm, Substring complexity
Persistent URL	doi.org/10.4230/LIPIcs.ISAAC.2023.12
Series	Leibniz International Proceedings in Informatics
Project	Pan-genome Graph Algorithms and Data Integration, Algorithms for PAngenome Computational Analysis
Conference	34th International Symposium on Algorithms and Computation, ISAAC 2023
Grant	This work was funded by the European Commission 7th Framework Programme; grant id h2020/872539 - Pan-genome Graph Algorithms and Data Integration (PANGAIA), This work was funded by the European Commission 7th Framework Programme; grant id h2020/956229 - Algorithms for PAngenome Computational Analysis (ALPACA)
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Bernardini, G., Fici, G., Gawrychowski, P., & Pissis, S. (2023). Substring complexity in sublinear space. In International Symposium on Algorithms and Computation (pp. 12:1–12:19). doi:10.4230/LIPIcs.ISAAC.2023.12
