Constructing antidictionaries in output-sensitive space

Ayad, Lorraine; Badkobeh, Golnaz; Fici, Gabriele; Heliou, Alice; Pissis, Solon

doi:10.1109/DCC.2019.00062

L.A.K. Ayad (Lorraine), G. Badkobeh (Golnaz), G. Fici (Gabriele), A. Heliou (Alice) and S. Pissis (Solon)

2019-03-26

Presented at the 2019 Data Compression Conference, DCC 2019 (March 2019), Snowbird, Utah, USA

A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}(y_1\#\cdots\#y_k)$ of minimal absent words of length at most $\ell$ of the word $y = y_1\#y_2\#\cdots\#y_k$, where $\# \notin \Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms, because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}(y_1\#\cdots\#y_N)\| = o(n)$ for all $N \in [1, k]$. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\|M^{12}(y_1\#\cdots\#y_k)\| \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all of $M^{\ell}(y_1), \ldots, M^{\ell}(y_1\#\cdots\#y_k)$ can be computed in $O\left(kn + \sum_{N=1}^{k} \|M^{\ell}(y_1\#\cdots\#y_N)\|\right)$ total time using $O(\mathrm{MaxIn} + \mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\mathrm{MaxOut} = \max\{\|M^{\ell}(y_1\#\cdots\#y_N)\| : N \in [1, k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
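To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper and is not output-sensitive in space, since it stores every distinct factor of length at most ℓ. The function names `factors_up_to`, `minimal_absent_words`, and `incremental_antidictionaries` are hypothetical and introduced only for illustration. Because the separator # is not in the alphabet, the sketch simply never lets a factor span two input words.

```python
def factors_up_to(words, max_len):
    """Collect all factors of length <= max_len of the given words.

    Factors never span two words, which mirrors joining the words
    with a separator '#' that is not part of the alphabet.
    """
    facs = {""}  # the empty word occurs in every word
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(len(w), i + max_len) + 1):
                facs.add(w[i:j])
    return facs


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len (illustrative only)."""
    facs = factors_up_to(words, max_len)
    maws = set()
    for u in facs:
        if len(u) >= max_len:
            continue
        for a in alphabet:
            x = u + a
            # x is minimal absent when x itself is absent while its longest
            # proper prefix (u) and longest proper suffix (x[1:]) both occur.
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


def incremental_antidictionaries(words, alphabet, max_len):
    """Naively recompute the set for each prefix y_1, y_1#y_2, ..., y_1#...#y_k."""
    return [
        minimal_absent_words(words[:n], alphabet, max_len)
        for n in range(1, len(words) + 1)
    ]


if __name__ == "__main__":
    for n, maws in enumerate(incremental_antidictionaries(["abaab", "bb"], "ab", 3), 1):
        print(n, sorted(maws))
```

The size written with double bars in the abstract would correspond here to `sum(len(x) for x in maws)`. The sketch recomputes each prefix from scratch, whereas the paper's contribution is to update the set incrementally within the stated space bound.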

Additional Metadata
Keywords	Absent words, Antidictionaries, Data compression, Output sensitive algorithms, String algorithms
Persistent URL	doi.org/10.1109/DCC.2019.00062
Conference	2019 Data Compression Conference, DCC 2019
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Ayad, L., Badkobeh, G., Fici, G., Heliou, A., & Pissis, S. (2019). Constructing antidictionaries in output-sensitive space. In Data Compression Conference Proceedings (pp. 538–547). doi:10.1109/DCC.2019.00062

See Also
article Constructing antidictionaries of long texts in output-sensitive space L.A.K. Ayad (Lorraine), G. Badkobeh (Golnaz), G. Fici (Gabriele), A. Heliou (Alice) and S. Pissis (Solon)
