A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, y2,...,yk over an alphabet Σ, we are asked to compute the set M(y 1#...# y k ) of minimal absent words of length at most ℓ of word y=y1#y2#...#yk, #∉Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M(y 1#...# y N ) || =o(n), for all N ϵ[1, k]. For instance, in the human genome, n ≈ 3 × 10 9 but ||M (y 1#...#yk ) 12 || ≈ 10 6. We consider a constant-sized alphabet for stating our results. We show that all M(y 1 ) ,...,M(y 1#...# y k ) can be computed in O(kn+Σ N=1 k ||M(y 1 #...#(y N ) ||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y 1 ,...,y k and MaxOut=max{||M (y 1)#...# (y N ) ||:N ϵ[1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
Absent words, Antidictionaries, Data compression, Output sensitive algorithms, String algorithms
2019 Data Compression Conference, DCC 2019
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

Ayad, L.A.K, Badkobeh, G, Fici, G, Heliou, A, & Pissis, S. (2019). Constructing antidictionaries in output-sensitive space. In Data Compression Conference Proceedings (pp. 538–547). doi:10.1109/DCC.2019.00062