A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, y2,...,yk over an alphabet Σ, we are asked to compute the set M(y 1#...# y k ) of minimal absent words of length at most ℓ of word y=y1#y2#...#yk, #∉Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M(y 1#...# y N ) || =o(n), for all N ϵ[1, k]. For instance, in the human genome, n ≈ 3 × 10 9 but ||M (y 1#...#yk ) 12 || ≈ 10 6. We consider a constant-sized alphabet for stating our results. We show that all M(y 1 ) ,...,M(y 1#...# y k ) can be computed in O(kn+Σ N=1 k ||M(y 1 #...#(y N ) ||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y 1 ,...,y k and MaxOut=max{||M (y 1)#...# (y N ) ||:N ϵ[1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
Additional Metadata
Keywords Absent words, Antidictionaries, Data compression, Output sensitive algorithms, String algorithms
Persistent URL dx.doi.org/10.1109/DCC.2019.00062
Conference 2019 Data Compression Conference, DCC 2019
Ayad, Lorraine, Badkobeh, G, Fici, G, Heliou, A, & Pissis, S.P. (2019). Constructing antidictionaries in output-sensitive space. In Data Compression Conference Proceedings (pp. 538–547). doi:10.1109/DCC.2019.00062