In a weighted sequence, every position of the sequence and every letter of the alphabet is assigned a probability that the letter occurs at that position. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known as Position Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of the probabilities of the letters of P at positions i, …, i+m−1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we pre-process a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n that answers pattern matching queries in the optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We thus improve on the most efficient previously known index, by Amir et al. (Theor. Comput. Sci., 2008), whose size and construction time are O(nz² log z), while preserving the optimal query time. Along the way, we develop a new, more straightforward index for the so-called property matching problem. We provide an open-source implementation of our data structure and present experimental results using both synthetic and real data. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. at EDBT 2016, as well as an improvement in the space complexity of their general index. We also present applications of our index.
Information and Computation
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Barton, C., Kociumaka, T., Liu, C., Pissis, S., & Radoszewski, J. (2019). Indexing weighted sequences: Neat and efficient. Information and Computation, 270, 104462:1–104462:21. doi:10.1016/j.ic.2019.104462