Indexing weighted sequences: Neat and efficient

Barton, Carl; Kociumaka, Tomasz; Liu, Chang; Pissis, Solon; Radoszewski, Jakub

doi:10.1016/j.ic.2019.104462

2019-09-04

Information and Computation, Volume 270, p. 104462:1-104462:21

In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known under the name of Position Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, ..., i+m−1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to pre-process a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n that answers pattern matching queries in the optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We thus improve the most efficient previously known index by Amir et al. (Theor. Comput. Sci., 2008) with size and construction time O(nz² log z), preserving optimal query time. On the way we develop a new, more straightforward index for the so-called property matching problem. We provide an open-source implementation of our data structure and present experimental results using both synthetic and real data. Our construction allows us also to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. at EDBT 2016 and an improvement of the space complexity of their general index. We also present applications of our index.
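
To make the occurrence definition in the abstract concrete, the following is a minimal sketch of a naive check that multiplies letter probabilities over each window and compares the product with 1/z. It is not the index described in the article: it runs in O(nm) time per query and serves only to illustrate the definition. The function name `occurrences`, the list-of-dicts representation of X, and the 0-based positions are assumptions made for this illustration.

```python
from fractions import Fraction

def occurrences(X, P, z):
    """Naive illustration (not the paper's index): return the 0-based
    positions i at which pattern P occurs in weighted sequence X, i.e.
    where the product of the probabilities of P's letters at positions
    i, ..., i+m-1 is at least 1/z.

    X is assumed to be a list of dicts mapping letters to probabilities;
    letters missing from a dict are treated as having probability 0.
    """
    threshold = 1 / Fraction(z)
    n, m = len(X), len(P)
    result = []
    for i in range(n - m + 1):
        prob = Fraction(1)
        for j, letter in enumerate(P):
            prob *= X[i + j].get(letter, Fraction(0))
            if prob < threshold:
                break  # the product can only shrink further
        else:
            result.append(i)
    return result

# Hypothetical weighted sequence of length 4 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    {"b": Fraction(1)},
]

print(occurrences(X, "ab", 4))  # threshold 1/4
print(occurrences(X, "ab", 2))  # threshold 1/2
```

Exact rational arithmetic via `Fraction` is used here so that products equal to the threshold are not lost to floating-point rounding; an implementation working at scale would more likely compare sums of logarithms or use another representation.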

Additional Metadata
Keywords	Position weight matrix (PWM), Property indexing, Suffix tree, Text indexing, Weighted sequence
Persistent URL	doi.org/10.1016/j.ic.2019.104462
Journal	Information and Computation
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Barton, C., Kociumaka, T., Liu, C., Pissis, S., & Radoszewski, J. (2019). Indexing weighted sequences: Neat and efficient. Information and Computation, 270, 104462:1–104462:21. doi:10.1016/j.ic.2019.104462
