Bidirectional String Anchors for Improved Text Indexing and Top-Ƙ Similarity Search
The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer ℓ, our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ ; and that these sizes are
competitive to or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-Ƙ similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-Ƙ similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster.
|, , , , , , , , ,|
|IEEE Transactions on Knowledge and Data Engineering|
|Organisation||Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands|
Loukides, G, Pissis, S, & Sweering, M.J.M. (2023). Bidirectional String Anchors for Improved Text Indexing and Top-Ƙ Similarity Search. IEEE Transactions on Knowledge and Data Engineering. doi:10.1109/TKDE.2022.3231780