2023-06-21

# Suffix-prefix queries on a dictionary

## Publication

*Presented at the 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 (June 2023), Marne-la-Vallée, France*

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary $R$ of $k$ strings, $S_1, \ldots, S_k$, of total length $n$, and we are asked to find the length $\mathrm{SPL}_{i,j}$ of the longest string that is both a suffix of $S_i$ and a prefix of $S_j$, for all $i, j \in [1, k]$. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size $\sigma \le n^{O(1)}$, APSP can be solved in the optimal $O(n + k^2)$ time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992].

In many bioinformatics applications, such as in sequence assembly, the size $k$ of dictionary $R$ is very large. In particular, $k^2$ usually dominates $n$, and thus the $k^2$ factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries:

- One-to-One$(i, j)$: output $\mathrm{SPL}_{i,j}$.
- One-to-All$(i)$: output $\mathrm{SPL}_{i,j}$ for every $j \in [1, k]$.
- Report$(i, \ell)$: output all distinct $j \in [1, k]$ such that $\mathrm{SPL}_{i,j} \ge \ell$, where $\ell \ge 0$ is an integer.
- Count$(i, \ell)$: output the number of distinct $j \in [1, k]$ such that $\mathrm{SPL}_{i,j} \ge \ell$, where $\ell \ge 0$ is an integer.
- Top$(i, K)$: output $K$ distinct $j \in [1, k]$ with the highest values of $\mathrm{SPL}_{i,j}$, breaking ties arbitrarily.

We assume the standard word RAM model of computation with word size $w = \Omega(\log n)$ and an integer alphabet of size $\sigma \le n^{O(1)}$. We show the following upper bounds:

Query | Space (words) | Query time | Note |
---|---|---|---|
One-to-One$(i, j)$ | $O(n)$ | $O(\log \log k)$ | Theorem 11 |
One-to-All$(i)$ | $O(n)$ | $O(k)$ | Theorem 14 |
Report$(i, \ell)$ | $O(n)$ | $O(\log n / \log \log n + \mathrm{output})$ | Theorem 19(i) |
Count$(i, \ell)$ | $O(n)$ | $O(\log n / \log \log n)$ | Theorem 19(ii) |
Top$(i, K)$ | $O(n)$ | $O(\log^2 n / \log \log n + K)$ | Theorem 22 |

We also present efficient algorithms for constructing these data structures.
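To make the query definitions concrete, the following is a minimal brute-force sketch in Python. It is not the paper's data structure and achieves none of the stated bounds; it only illustrates what each query returns. The function names (`spl`, `one_to_one`, `one_to_all`, `report`, `count`, `top`) and the example dictionary are hypothetical. The sketch assumes the literal definition from the abstract, so for i = j the value is the full length of the string. Indices are 1-based to match the text.

```python
from typing import List


def spl(s: str, t: str) -> int:
    """Length of the longest string that is both a suffix of s and a prefix of t.

    Naive check of every candidate length, longest first (illustration only).
    """
    for length in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:length]):
            return length
    return 0


def one_to_one(R: List[str], i: int, j: int) -> int:
    """One-to-One(i, j): SPL_{i,j} for 1-based indices i and j."""
    return spl(R[i - 1], R[j - 1])


def one_to_all(R: List[str], i: int) -> List[int]:
    """One-to-All(i): SPL_{i,j} for every j in [1, k], in order of j."""
    return [spl(R[i - 1], t) for t in R]


def report(R: List[str], i: int, ell: int) -> List[int]:
    """Report(i, ell): all j with SPL_{i,j} >= ell."""
    return [j for j, v in enumerate(one_to_all(R, i), start=1) if v >= ell]


def count(R: List[str], i: int, ell: int) -> int:
    """Count(i, ell): number of j with SPL_{i,j} >= ell."""
    return len(report(R, i, ell))


def top(R: List[str], i: int, K: int) -> List[int]:
    """Top(i, K): K indices j with the highest SPL_{i,j}.

    Ties are broken by smaller j here; the problem allows any tie-breaking.
    """
    values = one_to_all(R, i)
    order = sorted(range(1, len(R) + 1), key=lambda j: -values[j - 1])
    return order[:K]


if __name__ == "__main__":
    R = ["abcab", "cabca", "abc"]  # hypothetical example dictionary, k = 3
    print(one_to_one(R, 2, 1))  # 4: "abca" is a suffix of S_2 and a prefix of S_1
    print(one_to_all(R, 1))     # [5, 3, 2]
    print(report(R, 1, 3))      # [1, 2]
    print(count(R, 1, 3))       # 2
    print(top(R, 1, 2))         # [1, 2]
```

Each call here costs time proportional to the lengths of the strings involved (or worse), which is exactly the overhead the data structures in the table are designed to avoid.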

Additional Metadata | |
---|---|
DOI | doi.org/10.4230/LIPIcs.CPM.2023.21 |
Series | Leibniz International Proceedings in Informatics |
Projects | Pan-genome Graph Algorithms and Data Integration, Algorithms for PAngenome Computational Analysis, Networks |
Conference | 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 |
Organisation | Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands |
Citation | Loukides, G., Pissis, S., Thankachan, S., & Zuba, W. (2023). Suffix-prefix queries on a dictionary. In Leibniz International Proceedings in Informatics, LIPIcs (pp. 21:1–21:20). doi:10.4230/LIPIcs.CPM.2023.21 |