Exponential Length Substrings in Pattern Matching

Revision en4, by catalystgma, 2024-09-30 22:10:13

Hi all,

I would like to share with you a part of my undergraduate thesis on a Multi-String Pattern Matcher data structure. In my opinion, it's easy to understand and hard to implement correctly and efficiently. It's competitive against other MSPM data structures (Aho-Corasick, suffix array/automaton/tree to name a few) when the dictionary size is specifically (uncommonly) large.

I would also like to sign up this entry to bashkort's Month of Blog Posts:-)

Abstract

This work describes a hash-based mass-searching algorithm, finding (count, location of first match) entries from a dictionary against a string $$$s$$$ of length $$$n$$$. The presented implementation makes use of all substrings of $$$s$$$ whose lengths are powers of $$$2$$$ to construct an offline algorithm that can, in some cases, reach a complexity of $$$O(n \log^2n)$$$ even if there are $$$O(n^2)$$$ possible matches. If there is a limit on the dictionary size $$$m$$$, then the precomputation complexity is $$$O(m + n \log^2n)$$$, and the search complexity is bounded by $$$O(n\log^2n + m\log n)$$$, even if it performs in practice like $$$O(n\log^2n + \sqrt{nm}\log n)$$$. Other applications, such as finding the number of distinct substrings of $$$s$$$ for each length between $$$1$$$ and $$$n$$$, can be done with the same algorithm in $$$O(n\log^2n)$$$.

Problem Description

We want to write an offline algorithm for the following problem, which receives as input a string $$$s$$$ of length $$$n$$$, and a dictionary $$$ts = \{t_1, t_2, .., t_{\lvert ts \rvert}\}$$$. As output, it expects for each string $$$t$$$ in the dictionary the number of times it is found in $$$s$$$. We could also ask for the position of the fist occurrence of each $$$t$$$ in $$$s$$$, but the paper mainly focuses on the number of matches.

Algorithm Description

We will build a DAG in which every node is mapped to a substring from $$$s$$$ whose length is a power of $$$2$$$. We will draw edges between any two nodes whose substrings are consecutive in $$$s$$$. The DAG has $$$O(n \log n)$$$ nodes and $$$O(n \log^2 n)$$$ edges.

We will break every $$$t_i \in ts$$$ down into a chain of substrings of $$$2$$$-exponential length in strictly decreasing order (e.g. if $$$\lvert t \rvert = 11$$$, we will break it into $$$\{t[1..8], t[9..10], t[11]\}$$$). If $$$t_i$$$ occurs $$$k$$$ times in $$$s$$$, we will find $$$t_i$$$'s chain $$$k$$$ times in the DAG.

Tags mspm, hashing, trie

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en31 English catalystgma 2024-10-06 14:47:18 0 (published)
en30 English catalystgma 2024-10-06 14:45:49 459 tag typo
en29 English catalystgma 2024-10-06 14:23:07 381 Small changes, tags
en28 English catalystgma 2024-10-06 13:10:51 648 update toy LLM test
en27 English catalystgma 2024-10-05 19:56:49 95
en26 English catalystgma 2024-10-04 22:27:05 391 table floats
en25 English catalystgma 2024-10-04 22:00:48 856 Acknowledgements completion
en24 English catalystgma 2024-10-04 18:04:25 1430 Acknowledgements. TODO run nlp and reread
en23 English catalystgma 2024-10-04 17:37:45 1102 LLM copyright
en22 English catalystgma 2024-10-04 14:15:40 1694 Second batch benchmark
en21 English catalystgma 2024-10-04 14:02:15 4179 First batch benchmark
en20 English catalystgma 2024-10-04 13:21:21 2521 Practical Results part 1
en19 English catalystgma 2024-10-02 19:24:23 3730 Complexity computation: with DAG compression
en18 English catalystgma 2024-10-02 19:09:40 4996 Property 4.2.8 ended
en17 English catalystgma 2024-10-02 18:59:38 1435 upto Property 4.2.6.
en16 English catalystgma 2024-10-02 18:55:05 4390 Property 4.2.5 ended
en15 English catalystgma 2024-10-02 15:36:08 3267 Complexity computation: with DAG compression (beginning, upto Theorem 4.2.3)
en14 English catalystgma 2024-10-02 15:14:13 4221 Corner Case Improvement
en13 English catalystgma 2024-10-01 14:51:07 1113 Complexity computation: without DAG compression
en12 English catalystgma 2024-10-01 14:43:35 726 Complexity computation: without DAG compression (next table)
en11 English catalystgma 2024-10-01 14:37:54 5235 Complexity computation: without DAG compression (half)
en10 English catalystgma 2024-10-01 14:04:04 2080 DAG suffix compression theory
en9 English catalystgma 2024-10-01 13:31:28 1689 Token count bound and Token propagation complexity
en8 English catalystgma 2024-10-01 12:55:22 1272 Trie search comments
en7 English catalystgma 2024-10-01 12:25:08 1159 Trie Search images in spoiler
en6 English catalystgma 2024-09-30 23:25:19 1380 Redundancy Elimination upto Trie Search
en5 English catalystgma 2024-09-30 23:02:01 139 add image of decent size
en4 English catalystgma 2024-09-30 22:10:13 107 Tiny change: 'DAG.\n\n![]()' -> 'DAG.\n\n![ ](https://codeforces.me/5076b6/beamer_img2.png)'
en3 English catalystgma 2024-09-30 19:57:10 7 Upload beamer images
en2 English catalystgma 2024-09-30 18:59:08 1032
en1 English catalystgma 2024-09-30 18:14:49 1342 Initial revision (saved to drafts)