Hi,
For this problem I have a set S with initially 0 strings.
There are three types of queries:
- Add a string to set S with id = queryId.
- Remove the string from set S with id = removeId.
- Read a string M and print all strings in set S that have M as substring.
Let N be the size of S.
How would I solve this efficiently? I hope to reach a complexity of O((length of search string) + N) for query 3 and O(length of string) for queries 1 and 2. Is this possible, with for example a suffix tree?
I would like to use this for a fast search engine for a (big) data set that is changing dynamically.
Auto comment: topic has been updated by Robbinb1993 (previous revision, new revision, compare).
Auto comment: topic has been updated by Robbinb1993 (previous revision, new revision, compare).
You can do it with an extra log overhead using suffix automaton.
Cool, do you have any sources how to support the removal of a string from the suffix automaton?
https://codeforces.me/edu/course/2/lesson/2
Thanks, I see this is a suffix array instead of suffix automaton though. But both should be possible. I don't see the mentioning of deleting an added string from the suffix array, I am not sure if it is possible to do this after creation.
For suffix automaton and suffix array it is built from only one string, unlike suffix tree. So we should use a delimiter to support multiple strings to be searched. I am not sure if this would support a deletion operation of one of the substrings contained in the concatenated string from which the data structure was built.
Create suffix arrays for strings being added $$$O(S\log(s))$$$. When deleting remove their suffix arrays.
I see. But I think the complexity for a search will be more efficient when using a Suffix tree. It will only be a traversal of M characters (the search pattern string length) and the vertices in the suffix tree will store all ID's of the strings that are below it in the tree (which are added when the string was added to the suffix tree, on its path). If we are at our final character of the search pattern, we will return the set of strings with endpoints below this vertex in the tree and this is our result. Which is only O(M + N) for a search.
You can actually generalise the suffix automaton for multiple strings too (without adding the annoying delimiter) with a very small modification to the construction algorithm. Basically, whenever you're adding a new string, you reset the last node to 0/the root. You then need to also handle the case of already having a transition with the letter you're adding, but that's pretty similar to the normal SA cases. Here's the implementation I use. When you add a new string, you simply set last=0 and add the letter as normal.
Note that for the specific problem you have, this approach needs some tweaking as if the queries come online, you'll might change the structure of the SA, meaning that you need to be careful if you use the internal suffix links that form a tree (which is probably the easiest way to perform what you described). The easiest solution is probably to solve everything offline, by pre-building the generalised SA, and maintaining the suffix tree with a flag on every node. Now every addition/removal of a string is equivalent to adding/removing 1 on a path to the root of a tree, and the query is equivalent to counting the number of nonzero elements. All of those can be handled easily in $$$O(\log)$$$.
Thank you, that is a nice modification. In my case it has to be used in practice where you don't know in front what queries and what input will be given, similar to the interactive problems. So unfortunately an offline solution is not possible. The use case is a big product table of a retailer in which data can be modified, added and deleted at any time. And users should be able to search quickly for search pattern matches in this list of data.
If you need it for production, maybe a relatively good idea would be to use bucketing + the idea I mentioned, say by rebuilding the SA on every $$$B=500$$$ updates. Then you'll be able to query the latest version to get the matches and after that you'll only cross check with at most $$$B$$$ strings (which should be fast enough).
Yes I think this is the way to go. Seems like a deletion operation is not so easy to implement for suffix data structures. Not a lot can be found about this.