Hi folks, I got to know about this problem from a friend. It was asked in CodeNation's coding round/interview last year.
Any leads on a better approach than $$$O(n^2)$$$ will be appreciated.
Question — You are given $$$n$$$ strings and $$$q$$$ queries. Each query is also a string (say $$$q_i$$$). You have to reply with true if this string can be formed by concatenating the $$$n$$$ strings given earlier.
For example — Suppose $$$n = 3$$$ and the strings are $$$abc$$$, $$$bc$$$ and $$$cd$$$.
So $$$abcabc$$$ is a valid string because you can use $$$abc$$$ twice. (There is no constraint on the number of times you can use any given string.)
Also, $$$bccdabc$$$ is a valid string as we can use the given strings in the order $$$2$$$, $$$3$$$ and $$$1$$$.
The sum of the lengths of all query strings is $$$\leq 10000$$$.
My friend final_tsu helped me come up with an $$$O(n^2)$$$ approach using a trie and DP.
Thanks.
Some weeks ago, tfg told me how to solve it online with time and space $$$O(N \sqrt{N})$$$.
It is possible to solve it online using dp + sqrt decomposition.
To answer a query of string $$$t$$$, we build dp[i] = "it is possible to concatenate some strings from your set to form the prefix with $$$i$$$ characters". The transition looks like: dp[i + j] is true if dp[i] is true and there is a match of a string from your set with size $$$j$$$ starting at position $$$i + 1$$$ of $$$t$$$. The answer for the query is dp[t.size()].
To build the dp we need some precomputation. Let $$$S$$$ be the sum of sizes of all $$$n$$$ given strings. We build a trie with all strings of size $$$< \sqrt{S}$$$ and build the prefix function for all the others. The trie has height at most $$$\sqrt{S}$$$ and there are at most $$$\sqrt{S}$$$ prefix functions saved.
Traverse the trie starting at every position of $$$t$$$ and save all ranges $$$(l, r)$$$ of $$$t$$$ that match some string in the trie. This works in $$$O(|t| \cdot \sqrt{S})$$$.
Now for all big strings with size $$$\leq |t|$$$ do the usual KMP matching and save ranges $$$(l, r)$$$ as before. This also works in $$$O(|t| \cdot \sqrt{S})$$$.
You end up with $$$O(|t| \cdot \sqrt{S})$$$ ranges that you can use to compute the dp.
Thanks a lot for the clear explanation :)
Are you sure this uses linear memory? The same time complexity can be achieved by using Aho-Corasick — no need to explicitly divide strings into long and short ones. If the state of the Aho-Corasick automaton is $$$x$$$ after reaching the prefix $$$t_{[0,i)}$$$, the value of $$$dp_i$$$ can be found by following dictionary links from node $$$x$$$. Since depths of nodes on this dictionary path are strictly decreasing and they all correspond to strings from the set $$$S$$$, if the total length of the $$$n$$$ strings is $$$L$$$ then there are no more than $$$O(\sqrt L)$$$ of these dictionary nodes, so the algorithm runs in $$$O(L + |t|\sqrt L)$$$ time and uses linear memory.
$$$\textit{Are you sure this uses linear memory?}$$$
I was wrong. tfg's solution also divided the queries into long and short ones to achieve linear complexity. I thought this simplification kept the memory linear, but I forgot about the saved ranges.
Your solution with Aho-Corasick is really nice because it's just a direct use of it.
Yes, that solution uses linear memory. The trie is linear memory, the borders precomputation uses linear memory, and keeping the KMP states uses $$$O(\sqrt{L})$$$ memory. You don't need to explicitly have a vector of occurrences, just process them as they come (as in here, but keeping the KMP states for each big pattern). I've been aware of that Aho-Corasick solution ever since this problem. Your solution is offline though, but you can easily make it online by using the trick of maintaining $$$O(\log)$$$ sets; it'll work in $$$O(\sqrt{L} + \log(N))$$$ per character if we assume no equal strings are given, and that's easily treatable.
Edit: online as in queries could be adding patterns.