Stecher's blog

By Stecher, history, 5 years ago, In English

How to calculate the number of the string S of length N, which does not contain given string T as a substring?

length(S) ~ 10000

I would like to know all the approaches for the following 2 constraints :

  1. length(T) ~ 3 -> I have some DP idea but would appreciate if you can elaborate it more
  2. length(T) ~ 100
  3. length(T) ~ 10000

EDIT: What if we want to find the number of string S which contains given string T as a substring at most K times?

Please let me know all the available approaches.

  • Vote: I like it
  • +52
  • Vote: I do not like it

»
5 years ago, # |
Rev. 3   Vote: I like it +11 Vote: I do not like it

2) DP[i][j] denoting current string has length i and its suffix shares j characters with prefix of T.

DP[i][j] updates from DP[i-1][j-1] for j > 0.

DP[i][0] updates from DP[i-1][k] for each k.

DP[i][j] can also be updated from DP[i-1][k] where k>j (preprocessing will be required to find)

  • »
    »
    5 years ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    nishant403 I think this solution can solve task 3 as well. What's the problem with that? And also I remember now I've solved a similar problem on cf as well.

    • »
      »
      »
      5 years ago, # ^ |
        Vote: I like it +11 Vote: I do not like it

      There are total N*N states and each state might require O(N) operations to update.

      In case like abacacacac, DP[i][2] updates from DP[i][3], DP[i][5] etc. (I dont know of such a hard case which updates O(N) states but I think it can exist)

      • »
        »
        »
        »
        5 years ago, # ^ |
          Vote: I like it +21 Vote: I do not like it

        It is O(26*N*N). At every state, try to append a letter to the current string. To find the correct $$$j$$$ value, you can use prefix function/kmp automaton.

        • »
          »
          »
          »
          »
          5 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Yes, you are right. Though 26*N*N cant pass N = 10000 but It seems that 26*N*N never occurs. What will be the actual worst case ??

          • »
            »
            »
            »
            »
            »
            5 years ago, # ^ |
            Rev. 2   Vote: I like it 0 Vote: I do not like it

            nishant403 That 26 is for what will be the next character, for all the states, you have to consider all the possible next characters. And we are using KMP/Z algorithm to precompute the next possible value of J' for dp[i][j] if the j + 1 the character does not match with next character we are trying to append with

            • »
              »
              »
              »
              »
              »
              »
              5 years ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              Consider the code Here

              The part where I construct the matrix, I think it will take very less time than actual time complexity because most of the transitions will share 0 characters to prefix.

»
5 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

3) Let DP[i] denote number of strings of length i such that T ends at i for the first time.

(At last we can add DP[i]*26^(N-i) to get all strings which containt string T and subtract it from 26^N. )

Initialize DP[i] to be 26^(i-len(T)) for i >= len(T).

Now we subtract from it all cases where string can end before i. (Similar to this )

Note that there may be indices j > i — len(T) from which we can update (preprocess needed)

  • »
    »
    5 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    nishant403 I have high doubts regarding the correctness of this solution. Let's consider this problem to one given in the link, Then we have got a straight 1xN single row matrix, with all the blocked cells at (1,i) where i >= len(T). Now, the problem is we are blocking all the blocked cells, but there might be cases when some blocked cells can be available cells when few are already blocked before that..

    • »
      »
      »
      5 years ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      I don't understand the case you are talking about.

      Here is my code. Input format is same as lightOJ's question (without testcases).

      I got AC in lightOJ using this approach for n < 1000.

      Time complexity : O(N*N)

»
5 years ago, # |
Rev. 2   Vote: I like it +43 Vote: I do not like it

I have solved some similar problem sometime ago and would like to share some interesting approaches to this problem.

1) Solution 1: Complexity- O(|S|*|T|*26)

DP[i][j] -> currently set the first 'i' elements of S and 'j' is the length of the longest suffix of S which is a prefix of T. Transition would require you to iterate over all the characters and calculate the updated longest suffix. These transitions can be preprocessed in O(|T|*26) using the prefix function and stored in another dp, TRANS[j][c] where j is the current suffix, c is the character you add after the current suffix and TRANS[j][c] would store the length of longest updated suffix. Note that you should not perform the transition where TRANS[j][c] = |T|. For the bonus part, you can maintain another state which will store the number of times you visit j = |T|.

2) Solution 2: O(|T|^3*log(|S|))

We will create a transition matrix for T and use matrix expo over the length of S to find the answer. Construct a matrix M, where M[i][j] will store the number of characters s.t. the longest suffix changes from i to j on adding the character (again use prefix function on T) and skip those transitions where j = |T| or i = |T|. If you exponentiate this to the power of N, you will realize that the answer is sum of M[0][j] where 0 <= j < |T|.

3) What if you are given multiple patterns ( say P1, P2, ..., Pk) and you have to calculate the number of strings S of length N s.t. none of the patterns occur in the string ?

Solution: We will use aho corasick to solve this problem in O(|N|*|sum of pattern length|). Build aho corasick over all the patterns and mark those nodes where atleast one pattern ends. We will consider these nodes as the states and make transitions only when we are not ending at any marked node. The new dp state becomes dp[i][j] where i is the length of the current string and j is the state in the aho corasick. Transitions will be similar as in Solution 1. You can also use Solution 2 and solve the problem in O(|sum of pattern length|^3*log(|S|)) too.

Here are some problems which uses the last approach : Problem Problem

»
5 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

EDIT: What if we want to find the number of string S which contains given string T as a substring at most K times?

Again DP[i][len(Alphabet)]. You can make a transition function via KMP, and it will be len(T) * (K + 1), (each len(T) block is offset by K * (T)). You don't want to reach the final state, so if you have string abc, and you want to have it max 2 times you have this — [0,0,0,3,3,3,6,6,6], which means if you already have matched 'abc' once, each "invalid" character brings you back to state 3, not 0.

  • »
    »
    5 years ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    zed_b Your explanation is a bit unclear. But I got your point. whenever we are matching at position j and if the character does not match, we reach (j/k) * j the position instead of 0 after applying KMP. maybe that would work. I think that is same as dp[N][K][len(T)]