its_aks_ulure's blog

By its_aks_ulure, history, 5 years ago, In English

I have been preparing some problems for a college contest over the past few weeks and came up with this problem idea.

You are given $$$N$$$ words, each consisting of lowercase English letters and/or a special character '#'.
Every occurrence of '#' is replaced, independently and uniformly at random, with a letter from 'a' to 'z', and a trie is then built from the resulting words. Find the expected number of nodes in this trie.

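For small cases the exact expectation can be computed by brute force; the sketch below (my own, purely for validation, not an intended solution) enumerates every replacement of the '#' characters, builds the trie each time, and averages the node counts. It is exponential in the number of '#'s, so it is only useful for checking faster solutions on tiny inputs. It counts one node per distinct non-empty prefix, i.e., it excludes the root.

```cpp
#include <bits/stdc++.h>
using namespace std;

// One trie node per distinct non-empty prefix of the (fixed) words.
long long trieNodes(const vector<string>& words) {
    set<string> prefixes;
    for (const string& w : words)
        for (size_t len = 1; len <= w.size(); ++len)
            prefixes.insert(w.substr(0, len));
    return (long long)prefixes.size();
}

// Exact expectation by enumerating all 26^k assignments of the k '#'s.
double expectedTrieNodes(vector<string> words) {
    vector<pair<int,int>> holes;  // (word index, position) of every '#'
    for (int i = 0; i < (int)words.size(); ++i)
        for (int j = 0; j < (int)words[i].size(); ++j)
            if (words[i][j] == '#') holes.push_back({i, j});

    long long total = 0, cases = 0;
    vector<int> digit(holes.size(), 0);  // base-26 counter over assignments
    while (true) {
        for (size_t h = 0; h < holes.size(); ++h)
            words[holes[h].first][holes[h].second] = (char)('a' + digit[h]);
        total += trieNodes(words);
        ++cases;
        // advance the counter; stop once every assignment has been seen
        size_t h = 0;
        while (h < holes.size() && ++digit[h] == 26) digit[h++] = 0;
        if (h == holes.size()) break;
    }
    return (double)total / (double)cases;
}

int main() {
    // {"a#", "ab"}: E = 2*(1/26) + 3*(25/26) = 77/26 ~ 2.961538
    printf("%.6f\n", expectedTrieNodes({"a#", "ab"}));
}
```
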
I want to know if there exists any polynomial time solution for this problem.

Thanks


»
5 years ago, # |

Bump! Can anyone help me?

»
5 years ago, # |

Suppose we are given an ordered list of strings $$$S$$$. Assume the words are inserted in the order in which they appear in $$$S$$$. By linearity of expectation, the final answer is

$$$ \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{|S_i|} P(i, j) $$$

where $$$P(i, j)$$$ is the probability that character $$$j$$$ of string $$$S_i$$$ is not already in the trie when it is added. Without loss of generality, assume that $$$S_{ij}$$$ is not #, since if it is, we can just add up the probabilities for all characters a-z and divide by $$$26$$$. (edit: this isn't necessary; the algorithm below works perfectly fine if $$$S_{ij}$$$ is #)

The probability that $$$S_{ij}$$$ is not in the trie is simply the product of the probabilities that $$$S_i$$$ does not share a prefix of length $$$j$$$ with $$$S_k$$$, over all $$$k<i$$$. For fixed $$$i,j,k$$$, we just do casework on the two prefixes. If they disagree at some position where both characters are fixed (not #), the probability that they disagree is automatically $$$1$$$. Otherwise, the probability that they end up agreeing is $$$\frac{1}{26}$$$ raised to the power of the number of positions where at least one string has a #; subtract from $$$1$$$ to get the probability that they disagree.

Let $$$m$$$ be the maximum string length, and let $$$a=26$$$. The total runtime is $$$O(an^2m)$$$, which I'm sure can be optimized further, but it is indeed polynomial. This is just a rough sketch of a solution and may have errors; I welcome corrections in comment replies.
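
A direct sketch of this in C++ (my own, relying on the edit above so no averaging over letters is needed). Note that it takes the independence of the pairwise events on faith. As written it rescans each prefix from scratch, giving $$$O(n^2m^2)$$$; maintaining per-pair prefix state incrementally over $$$j$$$ would recover $$$O(n^2m)$$$.

```cpp
#include <cmath>
#include <string>
#include <vector>
using namespace std;

// Probability that the length-j prefixes of a and b disagree somewhere,
// treating each '#' as an independent uniform letter.  A "wild" position
// (at least one '#') matches with probability 1/26 in every case.
double prefixDisagreeProb(const string& a, const string& b, int j) {
    int wild = 0;
    for (int p = 0; p < j; ++p) {
        if (a[p] == '#' || b[p] == '#') ++wild;
        else if (a[p] != b[p]) return 1.0;  // fixed mismatch: certain disagreement
    }
    return 1.0 - pow(1.0 / 26.0, wild);
}

// The estimate described above: sum over (i, j) of the product over k < i,
// *assuming* the pairwise disagreement events are independent.
double proposedEstimate(const vector<string>& s) {
    double answer = 0.0;
    for (size_t i = 0; i < s.size(); ++i)
        for (size_t j = 1; j <= s[i].size(); ++j) {
            double p = 1.0;  // P(i, j) under the independence assumption
            for (size_t k = 0; k < i; ++k)
                if (s[k].size() >= j)  // shorter words cannot create this node
                    p *= prefixDisagreeProb(s[i], s[k], (int)j);
            answer += p;
        }
    return answer;
}
```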

  • »
    »
    5 years ago, # ^ |

    I don't believe the claim "The probability that $$$S_{ij}$$$ is not in the trie, is simply the product of the probabilities that $$$S_i$$$ does not share a prefix of length $$$j$$$ with $$$S_k$$$, for all $$$k<i$$$", because the events could be dependent.

    Consider the case where $$$S_1=\text{ab}$$$, $$$S_2=\text{ac}$$$, $$$S_3=\text{#}$$$ and $$$i=3$$$, $$$j=1$$$. Since $$$S_1$$$ and $$$S_2$$$ both start with 'a', the two disagreement events coincide: the true probability is $$$\frac{25}{26}$$$, while the product formula gives $$$\left(\frac{25}{26}\right)^2$$$.
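
    A tiny brute-force check of this example (my own sketch) makes the gap explicit:

```cpp
#include <cstdio>

int main() {
    // S1 = "ab", S2 = "ac", S3 = "#": enumerate the 26 replacements of '#'
    // and count those whose first character creates a new trie node.
    int fresh = 0;
    for (char c = 'a'; c <= 'z'; ++c)
        if (c != 'a') ++fresh;  // must differ from S1[0] = S2[0] = 'a'
    printf("exact P(3,1)    = %.4f\n", fresh / 26.0);              // 0.9615
    printf("product formula = %.4f\n", (25.0 / 26) * (25.0 / 26)); // 0.9246
}
```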

    • »
      »
      »
      5 years ago, # ^ |

      Yes, I see. So we would actually have to iterate over all possible configurations of $$$S_i$$$ achievable by replacing # with a-z, which is $$$O(a^m)$$$ in the worst case.

      If we ignore the case of having a # in the strings before $$$S_i$$$, we can permute the indices up to $$$j$$$ to float all of the # characters of $$$S_i$$$ to the beginning, then delete all earlier elements of $$$S$$$ that disagree with $$$S_i$$$ at one or more of the remaining (fixed) indices, since they contribute nothing to $$$P(i, j)$$$. So, without loss of generality, we can assume $$$S_i$$$ contains nothing but # characters, and the problem reduces to: given a bunch of strings of length $$$j$$$, what is the probability that a uniformly randomly chosen string of length $$$j$$$ is not equal to any of them? This is easy to answer: it is $$$1-\frac{\text{number of distinct strings}}{a^j}$$$.
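
      In code, that final step might look like this (a sketch under the same assumption that the earlier strings contain no #):

```cpp
#include <cmath>
#include <set>
#include <string>
#include <vector>
using namespace std;

// Probability that a uniformly random length-j string over `a` letters
// differs from every string in `fixed` (all fully specified, length j).
double probAllDiffer(const vector<string>& fixed, int j, double a = 26.0) {
    set<string> distinct(fixed.begin(), fixed.end());
    return 1.0 - (double)distinct.size() / pow(a, j);
}
```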

      So, if we ignore the case of # occurring in other strings, then we can solve it in polynomial time. We have a solution for when $$$S_i$$$ contains no # characters and all other strings contain # characters, and we also have a solution for when $$$S_i$$$ contains # characters and all other strings contain none. Is it possible to combine these two solutions somehow to get an answer in the general case?

      I think the answer is no: if we try to perform the same trick of separating $$$S_i$$$ into # components and non-# components, we end up with potentially $$$2^n$$$ duplicate-deletion scenarios, as opposed to only one in the previous case. However, this is still an improvement over the $$$O(a^m)$$$ time we had before. So, I now think that the problem is NP-hard.

»
5 years ago, # |

I would love to see someone add this problem to a judge (or maybe it already exists?).

»
5 years ago, # |

I suspect this problem is NP-hard, just like most counting problems that boil down to fixing some order of elements. I have no idea how to (dis)prove that, though, so I might be terribly wrong here.