ankurrathinsit's blog

By ankurrathinsit, history, 4 years ago, In English

As the title says: given a binary string of length n (1 <= n <= 10^5), find the number of unique decimal values that can be obtained from subsequences of the binary string:

Example : 101

0=0

1=1

01 (same as 1) so rejected

10= 2

101= 5

11= 3

So the number of unique decimals is 5. I can only think of a recursive solution.

How should I approach this question?

Btw it was asked today in a Google hiring contest.
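For small inputs, the enumeration in the example above can be reproduced by brute force. The following is only a sketch for checking answers, not a solution for n up to 10^5: it is exponential in n, and the function name `count_unique_values_bruteforce` is made up for illustration.

```python
from itertools import combinations


def count_unique_values_bruteforce(s: str) -> int:
    """Count distinct decimal values over all non-empty subsequences of s.

    Exponential in len(s); only meant for tiny strings.
    """
    values = set()
    n = len(s)
    for length in range(1, n + 1):
        # Every increasing tuple of indexes picks one subsequence.
        for idx in combinations(range(n), length):
            bits = "".join(s[i] for i in idx)
            # int(..., 2) ignores leading zeroes, so "01" and "1" collapse.
            values.add(int(bits, 2))
    return len(values)


print(count_unique_values_bruteforce("101"))  # 5
```

For "101" the set of values is {0, 1, 2, 3, 5}, which matches the count of 5 in the example.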

»
4 years ago, # |

Do we have to find the answer mod something?

»
4 years ago, # |
Rev. 4

So the maximum valid subsequence would look like this (if we only had m = 3 ones):

1 0 0 0 0 ... 0 0 1 0 0 ... 0 0 1 0 0 ... 00

Of course, the number of zeroes between two ones can be zero.

Let a[i] be the number of zeroes between the ith one and the (i + 1)th one (where i goes from 1 to m). In our case m = 3, and to simplify the formulas a[3] is the number of zeroes after the third one.

The number of valid subsequences containing exactly m = 3 ones would be the product of every a[i] for i from 1 to 3.

However, the problem is of course more complicated than that. We have to find a way to avoid counting the same subsequence several times, while keeping in mind that the number of 1s can be anywhere from 0 to m.

Let's take any subsequence of that earlier form, not just the maximal one (by count of 1s). Say it contains k <= m ones.

Let b[j] be the number of zeroes between the jth and (j + 1)th one (j from 1 to k) that we actually include in our subsequence. I want to be extra careful about the range of b[j]. If I pin down the interval of b[j] once, I can likely avoid counting some subsequences twice; without fixing that interval, it is much harder to count each subsequence exactly once. That is why I favour going from one end of the given sequence to the other (here, right to left), building the count from the count for the last 1s (and the 0s after them).

Initially, we'd just have a[m]. Then when we decide what to do with a[m - 1], there are two variants:

  1. Count both the mth one and the (m - 1)th one, which means simply a[m - 1] * a[m]

  2. Count just the (m - 1)th one, adding the a[m] zeroes, making a[m - 1] + a[m].

Let d[i] be the number of different subsequences from the end to the ith term (i <= m).

This then generalises for a[i]:

  1. Count both ith and i + 1th one, which means simply a[i] * d[i + 1]

  2. Now it gets tricky. You also have to count the subsequences that don't include the (i + 1)th one, or don't include any one from the (i + 1)th up to a certain point, let's say p. If we write the number of zeroes between the (i + 1)th one and the pth one as a[i + 1, p], then

d[i] = (a[i + 1, p] + a[i]) * d[p].

which you can rewrite as

d[i] = a[i + 1, p] * d[p] + a[i] * d[p]

You would need to take many values of p into consideration. But you can keep a running sum of the d[p]s you passed while walking through the array and then compute a[i] * sum(d[p]) at step i. You can do a similar thing for a[i + 1, p] * d[p].

This should ensure that in d[1] you will have counted every possible combination exactly once. It should require just an iteration through the whole array, therefore being O(n).

  • »
    »
    4 years ago, # ^ |

    I see your solution is similar to mine :).

    Also, there's a mistake in the "Count both ith and i + 1th one, which means simply a[i] * d[i + 1]" part. When we count both ones we can choose any number of zeroes from 0 to a[i], so it's actually (a[i] + 1) * d[i + 1]. The part about not using the (i + 1)th one is correct, though, because if you don't use at least 1 zero you will be repeating previous sequences.

    • »
      »
      »
      4 years ago, # ^ |

      Oh yeah I forgot. Classic off-by-one bug amirite?

»
4 years ago, # |
Rev. 2

Ok so I'm not 100% sure if this solution is correct. I coded it and tested it with a lot of random inputs, but n had to be very small (n <= 15) because I checked it against a brute force that is O(2^n).

EDIT: Actually, after writing the proof out I'm pretty confident this is correct.

Obs 1: every unique sequence except for {0} starts with a 1. Therefore, I will do the subsequent steps assuming that, and add the sequence {0} at the end if there are any zeroes in the string.

Obs 2: we can always make all completely unique sequences by appending a prefix that starts on index i to a previously found unique sequence that starts on index j > i (we assume that an empty sequence is a valid sequence so that this assumption works).

Let dp[i] be the number of unique sequences that start on i and are different from every unique sequence that starts on j > i. We start with dp[n+1] = 1 (the empty sequence I mentioned in obs 2) and go down to dp[1]. The answer will be the sum of all dp on indexes <= n, and as said in obs 1, we may need to add one more sequence for {0}. Also, for the sake of convenience, let's define f(i) to be the index of the next 1 in the initial string after index i.

If we're trying to calculate dp[i] and there are z zeroes between i and the next 1 in the string (on index f(i)), we run into 2 cases.

  1. We are going to use the 1 on f(i): the number of sequences will be (z+1)*dp[f(i)], because we can put the 1 on index i, followed by anywhere from 0 to z zeroes, to the left of all sequences starting on f(i).

  2. We are not going to use the 1 on f(i): in that case it's like we're removing it from the string completely. To avoid conflicting with any previously counted sequence, we need to make sure that the number of zeroes between i and f(f(i)) is bigger than the number of zeroes between f(i) and f(f(i)), since otherwise we would be repeating a sequence. So we need to use between 1 and z of our zeroes, in addition to all the zeroes already between f(i) and f(f(i)), instead of between 0 and z as in the first case. In fact, by fulfilling that condition, we can make entirely new sequences not only with f(f(i)), but with ALL j > f(i), because no previous sequence could match the number of zeroes we achieve by doing so. Therefore, the number of sequences that DON'T use f(i) will be z*{sum of all dp[x] such that f(i) < x <= n+1}.
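As a sanity check of the two cases, here is a worked trace on the string 1010 (an example chosen for illustration, not one from the thread). The ones sit on indexes 1 and 3, each followed by z = 1 zero, and dp[5] is the empty-sequence sentinel dp[n+1]:

```latex
\begin{align*}
dp[5] &= 1 \\
dp[3] &= (1+1)\cdot dp[5] + 1\cdot 0 = 2
      && \text{(sequences 1, 10)} \\
dp[1] &= (1+1)\cdot dp[3] + 1\cdot dp[5] = 5
      && \text{(sequences 11, 110, 101, 1010, 100)} \\
\text{answer} &= dp[1] + dp[3] + 1 = 8
      && \text{(the extra 1 is for } \{0\}\text{)}
\end{align*}
```

The eight values are 0, 1, 2, 3, 4, 5, 6 and 10.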

Knowing that, we don't actually need to build a dp array; instead we can keep one variable storing dp[f(i)] and another storing the sum of all dp with an index bigger than f(i).

Like that we solve this problem in O(n) time and O(1) space. Sheesh
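A minimal sketch of the two-variable version described above might look like this. The names `count_unique_values`, `dp_next` and `sum_after` are made up for illustration, and since the thread never settles whether the answer should be taken modulo something, this relies on Python's arbitrary-precision integers instead.

```python
def count_unique_values(s: str) -> int:
    """Count distinct decimal values of non-empty subsequences of s.

    Scans right to left, keeping only dp[f(i)] and the sum of dp[x]
    for x > f(i), as in the recurrence described in the comment.
    """
    dp_next = 1    # dp[f(i)]; starts as the empty-sequence sentinel dp[n+1]
    sum_after = 0  # sum of dp[x] over all x > f(i), sentinel included
    total = 0      # sum of dp[i] over the ones processed so far
    zeros = 0      # zeroes seen since the last '1' (i.e. z for the next '1')

    for ch in reversed(s):
        if ch == "0":
            zeros += 1
        else:
            # Case 1: keep the 1 on f(i), with 0..z zeroes in between.
            # Case 2: skip the 1 on f(i), using 1..z of our zeroes.
            dp_i = (zeros + 1) * dp_next + zeros * sum_after
            total += dp_i
            sum_after += dp_next
            dp_next = dp_i
            zeros = 0

    # Obs 1: the value 0 is available iff the string contains a zero.
    return total + (1 if "0" in s else 0)


print(count_unique_values("101"))  # 5
```

The arithmetic uses constant extra state besides the input, although the numbers themselves grow with n when no modulus is applied.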

I really liked this problem; it was really fun to think about even though it looks like a textbook problem. I would also really like to submit my answer somewhere so that I can be certain everything is ok. Anyway, you can ask any questions if you want, bye :)