Hi, I'm attempting this problem with string hashing.
It works well for small inputs but gives wrong answer on very large inputs.
Any help on what could be wrong?
Problem : https://codeforces.me/contest/271/problem/D
Submission: https://codeforces.me/contest/271/submission/46239564
I think you just need to use double hashing to avoid collision .
How exactly?
Also, how can I tell I need to use double hashing? I mean, when is it necessary?
Furthermore, I've seen solutions to this problem with just one normal hashing :(
I mean with double hahsing is to use two hash values for the string with two different base and MOD values .
In general you can't tell when will a single hash solution will pass the test cases for a problem as the collision happens with a probability and you can't tell if your solution will collide or not but you can reduce the probability of collision as much as you can .
You can calculate this probabilty by assuming that the hash values will be uniformly distrubted over the different values of strings so as much as you increase the value of the MOD you will gain more probability of getting ACC (less probability of collision) or by using double hashing for solutions based on rolling hash in your case .
Thanks, it worked. However, how could I tell I needed the double hashing before submitting?
You don't need to detect when you should use 2 or more hashes. One could say you should do according to your intuition, but I suggest always using multiple hashes, depending on how memory and time consuming it is to build this many hashes. Say, 2 or 3 is the usual amount I use.
The following is an accepted solution based on collision-free substring hashing. The main idea is to enumerate small letters between a and z as integers between 0 and M - 1, where M = 26. Then, up to P consecutive symbols in the string are packed in a single integer as digits of a base-M integer using iterated multiplication and addition without overflow, and P = 13 for a 64-bit signed integer. The sequence of integers generated from packing a substring represents a collision-free hash key for all substrings with the same length. A two-dimensional array of hash-key sets is used to store the distinct keys generated from all substrings in the input string, where the first index represents the number of bad letters in the substring and the second index represents the length of the substring. It is guaranteed that two substrings are different if the number of bad letters they contain are different or their lengths are different. In other words, all substrings stored in one item of the two-dimensional array have the same number of bad letters and the same length.
46247924
UPDATE:
The following is an update for the previous solution using one-dimensional array to store the collision-free hash key (using the second index only of the previous solution, i.e. the substring length). This update improved both the execution time and memory used.
46257737
Always Use double hashing if possible. The probability of collision in single hashing is N/MOD. While Using double hashing the probability of collision becomes (N*N/MOD*MOD1). In case of worst case, N/MOD might become 10e-4 which will lead you to trouble. Instead while using double hashing, In the worst case, the probability of collision will remain 10e-8 at least.
I think that the birthday paradox is a convenient way to measure this: if we generate something like random integers from 0 to MOD - 1, the probability of collision will be somewhere near 0.5. So if you want to make a lot of string comparisons using 32-bit hashing, the probability of collision is high (and it becomes even higher assuming there are multiple tests, and you should pass all of them).
Taking two (or three) 32-bit hashes or one (or two) 64-bit hash should be enough almost in every problem.