TL;DR
If you code on python, use set or dict, and want to avoid hacks, you can read only the last section.
Introduction
Most people interested in hacking know about the method of creating tests targeting unordered containers in C++. This method is described in more detail in post. However, creation of similar tests for set and dict in python is covered much less well, which I decided to fix with this post.
A little theory about hash tables
I will not go into detailed descriptions and proofs, because there are plenty of them in the open sources. Let us consider only what interests us in this case.
An open-addressing hash table is a data structure that stores a set of elements in an array of knownly larger size, defining the position of the element as its hash taken modulo the size of the array. If hashes of several elements give the same cell in the array, the hash table tries to take another cell according to some rule, which depends on the particular implementation of the table. Usually this rule boils down to checking the cells $$$f(x), f(f(x)), f(f(f(x)))$$$ and so on, until an empty one is found. The function $$$f$$$ is often a linear transformation $$$(a * x + b) \% size$$$, where $$$size$$$ is the size of the array, and $$$a$$$ and $$$b$$$ are relatively prime with it.
Python
Python uses a hash table with an array size equal to a power of two to implement dict, and the transformation is slightly more complex than a simple linear one -- $$$f(x) = (5x + 1 + pertrub) \% size$$$, where $$$pertrub$$$ is initially equal to hash, but is devided by 32 in each step. For a detailed implementation, see repository.
Also, the hash function for numbers in python is very predictable, it's just the number itself.
Thus, to make a countertest it is enough to find a sequence of indexes that will be searched for a particular number and preoccupy them, and then provoke a search for that number in the dictionary, which is quite easy to do for most problems.
The set implementation in Python3 is a bit more complicated, it does not test a single cell, but 10 consecutive cells, which can be observed here. However, building a test in this case is not very difficult either, just add $$$x, x + 1, x + 2, ..., x + 9$$$ instead of one number $$$x$$$.
Tests
I used 153032429 (Thanks to turkids for it) and 153408991 from Codeforces Round 781 (Div. 2) to check. The result can be found as hacks number 796358 and 796362. Judging by the "unknown verdict" result, it went too well and I broke the author's solution at the same time. More details may know shishyando and I_love_teraqqq.
How to protect yourself?
Unlike C++, Python does not provide a way to define its own hash function for an existing type (or I just don't know about it), but nobody prevents you from defining your type with a different hash function:
from random import getrandbits
RANDOM = getrandbits(32)
class Wrapper(int):
def __init__(self, x):
int.__init__(x)
def __hash__(self):
return super(Wrapper, self).__hash__() ^ RANDOM
An example of using this type can be found in 153409562. Unfortunately, while the author's solution breaks on my tests I will not be able to test the robustness of my solution with the Wrapper class. However, locally it works fast enough.
I can't view the hacks
https://pastebin.com/5nDL7s7S
This is very interesting. Did Python developers use SipHash for string keys in order to defend against Hash DoS, but left integer keys unprotected?
It's probably due to overhead on short keys. Rust, for example, uses SipHash-1-3 by default on all keys, however it does have a lot of overhead for short keys (like integer keys very common in competitive programming problems), e.g. compare:
153429932 Siphash-1-3, 4694 ms
150361213 custom hash function based on NASAM, 2167 ms
There's also aHash available for Rust: https://github.com/tkaitchuck/aHash
And an unfinished discussion here about its implementation details when AES instructions are not available: https://github.com/tkaitchuck/aHash/issues/106
Yeah I've noticed it
But there's no way to pull its crate for CF, and when I tried implementing it for myself it's much slower than my custom hash, at least on CF; perhaps it's not configured correctly to do hardware-accelerated AES
This is interesting, because your write_u64 hasher function doesn't exactly look lightweight with 3 multiplications and a bunch of shifts/rotates. I suspect that the specialized code path for short keys in aHash fallback should do fewer calculations. But this needs to be confirmed. Also AES is another part of the puzzle. Codeforces hardware is modern enough to support AES.
To be clear, what I tried to implement was the AES version (using
unsafe
to call the AES intrinsic), not the fallback. My function is probably slightly slower than the AHash fallback, however I am somewhat less confident about the fallback's statistical propertiesEdit: Also, only two of those multiplies are actually performed. The update to the dither (
self.1
) gets optimized away when the hash function only callswrite_u64
once per key.Also check this.
Thank you, I haven't seen it before. Also I tested my method for PyPy and it work as well as with cpython. Submissions: 153445268 and 153445556.
Why is Wrapper(i)=i? I checked for a few integers and it gives the same value!! Can I use this method in dictionaries for two different data types?
Wrapper is inherited from int, so it can be used as regular integer (but type(Wrapper(...) + Wrapper(...)) == int!). I do not know what would happen if you mix Wrappers and ints in same dict, but it can be combined with other types, like str, tuple, etc.
Ohh. Now I get it. type(Wrapper(i)) is different. Thanks! What does int.__init__(x) mean in init function in Wrapper Class.
Those two init lines don't do anything and should just be removed. They are nonsense.
I can confirm this hash hack works in both PyPy2 and PyPy3. The hack is simple enough that any Python user should expect to be hacked using this.
As for the work around. I personally think a better thing to do is to use
This is arguably not as nice, but it should run a lot faster/use less memory in PyPy than using a custom class. Also wrapping int fails for big integers in PyPy2, I'm not entirely sure as to why big ints are a problem.
Are non-int (str, tuple, ...) keyed dictionaries safe against this type of hack?
If not what would an efficient protection be?
Hash of str is randomized, so it looks quite hard to make predictable collision. But I haven't yet researched how tuple hashes work, so they can be vulnerable.
I see.
Thanks.
Anyone who was here before me want to comment on how this went for unordered_map? Seems like a pretty short path from esoteric/specific hack that I might not even mind getting hit with individually to... thing that caused FSTs and complaints, and therefore a thing to be pre-empted by inclusion in pretests.
I dunno. I grok and mostly agree with notions like "it is the responsibility of the competitor to know what they're using, even if choices are implied/constrained, they're still algorithmic decisions and therefore fair game" hence the 'not even mad' sentiment in the individualized scenario above. I guess maybe that masochism breaks down at the point where it's no longer up to the would-be hacker to execute the hack, and/or it only takes one successful hack to get replicated across the entire contest...?
Think I answered my own question. Doooo we like it this way though? I don't know at what point something becomes a bad barrier to entry, feels subjectively like we tend to run towards that point wherever it is though...
I have seen very few people hack unorederd_map, including me.
In the last round, tests against unordered_map were included in the main testset and I belive it is well, because it gives beginners a lesson that they should not blindly rely on the tools of the language, but should be aware of how these tools work. Better that they find out about it in the form of a dropped task than in the form of a DoS attack on the service they will one day create.
And if the creators of the round did not provide some types of tests, it can always be done by participants in the form of hacks.
Well, I would suggest this approach, but it slows down the solution, which is already a huge problem for python.
Nice hack! Actually this hack can be solved by converting the number into string. Here is my new solution, only 20% slower than my original solution. 153489121
I suggest every python user to look at this new post, otherwise your code may be hacked in the future since dictionary is used so frequently in the contest.
I opened up an issue about this over at PyPy. link
It happened again in a div 4 problem H (Gambling)
yes sir
can we just change int to string while storing it as key in dictionary... ? this can fix this error..
How does the "solution" help, since it keeps the collisions?
hash(a) = hash(b)
is the same ashash(a)^RANDOM=hash(b)^RANDOM
..This solution is not prevent collisions itself, it prevent collision-chains like
hash(x) = hash(a_1), f(hash(x)) = hash(a_2), f(f(hash(x))) = hash(a_3), ...
. Xor withRANDOM
changes every element and broke the chain.