Diego's blog

By Diego, 3 years ago, translation, In English

TL;DR

If you code on python, use set or dict, and want to avoid hacks, you can read only the last section.

Introduction

Most people interested in hacking know about the method of creating tests targeting unordered containers in C++. This method is described in more detail in post. However, creation of similar tests for set and dict in python is covered much less well, which I decided to fix with this post.

A little theory about hash tables

I will not go into detailed descriptions and proofs, because there are plenty of them in the open sources. Let us consider only what interests us in this case.

An open-addressing hash table is a data structure that stores a set of elements in an array of knownly larger size, defining the position of the element as its hash taken modulo the size of the array. If hashes of several elements give the same cell in the array, the hash table tries to take another cell according to some rule, which depends on the particular implementation of the table. Usually this rule boils down to checking the cells $$$f(x), f(f(x)), f(f(f(x)))$$$ and so on, until an empty one is found. The function $$$f$$$ is often a linear transformation $$$(a * x + b) \% size$$$, where $$$size$$$ is the size of the array, and $$$a$$$ and $$$b$$$ are relatively prime with it.

Python

Python uses a hash table with an array size equal to a power of two to implement dict, and the transformation is slightly more complex than a simple linear one -- $$$f(x) = (5x + 1 + pertrub) \% size$$$, where $$$pertrub$$$ is initially equal to hash, but is devided by 32 in each step. For a detailed implementation, see repository.

Also, the hash function for numbers in python is very predictable, it's just the number itself.

Thus, to make a countertest it is enough to find a sequence of indexes that will be searched for a particular number and preoccupy them, and then provoke a search for that number in the dictionary, which is quite easy to do for most problems.

The set implementation in Python3 is a bit more complicated, it does not test a single cell, but 10 consecutive cells, which can be observed here. However, building a test in this case is not very difficult either, just add $$$x, x + 1, x + 2, ..., x + 9$$$ instead of one number $$$x$$$.

Tests

I used 153032429 (Thanks to turkids for it) and 153408991 from Codeforces Round 781 (Div. 2) to check. The result can be found as hacks number 796358 and 796362. Judging by the "unknown verdict" result, it went too well and I broke the author's solution at the same time. More details may know shishyando and I_love_teraqqq.

How to protect yourself?

Unlike C++, Python does not provide a way to define its own hash function for an existing type (or I just don't know about it), but nobody prevents you from defining your type with a different hash function:

from random import getrandbits

RANDOM = getrandbits(32)

class Wrapper(int):
    def __init__(self, x):
        int.__init__(x)
    def __hash__(self):
        return super(Wrapper, self).__hash__() ^ RANDOM

An example of using this type can be found in 153409562. Unfortunately, while the author's solution breaks on my tests I will not be able to test the robustness of my solution with the Wrapper class. However, locally it works fast enough.

  • Vote: I like it
  • +178
  • Vote: I do not like it

| Write comment?
»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

I can't view the hacks

Image
»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

This is very interesting. Did Python developers use SipHash for string keys in order to defend against Hash DoS, but left integer keys unprotected?

from Python's configure.ac
  • »
    »
    3 years ago, # ^ |
    Rev. 4   Vote: I like it 0 Vote: I do not like it

    It's probably due to overhead on short keys. Rust, for example, uses SipHash-1-3 by default on all keys, however it does have a lot of overhead for short keys (like integer keys very common in competitive programming problems), e.g. compare:

    153429932 Siphash-1-3, 4694 ms

    150361213 custom hash function based on NASAM, 2167 ms

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      There's also aHash available for Rust: https://github.com/tkaitchuck/aHash

      And an unfinished discussion here about its implementation details when AES instructions are not available: https://github.com/tkaitchuck/aHash/issues/106

      • »
        »
        »
        »
        3 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Yeah I've noticed it

        But there's no way to pull its crate for CF, and when I tried implementing it for myself it's much slower than my custom hash, at least on CF; perhaps it's not configured correctly to do hardware-accelerated AES

        • »
          »
          »
          »
          »
          3 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          This is interesting, because your write_u64 hasher function doesn't exactly look lightweight with 3 multiplications and a bunch of shifts/rotates. I suspect that the specialized code path for short keys in aHash fallback should do fewer calculations. But this needs to be confirmed. Also AES is another part of the puzzle. Codeforces hardware is modern enough to support AES.

          • »
            »
            »
            »
            »
            »
            3 years ago, # ^ |
            Rev. 3   Vote: I like it 0 Vote: I do not like it

            To be clear, what I tried to implement was the AES version (using unsafe to call the AES intrinsic), not the fallback. My function is probably slightly slower than the AHash fallback, however I am somewhat less confident about the fallback's statistical properties

            Edit: Also, only two of those multiplies are actually performed. The update to the dither (self.1) gets optimized away when the hash function only calls write_u64 once per key.

»
3 years ago, # |
  Vote: I like it +13 Vote: I do not like it

Also check this.

  • »
    »
    3 years ago, # ^ |
      Vote: I like it +19 Vote: I do not like it

    Thank you, I haven't seen it before. Also I tested my method for PyPy and it work as well as with cpython. Submissions: 153445268 and 153445556.

    • »
      »
      »
      3 years ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      Why is Wrapper(i)=i? I checked for a few integers and it gives the same value!! Can I use this method in dictionaries for two different data types?

      • »
        »
        »
        »
        3 years ago, # ^ |
        Rev. 2   Vote: I like it +5 Vote: I do not like it

        Wrapper is inherited from int, so it can be used as regular integer (but type(Wrapper(...) + Wrapper(...)) == int!). I do not know what would happen if you mix Wrappers and ints in same dict, but it can be combined with other types, like str, tuple, etc.

        • »
          »
          »
          »
          »
          3 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Ohh. Now I get it. type(Wrapper(i)) is different. Thanks! What does int.__init__(x) mean in init function in Wrapper Class.

          • »
            »
            »
            »
            »
            »
            3 years ago, # ^ |
              Vote: I like it +5 Vote: I do not like it

            Those two init lines don't do anything and should just be removed. They are nonsense.

»
3 years ago, # |
  Vote: I like it +16 Vote: I do not like it

I can confirm this hash hack works in both PyPy2 and PyPy3. The hack is simple enough that any Python user should expect to be hacked using this.

As for the work around. I personally think a better thing to do is to use

from random
RANDOM = random.randrange(2**62)

def Wrapper(x):
  return x ^ RANDOM

This is arguably not as nice, but it should run a lot faster/use less memory in PyPy than using a custom class. Also wrapping int fails for big integers in PyPy2, I'm not entirely sure as to why big ints are a problem.

  • »
    »
    3 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Are non-int (str, tuple, ...) keyed dictionaries safe against this type of hack?

    If not what would an efficient protection be?

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      Hash of str is randomized, so it looks quite hard to make predictable collision. But I haven't yet researched how tuple hashes work, so they can be vulnerable.

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Anyone who was here before me want to comment on how this went for unordered_map? Seems like a pretty short path from esoteric/specific hack that I might not even mind getting hit with individually to... thing that caused FSTs and complaints, and therefore a thing to be pre-empted by inclusion in pretests.

I dunno. I grok and mostly agree with notions like "it is the responsibility of the competitor to know what they're using, even if choices are implied/constrained, they're still algorithmic decisions and therefore fair game" hence the 'not even mad' sentiment in the individualized scenario above. I guess maybe that masochism breaks down at the point where it's no longer up to the would-be hacker to execute the hack, and/or it only takes one successful hack to get replicated across the entire contest...?

Think I answered my own question. Doooo we like it this way though? I don't know at what point something becomes a bad barrier to entry, feels subjectively like we tend to run towards that point wherever it is though...

  • »
    »
    3 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I have seen very few people hack unorederd_map, including me.

    In the last round, tests against unordered_map were included in the main testset and I belive it is well, because it gives beginners a lesson that they should not blindly rely on the tools of the language, but should be aware of how these tools work. Better that they find out about it in the form of a dropped task than in the form of a DoS attack on the service they will one day create.

    And if the creators of the round did not provide some types of tests, it can always be done by participants in the form of hacks.

»
3 years ago, # |
Rev. 4   Vote: I like it 0 Vote: I do not like it

Well, I would suggest this approach, but it slows down the solution, which is already a huge problem for python.

import random

RANDOM = random.randrange((1 << 31) - 1)

class Dict(dict):
    def __setitem__(self, __k, __v):
        return super().__setitem__(__k ^ RANDOM, __v)

    def __delitem__(self, __v):
        return super().__delitem__(__v ^ RANDOM)

    def __getitem__(self, __k):
        return super().__getitem__(__k ^ RANDOM)

    def __contains__(self, __o: object) -> bool:
        return super().__contains__(__o ^ RANDOM)
»
3 years ago, # |
Rev. 2   Vote: I like it +15 Vote: I do not like it

Nice hack! Actually this hack can be solved by converting the number into string. Here is my new solution, only 20% slower than my original solution. 153489121

I suggest every python user to look at this new post, otherwise your code may be hacked in the future since dictionary is used so frequently in the contest.

»
3 years ago, # |
  Vote: I like it +7 Vote: I do not like it

I opened up an issue about this over at PyPy. link

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

It happened again in a div 4 problem H (Gambling)

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

can we just change int to string while storing it as key in dictionary... ? this can fix this error..

»
3 years ago, # |
  Vote: I like it 0 Vote: I do not like it

How does the "solution" help, since it keeps the collisions?

hash(a) = hash(b) is the same as hash(a)^RANDOM=hash(b)^RANDOM..

  • »
    »
    3 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    This solution is not prevent collisions itself, it prevent collision-chains like hash(x) = hash(a_1), f(hash(x)) = hash(a_2), f(f(hash(x))) = hash(a_3), .... Xor with RANDOM changes every element and broke the chain.