Qualified's blog

By Qualified, history, 4 years ago, In English

I see this article and at the bottom, it says that modulus operators are expensive so they implemented a slightly faster version of Euclidean Algorithm. Why not make a more efficient mod?

int mod(int a, int b) { // computes a % b;
	return (a - b * (a / b));
}
  • Vote: I like it
  • +87
  • Vote: I do not like it

| Write comment?
»
4 years ago, # |
  Vote: I like it +72 Vote: I do not like it

afaik / and % are expensive compared to + and *

»
4 years ago, # |
  Vote: I like it +219 Vote: I do not like it

Cause / is approximately as expensive as %.

»
4 years ago, # |
  Vote: I like it +130 Vote: I do not like it

Actually on x86 division instruction DIV (or IDIV for signed integers) computes both quotient and reminder, just stores them in different registers. So obviously your "more efficient" version can't work faster than a % b;

»
4 years ago, # |
  Vote: I like it +32 Vote: I do not like it

I liked your blog! As explained in the comments, apparently this would not be faster, but I agree with you that mod operator are famously said to be the ones that are slow, I never stopped to think about the division operator, that should be as slow too! I’m sorry that your blog got downvoted. It made me have a better understanding of predicting the run time of a code

»
4 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Intel wants to know your location.......

On a serious note, it will be helpful to see not them as O(1) operations on 32 bit but asymptotic complexity on variable number of bits. Then you'll appreciate that modulus isn't much of a different problem than division.

»
4 years ago, # |
  Vote: I like it -48 Vote: I do not like it

You can do a binary lifting/binary search mod operation. I really don’t know whether it’s faster or not.

»
4 years ago, # |
Rev. 2   Vote: I like it +60 Vote: I do not like it

https://godbolt.org/z/7W35Me

Spoiler
»
4 years ago, # |
  Vote: I like it 0 Vote: I do not like it

You could get a speed up sometimes by doing this:

a = a >= b ? a % b : a;

The more versatile option that always works is, write assembly instructions to perform the modulus operation. It gives quite a bit of speed up.

  • »
    »
    4 years ago, # ^ |
    Rev. 2   Vote: I like it +4 Vote: I do not like it

    If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

    • »
      »
      »
      4 years ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      Not sure. But the key is to always do a benchmark. We cannot trust compilers to do magic all the time can we? Anyway, I have personally tried using the assembly trick before and it worked pretty well.

      • »
        »
        »
        »
        4 years ago, # ^ |
          Vote: I like it +4 Vote: I do not like it

        Can you share your results? Would be nice to see the methodology and the final numbers to understand better.

        • »
          »
          »
          »
          »
          4 years ago, # ^ |
            Vote: I like it +95 Vote: I do not like it

          bruh your shirt is orange again

        • »
          »
          »
          »
          »
          4 years ago, # ^ |
          Rev. 4   Vote: I like it +65 Vote: I do not like it

          Disclaimer: Please try not to bash me for the sample size. I don't have time to record data for different problems (but I have tried the trick on several other problems before). Hence, I have only presented one below. You may perform your own benchmarks too. Lastly, the assembly code does not belong to me, I shamelessly peeled it off Kaist's online ICPC team notebook).

          Firstly, here is a sample problem from CF.

          To give some context, I read about a failing submission for this problem from Petr's blog due to a large number of modulo operations. In particular, this sentence:

          maroonrk's C failed due to trying to squeeze $$$10^8$$$ modulo operations in the time limit

          The optimization(s)
          I only modified this line of code in the main function (which was identified by Petr to be causing the TLE):

          code

          Note that I have used 2 optimizations (I call them optimization 1 and optimization 2 below):

          Optimization 1 refers to writing modulo in the form a = a >= b ? a % b : a;
          Optimization 2 refers to writing modulo in assembly code.


          The benchmarks

          The original code which uses modulo operation naively. Result: TLE on test 49 (>1000ms).

          Using optimization 1 with C++11: Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 + 2 with C++11: Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 with C++17 (32-bit): Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 + 2 with C++17 (32-bit): Code. Result: AC (982ms). Due to the closeness of the runtime to the time limit, I submitted twice to be sure. Both submissions yielded the same runtime.

          Using optimization 1 with C++17 (64-bit): Code. Result: TLE on test 51 (>1000ms)

          Using optimization 1 + 2 with C++17 (64-bit): Code. Result: AC (545ms)

          Oh, yes. Here is the modulo subroutine in case you are interested to test it out yourself:

          Code
    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it +7 Vote: I do not like it

      If this always works and gives "quite a bit of speed up", why doesn't the C++ compiler just do that too?

      Because you waste time on comparison and branching. Similarly, it isn't easy to say if sort() should first check if the sequence is already sorted and then finish in $$$O(n)$$$.

      • »
        »
        »
        »
        4 years ago, # ^ |
        Rev. 2   Vote: I like it 0 Vote: I do not like it

        You are talking about the conditional if a >= b version that LanceTheDragonTrainer said sometimes works. I was asking about the assembly version that LTDT said always works.

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it +19 Vote: I do not like it

      The compiler has to make sure to produce code that works correctly for all possible int values, we don't. In particular for this case I believe there are some odd corner cases if you allow numbers to be negative.

      Just tell the compiler that you are modding unsigned integers, and you get a code that runs at around the same speed (slightly faster, even) than Lance's assembly version: 88112311

  • »
    »
    4 years ago, # ^ |
      Vote: I like it -9 Vote: I do not like it

    FYI Branches are much more expensive than integer division/modulo operators.

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it +17 Vote: I do not like it

      Implementing addition of two values $$$a, b \in [0, P-1]$$$ as return a+b<P ? a+b : a+b-P; is actually faster than (a+b)%P.

      • »
        »
        »
        »
        4 years ago, # ^ |
        Rev. 2   Vote: I like it +13 Vote: I do not like it

        Benchmarks:
        14700593 CPU ticks with a+b<P ? ... : ...
        11126168 CPU ticks with just (a+b) % P

        Code
        • »
          »
          »
          »
          »
          4 years ago, # ^ |
            Vote: I like it +26 Vote: I do not like it

          Your solution spends most time on computing random() % P and that includes computing that random value. Running your program multiple times gave me inconsistent results but the x?y:z version was faster by a few percents usually.

          The x?y:z version is more than twice faster if it's really a bottleneck of a solution https://ideone.com/8m0qWb (0.56s vs. 1.46s)

          • »
            »
            »
            »
            »
            »
            4 years ago, # ^ |
              Vote: I like it -15 Vote: I do not like it

            Well, your solution spends most time on data access :)

            Actually it does not matter where most time is spent if it is the same for both versions because you always can subtract it from total times and compare the rests.

            BTW, I think we need another test.

          • »
            »
            »
            »
            »
            »
            4 years ago, # ^ |
              Vote: I like it -14 Vote: I do not like it

            Kamil, unfortunately your test code cannot be used for benchmarking branch-misses (details under spoilers).

            (a+b)%P
            (a+b)<P
      • »
        »
        »
        »
        4 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Branches are not expensive if there's a pattern that branch predictor can learn. While implementing addition most of the time the result will not overflow so a predictor which outputs false would be good enough for you.

        There are lot of things at play like speculative execution and other low level CPU stuff. Modern CPU are quite complicated to make a rule of thumb.

  • »
    »
    8 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    this doesn't always produce correct result

»
4 years ago, # |
Rev. 2   Vote: I like it +3 Vote: I do not like it

I don't know if this works, but it might help. It's a fast way to reduce a%b under some loose constraints (Barret Reduction).

  • »
    »
    4 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    So smart your link is!

  • »
    »
    4 years ago, # ^ |
    Rev. 2   Vote: I like it +10 Vote: I do not like it

    slightly faster than origin %: 827ms vs 643ms in my computer

    #include <bits/stdc++.h>
    #define watch(x) std::cout << (#x) << " is " << (x) << std::endl
    using LL = long long;
    constexpr LL M  = 1e9 + 7;
    constexpr int  k = std::__lg(M) + 2;
    constexpr LL m = (1LL << k) / M;
    
    const int N = 1e8 + 2;
    LL fac[N];
    void init1(){
    	fac[0] = 1;
    	for (int i = 1; i < N; ++i) fac[i] = fac[i - 1] * i % M;
    }
    void init2() {
    	auto mod = [&](LL &a) {
    		LL r = a - ((a * m) >> k) * M;
    		if (r >= M) r -= M;
    	};
    	fac[0] = 1;
    	for (int i = 1; i < N; ++i) mod(fac[i] = fac[i - 1] * i);
    }
    int main() {
    	//freopen("in","r",stdin);
    	std::ios::sync_with_stdio(false);
    	std::cin.tie(nullptr);
    
    	auto start1 = std::chrono::high_resolution_clock::now();
    	init1();
    	auto end1 = std::chrono::high_resolution_clock::now();
    	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count() << " (ms)" << std::endl;
    
    	auto start2 = std::chrono::high_resolution_clock::now();
    	init2();
    	auto end2 = std::chrono::high_resolution_clock::now();
    	std::cout << "Time used: " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count() << " (ms)" << std::endl;
    
    	return 0;
    }
    
»
4 years ago, # |
  Vote: I like it +13 Vote: I do not like it
»
4 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Well, I see a quite different scenario with python.

In [1]: %timeit (503043530435 % 232039042)
Out[1]: 8.24 ns ± 0.207 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In [2]: %timeit mod(503043530435, 232039042)
Out[2]: 231 ns ± 8.43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I used IPython's %timeit to calculate the time difference and found mod(a, b) to be more expensive than %.

  • »
    »
    4 years ago, # ^ |
      Vote: I like it -12 Vote: I do not like it

    Can you check the runtime of this

    int mod(int a, int b) {
        return a >= b ? a % b : a;
    }
    

    BTW, this was proposed by LTDT

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it +8 Vote: I do not like it

      This is usually quite unhelpful because most of the time in the worst case mods are required at every step, and if you are at the point where this is the difference between AC and TLE, then it is probably better to remove unnecessary mods (i. e. just mod once after several additions, rather than after each one).

      This also has overhead caused by the potential branch (which, in general, should probably always be used). The article linked by sys. provides a small speedup, but for most cases is probably not necessary.