Hello everybody!
If you want to binary search over an array whose size is known at compile time (and can be padded up to a power of two), you can use the following simple template:
template<int n, typename T>
int lowerBound(const T* __restrict arr, const T val) {
    // `if constexpr` (C++17) is needed here: a plain `if` would still
    // instantiate the recursive branch for n == 1 and never terminate
    if constexpr (n == 1) {
        return val > arr[0];
    } else {
        constexpr int mid = n / 2;
        if (val > arr[mid]) { return mid + lowerBound<n - mid, T>(arr + mid, val); }
        else { return lowerBound<mid, T>(arr, val); }
    }
}
Testing on ideone shows that it is about 4 times faster than std::lower_bound.
Thanks for reading this.
UPD. For doubles it is about 3.5 times faster.
There is an issue with how you do your testing.
The compiler could make use of the fact that the inner loop doesn't depend on i. I tried removing the i for loop and switching to $$$2^{23}$$$ queries, see here. The times are now much more similar: 1.56 s vs 1.19 s.
This testing method was chosen to reduce the cost of loading queries from memory into cache. If the compiler had used the fact that the inner loop doesn't depend on i, only one iteration would have been executed, since there is no JIT compilation in GCC. The runtime is large, so I think all $$$2^{15}$$$ iterations were actually executed.
Another test with a xorshift32 random generator shows 1.68 s vs 1.08 s.
First of all: your results are very much skewed towards lowerBound. This could be for several reasons:
1) You execute std::lower_bound first, thus filling the caches and perhaps allowing better execution for the following iteration. When trying to prove that something is slower than something else, isolate things; probably execute on different CPU cores and such too. I'd suggest making two separate binaries to avoid such effects and running them multiple times.
2) Welp, you know this could be because your code invokes undefined behavior? You pretty much attempt to store $$$2^{20} \cdot 2^{23}$$$ in an int res. You can test this by adding #pragma GCC optimize ("trapv") so that your code exits on int overflow (but don't use it when comparing performance results).
3) By initializing static variables and other stuff, you skew the results towards the compile-time solution. This is pretty easy to isolate in this case by writing a generator that pipes exactly the same data into your program while not being compile-time available.
That said, I've tried isolating all the stuff I've mentioned, and I still see a 2-3x performance gain, and I have no fucking idea why it is so fast. This is amazing and I wonder if anybody knows why.
I've read some assembly and couldn't understand any of the optimizations whatsoever, but the one thing I noticed is that it actually does execute all the queries; the results don't appear to be cached anywhere in the compiled binary.
I will tell you.
The first and most impressive one you can see in the image below:
Here the compiler copies all 255 array elements that can be accessed in the first 8 steps of the binary search into stack memory, so they can be accessed without cache misses. And it's wow! I didn't expect such a deep level of optimization logic from any compiler!
But the sad part of this story is that this happens only because many iterations of the same test are performed, and the compiler knows this in advance. In most real problems, such an optimization cannot be expected. So the test itself is really not that good.
The second optimization is quite simple and obvious: the compiler groups several subsequent calls of the template function, so sizes and indices are known at compile time for these steps, and the checks of the binary search become a plain sequence of cmp, jmp, cmp, jmp without any indexing math.
PS. My test: https://ideone.com/AL5ZNd
Ah, so it is caching after all.
I think a practical implementation of such a binary search would look like this (though this particular optimization should cache the last N calls instead of the first ones): https://codeforces.me/blog/entry/65278#comment-492839
But I couldn't confirm the idea of cache locality via valgrind, so I didn't think in that direction. Can you elaborate on how to prove that such an optimization improves L1 hits with some CPU counter tool?
You're right that there's caching, but it's not caused by allocating and copying to the stack, but by the fact that just accessing this memory puts it to L1 cache. The stack is good because it's often faster to compute addresses, memory access speed only depends on locality of the data you're accessing.
You are right and a little bit wrong at the same time =) Accessing the memory really does put it into the L1 cache, but there is no guarantee it will remain there. The optimization here exists exactly because of the locality of the copied data: instead of accessing the sparsely located elements of 8 steps of the binary search, we access the copied data, which is located densely. And it really doesn't matter whether it is stack memory or not; I just stated that fact about this case.
Actually, we can apply such an optimization of binary search ourselves; it was described in this blog (Russian only) 9 years ago. There the author used three levels of data and got a 60 percent speedup.
Yes, we're saying the same thing then. It just sounded like you said that the stack helps caching.
(Technically, data doesn't have to be densely packed to fit in cache. It just has to be packed on distinct cache lines — in practice, this is some requirement on lower bits of their addresses. This is architecture-dependent, so I guess the compiler makes sure it's not a concern.)