# | User | Rating |
---|---|---|
1 | jiangly | 3976 |
2 | tourist | 3815 |
3 | jqdai0815 | 3682 |
4 | ksun48 | 3614 |
5 | orzdevinwang | 3526 |
6 | ecnerwala | 3514 |
7 | Benq | 3482 |
8 | hos.lyric | 3382 |
9 | gamegame | 3374 |
10 | heuristica | 3357 |
# | User | Contrib. |
---|---|---|
1 | cry | 169 |
2 | -is-this-fft- | 165 |
3 | Um_nik | 161 |
3 | atcoder_official | 161 |
5 | djm03178 | 157 |
6 | Dominater069 | 156 |
7 | adamant | 154 |
8 | luogu_official | 152 |
9 | awoo | 151 |
10 | TheScrasse | 147 |
Name |
---|
as a co-author, i helped making the meme and the post.
please upvote if you find the post helpful. we might post more about technical stuff in the future.
Finally a blog that explains behind the scenes for pragma spells black magic. Thanks ! :pkinglove:
Very nice post. Most people in the competitive programming community are using pragmas very imprudently like if it was some magic, so demystifying them is extra important.
One thing to note is that most SIMD extensions are backwards compatible, so you only need to include the latest one instead of the
sse,sse2,sse3...
mess that people often use (except for the situational ones likepopcnt
). So just leaving the following two cover your needs in 95% of the cases:Also, it may or may not be beneficial to ask your compiler to tune your code for a specific architecture. GCC isn't very good at that, but it almost certainly won't make things worse:
All in all, I guess for competitive programming this setup covers almost all and has a very low chance of backfiring:
Also, I wrote a broader article on the topic, and it has some other compiler optimizations if someone is interested.
Thanks for the comments, the backwards compatibility in most people's code is quite a mess (hence it warranted its own place in the meme :)).
I tend to avoid
tune=native
since it sometimes breaks some stuff locally/on the judge (for instance, compilation error here: https://godbolt.org/z/53c877nbh ). I haven't tried that yet on CF though. Also I tend to not use Ofast since I've heard about people's code breaking on geometry problems, and a couple of times it led to much slower code than O3 (though it's quite rare, so it should be fine anyway).Your book that you linked is pretty nice by the way, hoping to see more stuff like that!
You explained why sse, mmx etc are redundant because of avx2. But why omit the bit manipulation stuff like popcnt, lzcnt, abm, bmi, bmi2? Can you give an example of when these can "backfire"? Also it seems like fma is not always covered by avx2 either right?
I am tempted to use this as my default now:
(and @nor, it would super useful if you had a correct copy-and-pastable snippet in your tldr. If I googled your blog post during the contest I would probably end up copying the first block of code I saw, which is the one that does nothing lol)
EDIT: sslotin I just started looking at your book and it's fucking amazing! I highly recommend anyone who was interested enough to click on this blog post to read it too.
I think tune=native should be used cautiously; you should check if it compiles.
Also regarding the copy-and-pastable snippet, I mentioned in the blog that it depends quite a bit on the specific judge machines, so it's not exactly copy-and-pastable. And if anything, it should already be in your template so you don't need to copy paste from this blog post :)
However I added the pragmas that I tend to use on Codeforces just for easy reference. Hopefully that helps.
By backfire I mean to cause the program to either crash or slow down significantly, or in general to behave differently on your laptop and on the testing system (without necessarily causing runtime errors). The more hardware-specific options and optimizations are enabled, the higher the chances of that happening.
To be safe you need to find out the exact microarchitecture the server is running, look up the list of its extensions and intersect it with what you have locally. I did this for CodeForces a while back when I was curious, and found out it was running a Skylake or something very similar, which has these extensions listed in the docs:
There are a lot of them, and I don't really know what a third of them does or which are included in which. The reason I recommend only leaving
avx2
is because it is the only thing that is almost guaranteed to also be enabled on your workstation (unless you are an Apple M1 user) and also probably the only extension that really matters — I doubt anyone has ever been bottlenecked by the throughput of popcount, to be honest.Actually, I think for CodeForces in particular this is what you should to do when submitting:
That's it, nothing else except optimization flags. This not only enables all available extensions and only them, but also tells the compiler additional info about the exact microarchitecture such as instruction latencies (this option implies
tune=skylake
). But be aware that if you leave this pragma uncommented when testing locally, there is a small chance it will crash or behave differently unless you are also running a Skylake.On some onsite contests like the ICPC WF it is guaranteed that the testing servers will be identical to the workstations, so you can query the platform with
g++ -march=native -Q --help=target | grep march
and replaceskylake
with that.(I have not tested it though. Please someone try it and tell if I am wrong.)
There's a quick way to check if the targets you're interested in are supported on an unfamiliar online judge:
assert(__builtin_cpu_supports("<target-name>"));
will throw a runtime error if the CPU doesn't support that target.For an onsite contest, since you would have access to machines with the same configuration as the judging machines, it's nicer to run
g++ -march=native -E -v - </dev/null 2>&1 | grep cc1
and see which targets work (-mno-avx
means avx is not supported,-mavx
means avx is supported).Specifying architectures inside pragmas is kind of hopeless on GCC, for instance,
#pragma GCC target("arch=skylake")
(or anything trying to pass arch/tune parameters to the target) gives a compilation error: https://godbolt.org/z/K5xTeG3PvLook at the errors:
attribute value 'arch=skylake' was already specified in 'target' attribute
. Some concurrency-related parts of the GCC C++ library are using exactly this method for performance, but I guess they are conflicting with ours because you can't specify architecture twice or something.If we put that pragma after the library import this seems to work: https://godbolt.org/z/dWYPjfnWP. Not sure which parts of the standard library will be optimized in this case though.
UPD: it also works without changing anything in older (≤8) GCC versions.
Yeah, as far as I can see, it suffices to keep only iostream before the import to avoid the compilation error. Too bad that
bits/stdc++.h
doesn't work here.I tried making sure that some STL functions get optimized, but unfortunately couldn't manage to do that: https://godbolt.org/z/rh59TGbPE (so as of now I can't find a way to use
arch=skylake
fruitfully).I think this is a bug in GCC 9. When defining hardware-specific library implementations, you are supposed to use
#pragma GCC push_options
with target, implement a function or a module, and then pop it back (which is a way to do target-inside-target), but in that one file that causes the error it is not done like that.It is easy to fix, but for already installed compilers there is nothing we can do unfortunately. Still works on the C++17 CF compiler and most other platforms though.
Polygon claims to judge on the i3-8100, so targeting Coffee Lake might be ideal. Though as far as I know Coffee Lake has no drastic architectural changes from Kaby Lake or Skylake, so it may not lead to a noticeable benefit over targeting Skylake.
I did some investigation, and yes, you are right: it is in fact Coffee Lake. And it shows the same model number running the gist I linked above.
![](https://i.ibb.co/rQn889h/Screenshot-from-2021-10-27-21-40-59.png)
However, as you noted, there aren't many architectural changes from Kaby Lake or Skylake. In fact, in terms of optimization-relevant features, there aren't any at all, so GCC does not distinguish between them and simply groups them all as "skylake".
Norosity
As if there is a second upvote button, I would smash it to this blog.
smash yourself instead :>
No thank you. I already smashed my rating down to pupil few days ago :>
norz
norz
Lets not forget ToxicPie9 also co-authored the blog !
ToxicPie9 orz
norzosity
Let's petition MikeMirzayanov to simply add
-march=native
(or-march=skylake
or whichever architecture invokers are running on) to the compilation flags the next time he adds a new C++ compiler (preferably Clang). Then all thetarget
pragmas can be omitted completely, only leaving optimization options which are situational.This ensures that the compiler knows everything about the hardware while not increasing compilation time — in fact, it will probably save a lot of CPU resources (on average by 10-20% on all submissions would be my guess). I have no idea why it is not done so on all online judges.
Heartfelt thanks from the bottom of my heart.
nor orz :prayge:
About AVX2. These instructions are also notorious for downclocking the CPU, depending on the number of CPU cores that are using AVX2 simultaneously. The exact effect of this varies depending on the CPU model. See https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/ (it's mainly about AVX-512, but AVX2 also has similar properties to some extent).
So on the Codeforces platform, the AVX2 pragmas may be potentially not very neighbour-friendly among other things. Imagine a quadcore judge machine, which processes 4 submissions in parallel. If 3 of them are actively using AVX2, then it's possible that the CPU may downclock and affect the performance of the unsuspecting 4th non-AVX2 submission too. Again, this may be a non-issue depending on the CPU model, but it would be great if Codeforces could try to do some experiments and benchmarks in this scenario.
The best solution may be just to disable AVX2 instructions on the judge machines.
As far as I'm aware AVX2 is nowhere near the power hog that AVX512 is.
See this analysis for some rough numbers in some CPU limited workloads. It is for rocket lake, so its applicability to most judges that are still using the skylake-like architectures isn't 1:1, but it still paints a rough picture of the effect of AVX/AVX2 on power draw.
Also I would highly expect that any decent judge would completely disable core turbos and/or take other measures to mitigate noisy neighbor issues even in the case of full loading of across all cores. And I do believe no judge right now supports AVX-512 for this exact reason.
Hi nor and ToxicPie9,
In the problem CF Edu. Round Problem B. I faced an issue, first I wrote the code without using any pragmas and it was accepted. 133543512
Then I tried with some pragmas
#pragma GCC optimize("Ofast,unroll-loops") #pragma GCC target("avx2,popcnt,lzcnt,abm,bmi,bmi2,fma,tune=native")
On using them it displayed wrong in testcase 6.133542940.
Can you or anyone please explain, why this was happened. Any help will be very helpful. Thanks
PS: The error was caused by 2nd pragma, but I don't know why that was happening.
I think the issue is unrelated with pragmas;the solution should be incorrect even without them. Update: it is indeed related to pragmas in a weird way, see the comment belowOn line 88, you calculated the answer of
ceil(rem / k)
by dividing two doubles. However, thedouble
type can only store 52 bits of precision. Since both numbers can be close to 1e18 in magnitude, when you convert them to doubles, you might lose some precision and their values might change. (For example, if you convert the value10000000000000001
($$$10^{16}+1$$$) intodouble
, it becomes10000000000000000
($$$10^{16}$$$).) If one of the valuesrem
ork
changes during conversion, you are likely to get a wrong answer.The more precise way to calculate the answer is to do integer division instead -- for positive integers $$$a, b$$$, the exact result of $$$\lceil\frac{a}b\rceil$$$ can be calculated in C++ with:
a % b == 0 ? a / b : a / b + 1
a / b + (a % b != 0)
1 + (a - 1) / b
In fact, I do not understand why one of your solutions got accepted... Maybe the compiler did something suspicious? There's no issue on my end :thonk:Thanks ! For your nice and to the point explanation <3.
Update: Seems like it is caused by an "excess precision" issue that happens on specific compilers on Codeforces. See https://codeforces.me/blog/entry/78161 for more details.
After some testing, only "GNU G++17 (7.3.0)" on Codeforces seem to make the code use extra precision and output the correct answer. In this case pragmas is actually related to the problem: targets like AVX2 support 64-bit floating points, which fixes the excess precision issue and causes the submission to get WA correctly.
Are there any useful pragmas for controlling stack frame sizes in recursive functions for optimizing MLE? (today's div3's 1607F - Робот на доске 2 was a good example, and also this past discussion)
It doesn't seem like
#pragma GCC optimize("Os")
does much. Looking at the list of gcc optimization flags, it does seem like this is something you can control but I don't know how to specify them as pragmas.From: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
-fconserve-stack Attempt to minimize stack usage. The compiler attempts to use less stack space, even if that makes the program slower. This option implies setting the large-stack-frame parameter to 100 and the large-stack-frame-growth
You can use
#pragma GCC optimize("conserve-stack")
for that. To check if this pragma works, you can use the compiler flags mentioned in the blog. In general, if you have an option like-fsomething
, it makes sense to try#pragma GCC optimize("something")
and see if it works.Thanks! Unfortunately it doesn't seem like
conserve-stack
did anything for my problem. I want to try the other flags on that page too but I still can't figure out the pragma syntax for stuff like--param=large-stack-frame-growth=100
.In particular I am trying to get a recursive solution for 1607F to pass. I did get it working for C++17 32 bit, but not for 64 bit (neither C++20 nor C++17): https://codeforces.me/contest/1607/submission/134144039
Do you have any tips?
Why didn't you include tzcnt target in your pragmas list?
tzcnt is an instruction that is supported in the bmi instruction set, so it isn't needed to be mentioned separately. Also, the last I checked, it gave me an unrecognized target error.
Can anyone explain which pragma function causes error and how ?
Solution 1-> Pragma on https://codeforces.me/contest/1737/submission/175313327
Solution 2-> Pragma off https://codeforces.me/contest/1737/submission/175313262
It is probably due to some pragma changing how excess precision (in this case) is handled by the C++ compiler on Codeforces you are using (which is 32-bit, and there are a lot of weird things that happen with precision with the Codeforces 32-bit C++ compiler), but I might be wrong. It is probably because the compiler could be performing some optimizations in the code without pragmas but not the one which has the pragmas (the pragmas might be telling the optimizer to not optimize that code but do some other optimization instead), that it is able to keep extra precision in the code without the pragmas.
Ideally, your code should have gotten WA both times, since you are using
std::sqrt
on along long
which causes loss of precision. So there is nothing wrong with the pragmas (you were just relying on a compiler specific behavior). Submitting on a 64-bit compiler validates the above theory, the relevant submission can be found here: 175318712.Thanks
Very informative elaborating what is happening underneath the hood. I reached this blog from google. It helped me get this 552D - Ваня и треугольники from 1600 ms to 1270 ms using a brute-force approach.