I recently found about the asm
keyword in c++ that allows us to write inline assembly in c++. Is there any way to cheese solutions by it? I googled but they say "constants don't matter, focus on complexity.." I KNOW, THANK YOU. I want to know will it make my n^3 pass in n<=1000
asm reference
Compiler already performs a lot of optimizations and ultimately C++ produces assembly code before the machine code. Writing assembly would only be viable if you have a lot of time (which you don't have in contests) AND provided that you are expert at assembly language.
Mostly not.
It's only marginally faster, you can choose the same speed if you use a better algorithm. Inline asm can be long to type. And difficult too, imagine doing DP and having to deal with stack & frame pointers, not fun. Do it because you want to get better at assembly not for faster cf solutions. And don't do them for contests.
No it won't. Modern compilers optimize code well enough. I sometimes rewrite my code in assembly but only gain ~1.3 speed increase. BTW, if you like such optimizations, take a look at SIMD intrinsics, they may halve the execution time in some cases.
Just for some more info about SIMD parallelism, if you write code in a compiler-friendly way (alignment of data -- as well as removing data dependencies -- in a manner that allows the compiler to optimize using SIMD parallelism), then it's quite reasonable to expect the compiler to use SIMD wherever possible, so the OP doesn't even need to worry about writing code using SIMD at all.
What follows is some general advice for squeezing in solutions within strict time limits, after you've done all possible optimizations like bitsets/replacing data structures/fast io.
What could possibly lead to a better performance boost in many practical cases would be to try to avoid branches (and thereby reducing checks) as much as possible. I would personally not recommend doing this, since one extreme of this would be using goto in your code where you are guaranteed to satisfy certain conditions, which can be confusing and harder to debug in some cases. A practical way of avoiding branches is to directly encode conditions into the answer (for a very simple example, if you want ceil(x / 2), you can try (x >> 1) + (x & 1) or (x + 1) >> 1 instead of checking for x modulo 2).
Another more important thing is the cache-friendliness of your algorithm, but unfortunately (or fortunately), most solutions to the same problem have similar cache-friendliness, so that doesn't make a lot of difference in CP, but you can try to ensure that you have the least number of cache misses possible if you're trying to squeeze in your solution. There was a discussion about the implementation of a sparse table some time ago, which showed how significant cache misses really are (most of the time taken by a program is spent waiting for the data, and L1-caches are orders of magnitude faster than L3-caches, so you can get an idea about this effect).