
I've come across a fast CRC computation implementation using PCLMULQDQ. I see that the authors mix pxor and xorps instructions heavily, as in the fragment below:

movdqa  xmm10, [rk9]
movdqa  xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm0
movdqa  xmm10, [rk11]
movdqa  xmm8, xmm1
pclmulqdq xmm1, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm1

Is there any practical reason for this? A performance boost? If so, what lies beneath it? Or is it just a matter of coding style, for fun?

xorps is a three-byte instruction, while pxor takes four bytes. Other than that, Agner Fog's instruction tables and microarchitecture manuals indicate that it doesn't hurt on AMD, since xorps is treated as integer-domain there. It could hurt performance on pre-Skylake Intel, though: xorps can't use as many execution units there, and there may be bypass delays. – EOF Oct 3, 2016 at 8:39

@EOF: I'm guessing it's tuned for Intel SnB/IvB, based on the date (and that it's written by Intel). Alignment for the uop cache seems like the best guess, but maybe there's something going on with avoiding a resource conflict so as not to delay the next PCLMUL. – Peter Cordes Oct 3, 2016 at 10:00

TL;DR: it looks like maybe some microarchitecture-specific tuning for this specific code sequence. There's nothing "generally recommended" about it that would help in other cases.

On further consideration, I think @Iwillnotexist Idonotexist's theory is the most likely: this was written by a non-expert who thought this might help. The register allocation is a big clue: many REX prefixes could have been avoided by choosing all the repeatedly-used registers in the low 8.

XORPS runs in the "float" domain, on some Intel CPUs (Nehalem and later), while PXOR always runs in the "ivec" domain.

Since wiring every ALU output to every ALU input for forwarding results directly would be expensive, CPU designers break them up into domains. (Forwarding saves the latency of writing back to the register file and re-reading). A domain-crossing can take an extra 1 cycle of latency (Intel SnB-family), or 2 cycles (Nehalem).
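As an illustration only, the latency cost described above can be put into a toy calculator (my own sketch, not from any manual): sum the latencies along a dependency chain and add a penalty each time the chain crosses forwarding domains, using the 1-cycle SnB-family and 2-cycle Nehalem figures from the paragraph above.

```python
# Toy model of dependency-chain latency with forwarding-domain penalties.
# An op is (latency_cycles, domain); a penalty is added every time the
# chain crosses between domains (e.g. PXOR "ivec" feeding XORPS "float").

def chain_latency(chain, crossing_penalty=1):
    """Sum op latencies, adding `crossing_penalty` cycles at each
    domain crossing (1 on Intel SnB-family, 2 on Nehalem)."""
    total, prev_domain = 0, None
    for latency, domain in chain:
        if prev_domain is not None and domain != prev_domain:
            total += crossing_penalty
        total += latency
        prev_domain = domain
    return total

# All-integer chain: pxor feeding pxor, no crossing.
print(chain_latency([(1, "ivec"), (1, "ivec")]))                        # 2
# Mixed chain: pxor feeding xorps, one 2c crossing on Nehalem.
print(chain_latency([(1, "ivec"), (1, "float")], crossing_penalty=2))   # 4
```

The point of the model is only that a mixed pxor/xorps chain can be strictly slower than an all-pxor one on CPUs that charge for the crossing; whether a given CPU actually charges it is exactly what the rest of this answer digs into.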

Further reading: my answer on What's the difference between logical SSE intrinsics?

Two theories occur to me:

  • Whoever wrote this thought that PXOR and XORPS would give more parallelism, because they don't compete with each other. (This is wrong: PXOR can run on all vector ALU ports, but XORPS can't).

  • This is some very cleverly tuned code that creates a bypass delay on purpose, to avoid a resource conflict that might delay the execution of the next PCLMULQDQ (or, as EOF suggests, code-size / alignment might have something to do with it).

  • The copyright notice on the code says "2011-2015 Intel", so it's worth considering the possibility that it's somehow helpful for some recent Intel CPU, and isn't just based on a misunderstanding of how Intel CPUs work. Nehalem was the first CPU to include PCLMULQDQ at all, and this is Intel so if anything it'll be tuned to do badly on AMD CPUs. The code history isn't in the git repo, only the May 6th commit that added the current version.

    The Intel whitepaper (from Dec 2009) that it's based on used PXOR only, not XORPS, in its version of the 2x pclmul / 2x xor block.
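To make the data flow of such a 2x pclmul / 2x xor block concrete, here is a hedged Python model (an illustration of what the fragment in the question computes, not the actual CRC code; register names follow the fragment). pclmulqdq with imm8 0x11 carry-less-multiplies the high 64-bit halves of its two operands, 0x00 the low halves, and both 128-bit products get XORed into the accumulator.

```python
M128 = (1 << 128) - 1

def clmul64(a, b):
    """Carry-less (GF(2)) multiply of two 64-bit values -> up to 127 bits."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def pclmulqdq(x, y, imm8):
    """Model of PCLMULQDQ: imm8 bit 0 selects the low/high qword of x,
    bit 4 the low/high qword of y; the chosen halves are
    carry-less-multiplied."""
    xs = (x >> 64) if (imm8 & 0x01) else (x & ((1 << 64) - 1))
    ys = (y >> 64) if (imm8 & 0x10) else (y & ((1 << 64) - 1))
    return clmul64(xs, ys)

def fold_step(xmm7, xmm0, rk):
    """One 2x pclmul / 2x xor block from the question:
       hi = clmul(high halves), lo = clmul(low halves), both XORed in."""
    hi = pclmulqdq(xmm0, rk, 0x11)   # pclmulqdq xmm0, xmm10, 0x11
    lo = pclmulqdq(xmm0, rk, 0x00)   # pclmulqdq xmm8, xmm10, 0x0
    return (xmm7 ^ hi ^ lo) & M128   # pxor xmm7, xmm8 / xorps xmm7, xmm0
```

Since XOR is XOR regardless of which instruction performs it, pxor and xorps are interchangeable for correctness here; the whole question is purely about ports, domains, and code size.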

    Agner Fog's table doesn't even show a number of uops for PCLMULQDQ on Nehalem, or which ports they require. It's 12c latency, and one per 8c throughput, so it might be similar to Sandy/Ivybridge's 18 uop implementation. Haswell makes it an impressive 3 uops (2p0 p5), while it runs in only 1 uop on Broadwell (p0) and Skylake (p5).

    XORPS can only run on port5 (until Skylake, where it also runs on all three vector ALU ports). On Nehalem, it has a 2c bypass delay when one of its inputs comes from PXOR. On SnB-family CPUs, Agner Fog says:

    In some cases, there is no bypass delay when using the wrong type of shuffle or Boolean instruction.

    So I think there's actually no extra bypass delay for forwarding from PXOR -> XORPS on SnB, so the only effect would be that it can only run on port 5. On Nehalem, it might actually delay the XORPS until after the PSHUFBs were done.

    In the main unrolled loop, there's a PSHUFB after the XORs to set up the inputs for the next PCLMUL. SnB/IvB can run integer shuffles on p1/p5 (unlike Haswell and later, where there's only one shuffle unit, on p5, but it's 256b wide for AVX2).

    Since competing for the ports needed to set up the input for the next PCLMUL doesn't seem useful, my best guess is code size / alignment if this change was done when tuning for SnB.

    On CPUs where PCLMULQDQ is more than 4 uops, it's microcoded. This means each PCLMULQDQ requires an entire uop cache line to itself. Since only 3 uop cache lines can map to the same 32B block of x86 instructions, this means that much of the code won't fit in the uop cache at all on SnB/IvB. Each line of the uop cache can only cache contiguous instructions. From Intel's optimization manual:

    All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
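As a back-of-the-envelope illustration of that packing rule, here is a toy model (my own sketch, not from Intel's manual) that packs one 32-byte window's worth of instructions into uop-cache ways, using the limits cited above: at most 3 ways per aligned 32B window, at most 6 uops per way, and a microcoded instruction claimed to need a whole way to itself.

```python
def ways_needed(instructions):
    """instructions: list of (uop_count, is_microcoded) for one aligned
    32-byte window.  Returns how many uop-cache ways (lines) the window
    needs; more than 3 means it can't live in the uop cache and must run
    from the legacy decoders."""
    ways, current = 0, 0           # current = uops packed into the open way
    for uops, microcoded in instructions:
        if microcoded:
            if current:            # close the way we were filling
                ways, current = ways + 1, 0
            ways += 1              # microcoded insn: a whole way to itself
        else:
            if current + uops > 6: # way is full, start a new one
                ways, current = ways + 1, 0
            current += uops
    return ways + (1 if current else 0)

# The question's block: 2x movdqa, 2x pclmulqdq (microcoded pre-Haswell),
# pxor, xorps -- supposing, hypothetically, all six land in one window.
block = [(1, False), (1, False), (18, True), (18, True), (1, False), (1, False)]
print(ways_needed(block))   # 4 -> exceeds the 3-way limit
```

Under these (illustrative) assumptions a window holding two microcoded PCLMULQDQs plus their setup already needs 4 ways, which is the sense in which much of this code "won't fit in the uop cache at all" on SnB/IvB.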

    This sounds like a very similar issue to having integer DIV in a loop: Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs. With the right alignment, you can get it to run out of the uop cache (the DSB in Intel performance counter terminology). @Iwillnotexist Idonotexist did some useful testing on a Haswell CPU of micro-coded instructions, showing that they prevent running from the loopback buffer. (LSD in Intel terminology).

    On Haswell and later, PCLMULQDQ is not microcoded, so it can go in the same uop cache line with other instructions before or after it.

    For previous CPUs, it might be worth trying to tweak the code to bust the uop cache in fewer places. OTOH, switching between uop cache and legacy decoders might be worse than just always running from the decoders.

    Also IDK if such a big unroll is really helpful. It probably varies a lot between SnB and Skylake, since microcoded instructions are very different for the pipeline, and SKL might not even bottleneck on PCLMUL throughput.

    This code was written for Westmere, as the PDF itself claims. I happen to think that it was probably the PhD student who wrote it, and that he did not know precisely what he was doing. Evidence: 1. Random use of pxor/xorps instead of pxor/pxor. 2. No use of mov[au]ps for memory loads. 3. Awful regalloc, esp. of xmm10, increasing nearly all insn sizes by 1 byte. 4. pclmulqdq takes 18 uops on Westmere, best-case throughput is 1 every 8c, and encodes w/ prefixes to 7 bytes, so micro-optimizations like alignment and port scheduling are very premature. pxor/xorps here is cargo cult. – Iwillnotexist Idonotexist Oct 6, 2016 at 1:22

    @IwillnotexistIdonotexist: hrm, yeah, I noticed some questionable register allocation too. But the code's copyright is 2011-2015, while the PDF was published in 2009. The PDF makes no reference to this implementation of the code, which is why it's plausible that this was written with a later CPU in mind, esp. since we're only seeing the 2015 version, not even the 2011 version. But yes, it doesn't look like good code. It's still possible that this somehow helps on some CPU, but I think your theory is probably more likely, and it's just crap. – Peter Cordes Oct 6, 2016 at 1:34

    I basically wrote this answer as a thinking-out-loud brain-dump while entertaining the possibility that this wasn't just stupid. – Peter Cordes Oct 6, 2016 at 1:34

    @IwillnotexistIdonotexist: My question-feed is mostly just assembly / x86 / sse / avx / computer-architecture / lock-free / stdatomic, so I see them all and answer the good ones. There are lots of people writing good C / C++ answers, and I'd never be able to keep up with the volume of questions in those tags. (Plus, when I do look at micro-optimization C/C++ questions, writing up an answer usually takes a really long time, comparing the asm for all the random good and bad ideas from other answers. So I limit myself to a question feed I can keep up with, because I can't not look at things.) – Peter Cordes Oct 6, 2016 at 1:46

    Although I am starting to train myself to just move on from the really boring newbie asm questions asking a minor variation on the same question for the hundredth time, and the stupid walls of 16-bit DOS code with 5 different bugs. It boggles my mind why people think it's ok to bother others with their problems when they haven't even used a debugger. – Peter Cordes Oct 6, 2016 at 1:53
