c - Why does glibc memcpy not choose the AVX-512 version?

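The following program calls memcpy:
#include <string.h>

#define S 1024  /* assumed value: the post does not show the definition of S */

int a[S], b[S], c[S];

/* target_clones makes GCC emit one clone per listed target plus a resolver */
__attribute__((target_clones("avx512f", "avx2", "arch=atom", "default")))
void foo(int argc)
{
    int i, x;
    for (x = 0; x < 1024; x++) {
        for (i = 0; i < S; i++) {
            a[i] = b[i] + c[i];
        }
    }
    b[0] = argc;
    memcpy(&a[0], &b[0], argc * sizeof(int));
}

int main(int argc, char **argv)
{
    foo(argc);
    return 0;
}
From the readelf output, we can see that it calls glibc's memcpy: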
#readelf -r a.out 
Relocation section '.rela.dyn' at offset 0x418 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000403ff8  000200000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
Relocation section '.rela.plt' at offset 0x430 contains 4 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000404018  000100000007 R_X86_64_JUMP_SLO 0000000000000000 __libc_start_main@GLIBC_2.2.5 + 0
000000404020  000200000007 R_X86_64_JUMP_SLO 0000000000000000 __gmon_start__ + 0
000000404028  000300000007 R_X86_64_JUMP_SLO 0000000000000000 memcpy@GLIBC_2.14 + 0
000000404030  000000000025 R_X86_64_IRELATIV                    4018f0
Then I used gdb to trace which glibc implementation is used:
(gdb) b memcpy@plt    
Breakpoint 1 at 0x401050
(gdb) s
The program is not being run.
(gdb) r
Starting program: /root/a.out 
Breakpoint 1, 0x0000000000401050 in memcpy@plt ()
(gdb) s
Single stepping until exit from function memcpy@plt,
which has no line number information.
0x00007ffff7b623a0 in __memcpy_ssse3_back () from /lib64/libc.so.6
(gdb) info function __memcpy_*
All functions matching regular expression "__memcpy_*":
Non-debugging symbols:
0x00007ffff7aa2840  __memcpy_chk_sse2
0x00007ffff7aa2850  __memcpy_sse2
0x00007ffff7ab1b40  __memcpy_chk_avx512_no_vzeroupper
0x00007ffff7ab1b50  __memcpy_avx512_no_vzeroupper
0x00007ffff7b23360  __memcpy_chk
0x00007ffff7b5a470  __memcpy_chk_ssse3
0x00007ffff7b5a480  __memcpy_ssse3
0x00007ffff7b62390  __memcpy_chk_ssse3_back
0x00007ffff7b623a0  __memcpy_ssse3_back
(gdb) 
There is a __memcpy_avx512_no_vzeroupper, but it was not chosen,
even though my CPU supports the feature:
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep
mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs
bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm
pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp
tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1
hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq
rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl
xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
cqm_mbm_local dtherm ida arat pln pts pku ospke flush_l1d
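The same flags can be checked from inside a program. A minimal sketch, assuming GCC or Clang on x86, using the compiler builtins `__builtin_cpu_init` and `__builtin_cpu_supports`; the helper name `widest_vector_isa` is made up for illustration. It only reports what the CPU offers and says nothing about which memcpy glibc will pick.

```c
#include <stddef.h>

/* Hypothetical helper: names the widest vector extension the running CPU
 * reports, checked from widest to narrowest. Assumes GCC or Clang. */
const char *widest_vector_isa(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();  /* populate the feature data used by the checks below */
    if (__builtin_cpu_supports("avx512f"))
        return "avx512f";
    if (__builtin_cpu_supports("avx2"))
        return "avx2";
    if (__builtin_cpu_supports("ssse3"))
        return "ssse3";
    return "baseline";     /* none of the extensions tested above */
#else
    return "non-x86";      /* the builtins above are x86-only */
#endif
}
```

Printing the result with `puts(widest_vector_isa());` from `main` is enough to compare against the flags listed above.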
gcc version:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/root/china-gcc-10.2.0/libexec/gcc/x86_64-pc-linux-gnu/10.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --prefix=/root/china-gcc-10.2.0 --disable-multilib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.0 (GCC)
                Glibc is free software. You are allowed to download its source code, study it, and improve it
– Basile Starynkevitch
                Apr 2, 2021 at 18:00
On "mainstream" CPUs like Skylake-X and IceLake, it's only worth using 512-bit vectors at all if you use them consistently for a lot of your program's run-time, not just for an occasional memcpy.  (And also if your program will run for a long time, otherwise you're slowing down other processes that share the same physical core via context switches and/or hyperthreading.)  See SIMD instructions lowering CPU frequency for the details: you don't want occasional calls to memcpy to hold your CPU frequency down to a lower max turbo.
Using AVX-512 features with 256-bit vectors (AVX-512VL) can be worth it for some things, e.g. if masking is nice, or if you use YMM16..31 to avoid VZEROUPPER.
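As an illustration of the masking point, here is a minimal sketch of copying up to 32 bytes with one masked 256-bit load and store (AVX-512BW + AVX-512VL intrinsics). It assumes GCC or Clang on x86-64; `copy_small` and `copy_upto32_masked` are hypothetical names, not glibc functions, and the sketch does not control which YMM registers the compiler allocates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* Copies n <= 32 bytes with a single masked load/store pair.
 * Masked-out bytes are neither read nor written. */
__attribute__((target("avx512vl,avx512bw")))
static void copy_upto32_masked(void *dst, const void *src, size_t n)
{
    /* One mask bit per byte: the low n bits are set. */
    __mmask32 k = (n >= 32) ? 0xFFFFFFFFu : (((uint32_t)1 << n) - 1);
    __m256i v = _mm256_maskz_loadu_epi8(k, src);
    _mm256_mask_storeu_epi8(dst, k, v);
}
#endif

/* Hypothetical entry point: uses the masked path for short copies when the
 * CPU reports AVX-512VL and AVX-512BW, and plain memcpy otherwise. */
void copy_small(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (n <= 32 &&
        __builtin_cpu_supports("avx512vl") &&
        __builtin_cpu_supports("avx512bw")) {
        copy_upto32_masked(dst, src, n);
        return;
    }
#endif
    memcpy(dst, src, n);
}
```

The single masked load/store replaces the usual branching on size for short copies, and it never touches a ZMM register.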
I'd guess that glibc would only resolve memcpy to __memcpy_avx512_no_vzeroupper on systems like Knights Landing (KNL) Xeon Phi, where the CPU is designed around AVX-512, and there's no downside to using 512-bit ZMM vectors.  There's no need for vzeroupper even after using ymm0..15 on KNL.  In fact vzeroupper is very slow on KNL, and definitely something to avoid, hence putting no_vzeroupper in the function name.
https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S.html is the source for that version.  It uses ZMM vectors, including ZMM0..15, so if used on a Skylake/IceLake CPU it should use vzeroupper.  This version looks designed for KNL.
There would be some tiny benefit to having an AVX-512VL version that used ymm16..31 to avoid vzeroupper (to speed 32 .. 64 byte copies), without ever using ZMM registers.
And it would make sense for __memcpy_avx512_no_vzeroupper to only use ZMM16..31 so avoiding vzeroupper isn't a problem on mainstream CPUs; then it would be a usable option in code that already made heavy use of AVX-512 (and thus was already paying the CPU-frequency cost.)
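The point that having AVX-512 available is not the same as wanting to use it can be sketched as a toy selector. This is not glibc's actual resolver code: the struct `cpu_traits`, its fields, and `pick_memcpy` are hypothetical, and the ordering is only my guess at the kind of preference involved. Only the returned names are real glibc symbols.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of a CPU; not a glibc data structure. */
struct cpu_traits {
    bool has_ssse3;
    bool has_avx;
    bool has_erms;     /* fast rep movsb */
    bool has_avx512f;
    bool prefer_zmm;   /* true only where 512-bit vectors have no downside, e.g. KNL */
};

/* Toy selection: a feature bit alone does not select the AVX-512 version;
 * the CPU must also be one where 512-bit vectors are preferred. */
const char *pick_memcpy(const struct cpu_traits *t)
{
    if (t->has_avx512f && t->prefer_zmm)
        return "__memcpy_avx512_no_vzeroupper";
    if (t->has_avx && t->has_erms)
        return "__memcpy_avx_unaligned_erms";
    if (t->has_ssse3)
        return "__memcpy_ssse3_back";
    return "__memcpy_sse2";
}
```

With traits describing a Skylake-X (AVX-512F present, `prefer_zmm` false, AVX and ERMS present) this sketch returns `__memcpy_avx_unaligned_erms`; with KNL-like traits (`prefer_zmm` true) it returns `__memcpy_avx512_no_vzeroupper`.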
                So is __memcpy_sse3 the best choice for my CPU? If the memory to copy is very large, does glibc use a higher-performance method like vmovntq? I hope you can clarify the strategy for choosing between the different implementations :)
– Chinaxing
                Apr 3, 2021 at 0:49
                @Chinaxing: I would have expected __memcpy_avx_unaligned_erms on your Skylake-X system, same as my Skylake desktop. (code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/…).  Perhaps your distro misconfigured glibc to omit an AVX1 / rep movsb version.  The ssse3 version avoids even 256-bit vectors.
– Peter Cordes
                Apr 3, 2021 at 1:06
                @Chinaxing: Unlikely; your glibc is new enough to have AVX512 after all!  AVX1 existed for several years before that, since about 2011.
– Peter Cordes
                Apr 3, 2021 at 1:35
                @PeterCordes: Very recently, because vzeroupper aborts transactions, they added a zmm16..31 version of nearly all string functions. Memmove commit
– Noah
                Apr 4, 2021 at 0:50