TestPMD

However, when netperf is used to measure the peak performance of a virtual machine, the kernel protocol stack costs a large share of the network performance. In that case DPDK's testpmd can be used to take the guest kernel protocol stack out of the picture and obtain the instance's real network performance.
Build references:
https://blog.csdn.net/qq_15437629/article/details/78146823

http://core.dpdk.org/doc/quick-start/
Usage reference:
https://blog.csdn.net/qq_15437629/article/details/86417895
Performance best practices:
https://cloud.tencent.com/document/product/213/56300
Sender (txonly mode):
./x86_64-native-linuxapp-gcc/build/app/test-pmd/testpmd -w  0000:04:02.0 -d ./x86_64-native-linuxapp-gcc/lib/librte_pmd_virtio.so.1.1  -- --txd=128 --rxd=128 --txq=32 --rxq=32 --nb-cores=16 --forward-mode=txonly --txpkts=64  --eth-peer=0,fa:16:3e:01:01:40  -i
Receiver (rxonly mode):
./x86_64-native-linuxapp-gcc/build/app/test-pmd/testpmd -w  0000:04:02.0 -d ./x86_64-native-linuxapp-gcc/lib/librte_pmd_virtio.so.1.1  -- --txd=128 --rxd=128 --txq=32 --rxq=32 --nb-cores=16 --forward-mode=rxonly   -i
========================================= 
ftrace
 
Used to check whether a CPU is being preempted:
(1) ftrace:
 echo 1 > /sys/kernel/debug/tracing/events/sched/enable
 cat /sys/kernel/debug/tracing/per_cpu/cpu1/trace |grep switch
(2) perf sched:
 perf sched record -C 2-3 sleep 5    (-C selects the CPUs to record)
 perf sched latency --sort max
 perf sched script
========================================= 
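I. Analyzing CPU usage with perf top

1) System-wide CPU analysis: perf top
2) CPU usage of a specific process: perf top -p pid
perf top shows the hot functions with the highest overhead. For a more detailed call analysis, use perf record.
PS:
 echo l >/proc/sysrq-trigger prints the call stack of every active core to dmesg.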
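II. Analyzing function calls with perf record

1. Collect data
// Sample a given process with a chosen duration and frequency:
perf record -g -F 99 -p "pid" -- sleep 60 // sample for 60 s at 99 samples per second
// Record detailed call stacks for a given thread
perf record --call-graph dwarf -o perf.data -t <thread_id> -- sleep 60
// View the collected data and find the hot functions with the highest overhead
perf report
2. If the text report is hard to read, render the data as a flame graph
1) perf script -i perf.data > perf.unfold // parse the recorded data
2) ./stackcollapse-perf.pl perf.unfold > perf.folded // fold the stacks with the FlameGraph tools
3) ./flamegraph.pl perf.folded > perf.svg // generate the SVG
Or as a single command:
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
Getting the tools: they come from the FlameGraph project:
 git clone https://github.com/brendangregg/FlameGraph.git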
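PS: the callchain reported by perf is sometimes wrong. A short note on the cause and the fix:
A callchain is the call path of a function, often also called a call trace. Many people who use perf to look at the call path of a hot function find that the callchain is a jumble of addresses, or simply incorrect.
First, how perf obtains a callchain: when a callchain is requested, the kernel saves the return addresses of the functions on the kernel stack and the user stack at sampling time. Reading the return addresses and walking the whole stack can be done through the frame (base) pointer, which is normally kept in the EBP register (RBP on x86_64). The kernel reads the frame pointer from that register.
However, when a program is compiled with optimization (-O1 and above), GCC omits the frame pointer and uses EBP as a general-purpose register. The value read from EBP is then no longer a frame pointer, and the callchain that perf and the kernel derive from it is wrong.
To solve this, add the compiler option "-fno-omit-frame-pointer" when building the debug version of the application. It makes GCC keep EBP as the frame pointer even when optimizing, and only then is the frame-pointer-based callchain correct.
The speedup gained from "-fomit-frame-pointer" is not quantified here. The current guess is that it is not very large: it reduces the footprint of the binary somewhat and brings a modest performance gain.
Recent kernels and perf versions can also obtain the callchain through libunwind. With libunwind support, the callchain of an application can be obtained without relying on EBP, by running perf as follows (older perf versions spelled these as "perf top -G dwarf" and "perf record -g dwarf"):
# sudo perf top --call-graph dwarf
# sudo perf record --call-graph dwarf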
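III. Analyzing cache misses with perf stat

1. What is a cache miss
The cache hit rate is a key CPU performance metric. A CPU has several levels of cache, and each level is faster to access than the level behind it. When the CPU needs a piece of data or an instruction, it first looks in the nearest cache (L1). If the data is there, it is a cache hit; otherwise it is a cache miss and the next level has to be queried. The last level is called the LLC (Last Level Cache); behind the LLC is main memory.
The miss ratio has a large effect on CPU performance, and misses in the last-level cache hurt the most. The damage comes in two forms.
The first is straightforward: the CPU is slowed down. Memory access latency is many times the LLC latency (for example, five times), so the effect of an LLC miss on computation speed is easy to imagine.
The second is less obvious and concerns memory bandwidth. If the LLC misses, the data has to be fetched from memory. The LLC miss count is in effect a count of memory accesses, because CPU accesses to memory always go through the LLC and never skip it. Every LLC miss causes a memory access, and conversely every memory access is caused by an LLC miss.
More importantly, the memory bandwidth of a system is limited and can easily become the bottleneck. Fetching data from memory consumes memory bandwidth, so a high LLC miss count means heavy bandwidth usage, and when bandwidth utilization is high, memory access latency rises sharply. Worse, backend systems have had to process more and more data in recent years, so memory bandwidth is increasingly the bottleneck.
To deal with a high cache miss ratio, we first need to measure how serious it is. On Linux this can be done with perf. How does perf work?
Internally it uses the Performance Monitoring Unit (PMU) hardware to collect data on CPU hardware events (such as cache accesses and cache misses) without adding much overhead to the system. Note that the PMU is implemented specifically for each processor, so the set of supported events and their exact semantics can differ between processors. The command format for measuring counters with perf is:
perf stat -e task-clock -e cycles -e context-switches -e migrations -e L1-dcache-loads,L1-dcache-misses,LLC-loads,LLC-load-misses -p pid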
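▲ How to read the perf stat output
▪ task-clock
CPU time spent executing the program, in ms. "CPUs utilized" in the second column is the CPU utilization of the process while perf was running; it equals task-clock divided by the "time elapsed" on the last line, divided by 1000.
▪ context-switches
Number of context switches that occurred while the program was running. Frequent context switches should be avoided.
▪ cpu-migrations
Number of CPU migrations during the run, that is, how often the scheduler moved the program from one CPU to another.
▪ page-faults
Page faults. On a memory access the MMU looks up the mapping from the virtual address in the process address space to a physical page. If no mapping is found, a page fault is raised and the kernel's fault handler establishes the mapping, allocating a page or reading it in from storage if necessary.
▪ cycles
Processor clock cycles. One machine instruction may take several cycles.
▪ cache-references
Number of cache accesses (on most platforms, accesses to the last-level cache).
▪ cache-misses
Number of cache misses.
▪ L1-dcache-load-misses
Number of L1 data cache loads that missed.
▪ L1-dcache-loads
Number of L1 data cache loads.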
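The counters listed above reach user space through the Linux perf_event_open(2) system call, which is also what the perf tool is built on. The sketch below follows the pattern from the perf_event_open man page to count cache misses around a piece of work. It assumes Linux kernel headers are installed and that the caller is allowed to open hardware counters (this depends on /proc/sys/kernel/perf_event_paranoid and on whether a PMU is exposed, which is often not the case inside VMs or containers). The names count_cache_misses, touch_demo_buf and demo_buf are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count hardware cache misses of the calling thread while work(arg) runs.
 * Returns the count, or -1 if the counter cannot be opened or read. */
long long count_cache_misses(void (*work)(void *), void *arg)
{
    struct perf_event_attr attr;
    uint64_t count = 0;
    int fd;

    assert(work != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* user-space events only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: measure this thread on whichever CPU it runs */
    fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long long)count;
}

/* Hypothetical workload: touch one byte in every 64-byte line of a 1 MiB buffer. */
static volatile unsigned char demo_buf[1 << 20];

void touch_demo_buf(void *unused)
{
    (void)unused;
    for (size_t i = 0; i < sizeof(demo_buf); i += 64)
        demo_buf[i]++;
}
```

count_cache_misses(touch_demo_buf, NULL) returns -1 when counters are unavailable, so callers should treat a negative result as "not measurable here" rather than as zero misses.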
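2. How to reduce cache misses
The first approach, and the most direct one, is to shrink data structures so that the data becomes compact.
The reasoning is simple: on a given system the size of every cache, including the last-level cache, is fixed. If each item is smaller, every cache level holds more items, and the hit rate goes up.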
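As a small illustration of "making data compact", the sketch below reorders the fields of a record so that the compiler inserts less padding. The struct names and fields are hypothetical, and the 64-byte line size is an assumption. On the usual x86_64 ABI the loose layout takes 24 bytes and the compact one 16, so a cache line holds four records instead of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Field order forces padding after each 1-byte member. */
struct record_loose {
    uint8_t  flag;
    uint64_t key;
    uint8_t  state;
};

/* Same fields, largest first: the two small members share one padded tail. */
struct record_compact {
    uint64_t key;
    uint8_t  flag;
    uint8_t  state;
};

/* Number of whole records of the given size that fit in one cache line. */
size_t records_per_line(size_t record_size)
{
    assert(record_size > 0);
    return CACHE_LINE / record_size;
}
```

Comparing records_per_line(sizeof(struct record_loose)) with records_per_line(sizeof(struct record_compact)) shows how many more records each cache line can hold after the reordering.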
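The second approach is to prefetch data in software.
By predicting sensibly, data that will probably be read later is fetched ahead of time and placed in the cache, which lowers the miss ratio. Software prefetching is in theory a "trade space for time" strategy, because the price paid is cache space. The prediction can, of course, turn out to be wrong.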
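A minimal sketch of software prefetching, assuming GCC or Clang, which provide the __builtin_prefetch builtin. The access pattern is an indirect gather (data[idx[i]]), which hardware prefetchers tend to predict poorly. The function name and the look-ahead distance of 8 elements are assumptions for illustration; a useful distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8   /* assumed look-ahead in elements; tune per workload */

/* Sum data[idx[i]] for i in [0, n), asking the CPU to start loading the
 * element needed PREFETCH_DISTANCE iterations from now. */
long long gather_sum_prefetch(const long long *data, const size_t *idx, size_t n)
{
    long long total = 0;

    assert(n == 0 || (data != NULL && idx != NULL));
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* second argument 0 = prefetch for read, third = temporal locality hint */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        total += data[idx[i]];
    }
    return total;
}
```

The prefetch is only a hint: the result is identical with or without it, and whether it helps should be checked with the perf stat counters described above.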
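The third approach targets one specific problem: false sharing.
It is also a "trade space for time" strategy: each data structure is made larger, sacrificing a little storage, to eliminate false sharing.
What is false sharing?
Caches store data in units of cache lines, most commonly 64 bytes. To keep the caches coherent with memory, a modern CPU has to track every core's modifications to the memory locations that its cache holds. If the caches of different cores hold the same memory location, modifications to that location must be carried out one after another, in order, to keep memory consistent.
For example, when thread 0 modifies part of a cache line, even a single byte, cache coherence requires the whole 64-byte line on that core to be written back, which invalidates the corresponding line on the other cores. Those cores then have to fetch the latest version of the line again, which stalls the other threads (say thread 1) for a comparatively long time.
This problem is false sharing. It is called "false" because, judging from the program code alone, the threads do not conflict and appear to share memory perfectly, so nothing looks wrong. The conflict is not intended by the program; it arises from the block-wise cache access and the cache coherence mechanism underneath.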
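A concrete multi-threaded cache tuning example makes this easier to understand.
Single-threaded program:
//sig.c
#include <stdio.h>
long long s=0;
void sum(long long num);
int main() {
        sum(2000000000);
        printf("sum is %lld\n", s);
        return 0;
}
// accumulate 0 .. num-1 into the global s
void sum(long long num){
        for(long long i=0; i<num; i++)
                s+=i;
}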
Untuned multi-threaded program:
//mul_raw.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>
#include <pthread.h>
void* one(void*);
void* two(void*);
// sum and sum1 are adjacent globals and will normally share one cache line
long long sum,sum1;
int main(){
        pthread_t id1, id2;
        pthread_create(&id1, NULL, one, NULL);
        pthread_create(&id2, NULL, two, NULL);
        pthread_join(id2, NULL);
        pthread_join(id1, NULL);
        sum+=sum1;
        printf("sum is %lld\n", sum);
        return 0;
}
// thread 1: accumulate the first half into sum
void *one(void *arg){
        long long i;
        for(i=0; i<1000000000; i++)
                sum+=i;
        return NULL;
}
// thread 2: accumulate the second half into sum1
void *two(void *arg){
        long long i;
        for(i=1000000000; i<2000000000; i++)
                sum1+=i;
        return NULL;
}
Compile and run:
#gcc sig.c -o sig
#gcc mul_raw.c -o mul_raw -lpthread
# time ./sig
sum is 1999999999000000000
real    0m6.993s
user    0m6.988s
sys     0m0.001s
# time ./mul_raw
sum is 1999999999000000000
real    0m10.037s
user    0m18.681s
sys     0m0.000s
This is odd: we added a thread, yet the run takes longer than the single-threaded version. Why?
Take a look with perf:
# perf stat -e task-clock -e cycles -e context-switches -e migrations -e L1-dcache-loads,L1-dcache-misses,LLC-loads,LLC-load-misses ./sig
sum is 1999999999000000000
 Performance counter stats for './sig':
       6791.176387      task-clock (msec)         #    1.000 CPUs utilized
    15,476,794,037      cycles                    #    2.279 GHz                      (80.00%)
                 8      context-switches          #    0.001 K/sec
                 0      migrations                #    0.000 K/sec
    10,006,544,037      L1-dcache-loads           # 1473.463 M/sec                    (80.00%)
           473,734      L1-dcache-misses          #    0.00% of all L1-dcache hits    (40.01%)
            73,321      LLC-loads                 #    0.011 M/sec                    (39.99%)
            18,642      LLC-load-misses           #   25.43% of all LL-cache hits     (60.01%)
       6.791355338 seconds time elapsed
 # perf stat -e task-clock -e cycles -e context-switches -e migrations -e L1-dcache-loads,L1-dcache-misses,LLC-loads,LLC-load-misses ./mul_raw
sum is 1999999999000000000
 Performance counter stats for './mul_raw':
      17225.793886      task-clock (msec)         #    1.899 CPUs utilized
    39,265,466,829      cycles                    #    2.279 GHz                      (80.00%)
                15      context-switches          #    0.001 K/sec
                 3      migrations                #    0.000 K/sec
     8,020,648,466      L1-dcache-loads           #  465.619 M/sec                    (80.00%)
        98,864,094      L1-dcache-misses          #    1.23% of all L1-dcache hits    (40.01%)
        21,028,582      LLC-loads                 #    1.221 M/sec                    (40.00%)
         6,941,667      LLC-load-misses           #   33.01% of all LL-cache hits     (60.00%)
       9.069511808 seconds time elapsed
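The derived columns in these outputs can be reproduced from the raw counters, which is a useful sanity check when reading perf stat. The sketch below uses the formulas stated earlier (task-clock divided by elapsed time divided by 1000) and the plain ratio of misses to loads; note that perf labels the ratio "of all L1-dcache hits" even though the numbers match misses divided by loads. The function names are hypothetical.

```c
#include <assert.h>

/* "CPUs utilized" = task-clock (ms) / elapsed time (s) / 1000 */
double cpus_utilized(double task_clock_ms, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return task_clock_ms / elapsed_s / 1000.0;
}

/* Miss ratio in percent = misses / accesses * 100 */
double miss_percent(double misses, double accesses)
{
    assert(accesses > 0.0);
    return misses / accesses * 100.0;
}
```

With the mul_raw numbers above, cpus_utilized(17225.793886, 9.069511808) is about 1.899, miss_percent(98864094, 8020648466) is about 1.23, and miss_percent(6941667, 21028582) is about 33.01, matching the percentages perf printed.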
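In both runs nearly all data accesses are L1-dcache loads, yet the multi-threaded program has far more L1 cache misses than the single-threaded one (98,864,094 vs 473,734) and clearly more cycles. The cause is false sharing:
First, top -H with the "Last used cpu" column enabled shows that the system keeps scheduling the two threads onto two different cores, so the threads do not share an L1 cache. Logical CPUs on the same core do share the L1 cache; for this part of the NUMA background see:
https://blog.csdn.net/qq_15437629/article/details/77822040
Because sum and sum1 are adjacent in memory, when thread 1 updates sum and keeps it in its L1 cache (with a write-back policy it is not written to memory immediately), that cache line becomes invalid in every other cache. Thread 2's L1 cache then has to synchronize with thread 1's cache, which wastes a large number of cycles, and it has to do so on almost every iteration. The cache miss count rises sharply, and so does the run time.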
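How can this be avoided? There are two solutions, one for each of the two causes.
Method 1: separate the two variables so that they are not in the same cache line. A crude way is to change sum into sum[8]: eight long long values occupy 64 bytes, so sum[0] and sum1[0] can no longer fall in the same 64-byte cache line. This is essentially what is usually called cache-line alignment (padding). It does not depend on the hardware topology or on kernel scheduling, so it is fairly portable.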
//mul_cacheline.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>
#include <pthread.h>
void* one(void*);
void* two(void*);
// each array is 64 bytes, so sum[0] and sum1[0] are at least one cache line apart
long long sum[8],sum1[8];
int main(){
        pthread_t id1, id2;
        pthread_create(&id1, NULL, one, NULL);
        pthread_create(&id2, NULL, two, NULL);
        pthread_join(id2, NULL);
        pthread_join(id1, NULL);
        sum[0]+=sum1[0];
        printf("sum is %lld\n", sum[0]);
        return 0;
}
void *one(void *arg){
        for(long long i=0; i<1000000000; i++)
                sum[0]+=i;
        return NULL;
}
void *two(void *arg){
        for(long long i=1000000000; i<2000000000; i++)
                sum1[0]+=i;
        return NULL;
}
Compile and run:
# gcc mul_cacheline.c -o  mul -lpthread
linux-zvpurp:/Images/zlk/test # time ./mul
sum is 1999999999000000000
real    0m3.211s
user    0m6.289s
sys     0m0.001s
linux-zvpurp:/Images/zlk/test # perf stat -e task-clock -e cycles -e context-switches -e migrations -e L1-dcache-loads,L1-dcache-misses,LLC-loads,LLC-load-misses ./mul
sum is 1999999999000000000
 Performance counter stats for './mul':
       6523.654091      task-clock (msec)         #    1.934 CPUs utilized
    14,866,840,150      cycles                    #    2.279 GHz                      (79.35%)
                44      context-switches          #    0.007 K/sec
                 4      migrations                #    0.001 K/sec
     8,038,748,997      L1-dcache-loads           # 1232.246 M/sec                    (78.70%)
           512,004      L1-dcache-misses          #    0.01% of all L1-dcache hits    (40.57%)
            81,744      LLC-loads                 #    0.013 M/sec                    (40.67%)
            13,354      LLC-load-misses           #   16.34% of all LL-cache hits     (59.56%)
       3.373951529 seconds time elapsed
This roughly reaches the goal of half the single-threaded run time (3.2 s vs 7.0 s). Both cache misses and cycles have dropped.
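The same separation can be expressed more explicitly with C11 alignment instead of the sum[8] trick. The sketch below gives each thread its own cache-line-sized slot. The names padded_sum, sums and accumulate are hypothetical, the 64-byte line size is an assumption, and threads are left out so the layout stays the focus.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* One accumulator per thread; the alignment pads every slot to a full line. */
struct padded_sum {
    _Alignas(CACHE_LINE) long long value;
};

/* sums[0] would belong to thread one, sums[1] to thread two. */
struct padded_sum sums[2];

/* Accumulate the integers in [begin, end) into slot, as a thread body would. */
void accumulate(struct padded_sum *slot, long long begin, long long end)
{
    assert(slot != NULL && begin <= end);
    for (long long i = begin; i < end; i++)
        slot->value += i;
}
```

Because sizeof(struct padded_sum) equals the line size, neighbouring array elements never share a line, which documents the intent better than an oversized array does.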
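Method 2: bind both threads to the same core. The two logical CPUs of one core share the L1 cache, so the cache line no longer has to be synchronized between cores. In my environment cpu0 and cpu36 belong to the same core, and the code is changed as follows: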
// compile with -D_GNU_SOURCE (needed for cpu_set_t and the CPU_* macros)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>
#include <pthread.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
void* one(void*);
void* two(void*);
long long sum,sum1;
int main(){
        pthread_t id1, id2;
        pthread_create(&id1, NULL, one, NULL);
        pthread_create(&id2, NULL, two, NULL);
        pthread_join(id2, NULL);
        pthread_join(id1, NULL);
        sum+=sum1;
        printf("sum is %lld\n", sum);
        return 0;
}
void *one(void *arg){
        long long i;
        cpu_set_t mask;
        CPU_ZERO(&mask);    // clear the set
        CPU_SET(0,&mask);   // pin to cpu0
        // pid 0 means the calling thread
        if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
            printf("set CPU affinity failure, ERROR:%s\n", strerror(errno));
        }
        for(i=0; i<1000000000; i++)
                sum+=i;
        return NULL;
}
void *two(void *arg){
        long long i;
        cpu_set_t mask;
        CPU_ZERO(&mask);    // clear the set
        CPU_SET(36,&mask);  // pin to cpu36, the sibling of cpu0 on this machine
        if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
            printf("set CPU affinity failure, ERROR:%s\n", strerror(errno));
        }
        for(i=1000000000; i<2000000000; i++)
                sum1+=i;
        return NULL;
}
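Compile with -D_GNU_SOURCE. In practice the gain is limited (possibly because both threads compete for the resources of a single core), and the method has to be tuned for each machine, so its portability is poor.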
# time ./mul
sum is 1999999999000000000
real    0m5.172s
user    0m10.239s
sys     0m0.000s
# perf stat -e task-clock -e cycles -e context-switches -e migrations -e L1-dcache-loads,L1-dcache-misses,LLC-loads,LLC-load-misses ./mul
sum is 1999999999000000000
 Performance counter stats for './mul':
      10333.513617      task-clock (msec)         #    1.982 CPUs utilized
    23,481,125,107      cycles                    #    2.272 GHz                      (79.95%)
                23      context-switches          #    0.002 K/sec
                 4      migrations                #    0.000 K/sec
     8,016,824,860      L1-dcache-loads           #  775.808 M/sec                    (59.43%)
         1,168,405      L1-dcache-misses          #    0.01% of all L1-dcache hits    (79.05%)
           117,485      LLC-loads                 #    0.011 M/sec                    (41.07%)
            36,319      LLC-load-misses           #   30.91% of all LL-cache hits     (59.99%)
       5.213851777 seconds time elapsed
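Method 2 depends on knowing which logical CPUs are siblings on one core. On Linux this is exposed in sysfs, and the sketch below reads it for one CPU. The function name read_thread_siblings is hypothetical. The sysfs file may be missing in some containers or on non-Linux systems, in which case the function reports failure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the hyper-thread siblings of logical CPU `cpu` into out as text,
 * in the kernel's list format (for example "0,36").
 * Returns 0 on success, -1 if the sysfs file is unavailable. */
int read_thread_siblings(int cpu, char *out, size_t out_len)
{
    char path[128];
    FILE *fp;

    assert(cpu >= 0 && out != NULL && out_len > 0);
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(out, (int)out_len, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    out[strcspn(out, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

On the machine described above, read_thread_siblings(0, buf, sizeof(buf)) would presumably yield a list containing 0 and 36; the CPU numbers passed to CPU_SET should be taken from this list rather than hard-coded.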
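IV. Analyzing CPU interruptions with perf sched

When a PMD (poll mode driver) thread polls on a dedicated CPU and performance jitter appears, perf sched can show whether that CPU is being interrupted, which helps judge whether CPU isolation at the infrastructure (I) layer is inadequate:
perf sched record -C 1
perf sched latency --sort max
perf sched script |grep switch
perf sched timehist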
References:
 https://www.cnblogs.com/ting152/p/13522669.html
 https://blog.csdn.net/wujianyongw4/article/details/100177974
 https://cloud.tencent.com/developer/article/1376653
 