C++: CPU measurements (cache misses/hits) that do not make sense

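I am using Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure cache efficiency.

Basically, I first bring a small array into the L1 cache (by traversing it many times), then I start the timer, traverse the array one more time (which hopefully hits the cache), and then stop the timer.

PCM shows that my L2 and L3 miss ratios are rather high. I also checked with rdtscp, and each array operation takes 15 cycles (much higher than the 4-5 cycles of an L1 cache access).

I would expect the array to sit entirely in the L1 cache, so there should be no high L1, L2 or L3 miss ratios.

My system has 32K, 256K and 25M of L1, L2 and L3 cache respectively. Here is my code: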

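#include <cstdlib>
#include <iostream>
#include "cpucounters.h" // Intel PCM header

using namespace std;

static const int ARRAY_SIZE = 16;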

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM * m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success){
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++){
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;

    SystemCounterState before = getSystemCounterState();

    while ((current = current->next) != NULL) {
        sum += current->pad;
    }

    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses:     " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits:       " << getL2CacheHits(before, after) << endl; 
    cout << "L2 hit ratio:  " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses:     " << getL3CacheMisses(before, after) << endl;
    cout << "L3 Hits:       " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio:  " << getL3CacheHitRatio(before, after) << endl;

    cout << "Sum:   " << sum << endl;
    m->cleanup();
    return 0;
}
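For scale, here is a small sketch of the working-set arithmetic behind the expectation that the array fits in L1. The 64-byte cache line size is an assumption (it is not stated above), and `Node`, `workingSetBytes`, `cacheLinesNeeded` and `fitsInL1` are illustrative names, not part of PCM.

```cpp
#include <cassert>
#include <cstddef>

// Same layout as MyStruct in the code above: one pointer plus one long.
struct Node {
    Node *next;
    long pad;
};

// Assumptions: 64-byte cache lines (typical, not confirmed here) and the 32K L1 quoted above.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kL1Bytes = 32 * 1024;

// Total bytes occupied by `count` contiguous nodes.
constexpr std::size_t workingSetBytes(std::size_t count) {
    return count * sizeof(Node);
}

// Minimum number of cache lines needed for `bytes`, assuming a line-aligned start.
constexpr std::size_t cacheLinesNeeded(std::size_t bytes) {
    return (bytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// True if `count` nodes fit in the assumed L1 data cache.
inline bool fitsInL1(std::size_t count) {
    assert(count > 0);
    return workingSetBytes(count) <= kL1Bytes;
}
```

When `sizeof(Node)` is 16, as the comment on `MyStruct` states, `workingSetBytes(16)` is 256 bytes, which is four 64-byte lines (possibly five if the allocation is not line-aligned), so capacity alone should not cause misses.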

Edit: I also checked the code below and still got the same miss ratios (I would have expected them to be almost zero):

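Edit 2: As someone commented, these results may be due to the overhead of the profiler itself. So instead of traversing the array only once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler's overhead. I still get very low L2 and L3 cache hit ratios (about 15%).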
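The amortization idea can be sketched without PCM at all. The version below is only an illustration: it uses `std::chrono::steady_clock` rather than hardware counters, so it reports average time per node visit instead of cache misses, and all names (`Node`, `buildList`, `traverseOnce`, `timePasses`) are hypothetical. Unlike the loop in the code above, it also visits the first node.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct Node {
    Node *next;
    long pad;
};

// Link the nodes into a sequential list; pad holds the node index.
inline void buildList(std::vector<Node> &nodes) {
    assert(!nodes.empty());
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        nodes[i].next = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
        nodes[i].pad = static_cast<long>(i);
    }
}

// Walk the whole list once (first node included) and return the sum of pad.
inline long traverseOnce(const Node *head) {
    long sum = 0;
    for (const Node *cur = head; cur != nullptr; cur = cur->next)
        sum += cur->pad;
    return sum;
}

struct Timing {
    long checksum;    // sum over all passes, returned so the work has a visible result
    double nsPerNode; // average wall-clock nanoseconds per node visit
};

// Time `passes` full traversals so that the cost of the two clock reads
// is spread over passes * nodes.size() node visits.
inline Timing timePasses(const std::vector<Node> &nodes, std::size_t passes) {
    assert(!nodes.empty() && passes > 0);
    // Reading the head through a volatile slot keeps the compiler from
    // hoisting the traversal out of the loop.
    const Node *volatile headSlot = nodes.data();
    long checksum = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t p = 0; p < passes; ++p)
        checksum += traverseOnce(headSlot);
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    double visits = static_cast<double>(passes) * static_cast<double>(nodes.size());
    return Timing{checksum, ns / visits};
}
```

Calling `buildList` on a 16-element vector and then `timePasses(nodes, passes)` with a large `passes` gives a per-node average in which the fixed measurement cost becomes negligible. The same structure applies when the two clock reads are replaced by two counter snapshots.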

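It looks like you are getting L2 and L3 misses from all cores on your system.

I looked at the PCM implementation:

[1] In the implementation of PCM::program() at line 1407, I don't see any code that restricts the events to a specific process.

[2] In the implementation of PCM::getSystemCounterState() at line 2809, you can see that the events are collected from all cores on the system. So I would try setting the CPU affinity of the process to a single core and then reading events from that core only, using this function:

CoreCounterState getCoreCounterState(uint32 core)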
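A minimal, Linux-only sketch of the pinning step, using `sched_getaffinity` and `sched_setaffinity` from glibc. The helper names `firstAllowedCpu` and `pinToCpu` are illustrative, and the sketch assumes that the OS CPU number is the same index PCM expects as `core`, which I have not verified.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // needed for the CPU_* macros and sched_setaffinity on glibc
#endif
#include <sched.h>
#include <cassert>

// Returns the lowest-numbered CPU the calling thread may run on, or -1 on failure.
inline int firstAllowedCpu() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}

// Restricts the calling thread (pid 0 means "caller") to a single CPU.
// Returns true on success.
inline bool pinToCpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

After `pinToCpu(cpu)` succeeds, the measured loop would be bracketed by two `getCoreCounterState(cpu)` snapshots instead of `getSystemCounterState()`, so that activity on other cores is not counted. Misses in the shared L3 caused by other cores can still affect timing, but they would no longer be attributed to this core's counters.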

Your experiment (the measured while loop) has only 16 iterations. The overhead and perturbation of the getSystemCounterState function probably dominate the measurement. I suggest comparing the L2/LLC misses and hits against the L1 hit counter. You will probably find that you have very few L1 hits compared with the 50K L2 misses.

That is correct: by default, PCM collects information from all cores on the system.

Program output from the question:

Instructions per clock: 0.408456
Cycles per op:        553074
L2 Cache Misses:      58775
L2 Cache Hits:        11371
L2 cache hit ratio:   0.162105
L3 Cache Misses:      24164
L3 Cache Hits:        34611
L3 cache hit ratio:   0.588873

Code referred to in the first edit of the question:

SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();