Linux: high kernel CPU usage on memory initialization

c linux kernel

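I have a problem with high CPU consumption by the Linux kernel while bootstrapping my Java applications on a server. The problem occurs only in production; on the dev servers everything is lightning fast.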

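upd9: There were two questions about this issue:

1. How to fix it? Nominal Animal suggested syncing and dropping all caches, and it really helps: sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches' works. upd12: In fact, sync alone is enough.

2. Why is this happening? This is still open for me. I understand that flushing dirty pages to disk consumes kernel CPU and I/O time; that is normal. What is strange is why even a single-threaded application written in C loads ALL cores to 100% in kernel space.

Based on upd10 and upd11, I believe echo 3 > /proc/sys/vm/drop_caches is not required to fix my problem with slow memory allocation. Running sync before starting the memory-hungry application should be enough. I will probably try this in production tomorrow and post the results here.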
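upd10: Lots of FS cache pages case:
1. I executed cat 10GB.file > /dev/null, then
2. ran sync to be sure there were no dirty pages (cat /proc/meminfo | grep ^Dirty displayed 184 kB).
3. Checking cat /proc/meminfo | grep ^Cached, I got 4 GB cached.
4. Running int main(char**), I got normal performance (about 50 ms to initialize 32 MB of allocated data).
5. Cached memory dropped to 900 MB.
Test summary: I think Linux has no problem reclaiming pages used as FS cache for newly allocated memory.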
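upd11: LOTS of dirty pages case:
1. I ran my HowMongoDdWorks example with the read part commented out, and after some time
2. /proc/meminfo said that 2.8 GB was Dirty and 3.6 GB was Cached.
3. I stopped HowMongoDdWorks and ran my int main(char**).
4. Here is part of the result:
init 15, time 0.00s
x [try 1/part 0] time 1.11s
x [try 2/part 0] time 0.04s
x [try 1/part 1] time 1.04s
x [try 2/part 1] time 0.05s
x [try 1/part 2] time 0.42s
x [try 2/part 2] time 0.04s
Test summary: lots of dirty pages significantly slow down the first access to allocated memory. (To be fair, this only starts to happen when the application's total memory becomes comparable to the whole OS memory: if 8 of 16 GB are free, allocating 1 GB is no problem; the slowdown starts at around 3 GB.)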
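Now I have managed to reproduce this situation in my dev environment, so here are some new details.
Dev machine configuration:
Linux 2.6.32-220.13.1.el6.x86_64 - Scientific Linux release 6.1 (Carbon)
RAM: 15.55 GB
CPU: 1 X Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz (4 threads) (physical)
I am 99.9% sure the problem is caused by a large number of dirty pages in the FS cache. Here is the application that creates lots of dirty pages: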
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/**
 * @author dmitry.mamonov
 *         Created: 10/2/12 2:53 PM
 */
public class HowMongoDdWorks{
    public static void main(String[] args) throws IOException {
        final long length = 10L*1024L*1024L*1024L;
        final int pageSize = 4*1024;
        final int lengthPages = (int) (length/pageSize);
        final byte[] buffer = new byte[pageSize];
        final Random random = new Random();
        System.out.println("Init file");
        final RandomAccessFile raf = new RandomAccessFile("random.file","rw");
        raf.setLength(length);
        int written = 0;
        int readed = 0;
        System.out.println("Test started");
        while(true){
            { //write.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.write(buffer);
                written++;
            }
            { //read.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.read(buffer);
                readed++;
            }
            if (written % 1024==0 || readed%1024==0){
                System.out.printf("W %10d R %10d pages\n", written, readed);
            }

        }
    }
}

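And here is the test application that causes HIGH (up to 100% on all cores) CPU load in kernel space (same as below, but I will copy it once again).
I am keeping everything below this line for historical reasons only.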

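upd1: Both the dev and production systems are big enough for this test.
upd7: It is not paging; at least, I have not seen any storage I/O activity while the problem occurs.
dev: ~4 cores, 16 GB RAM, ~8 GB free
production: ~12 cores, 24 GB RAM, ~16 GB free (8 to 10 GB is used by the FS cache, but that makes no difference: the results are the same even when all 16 GB are completely free). This machine also carries some CPU load, but not much, ~10%.
upd8 (ref): For a new test case and a potential explanation, see the end of the question.
Here is my test case (I also tested Java and Python, but C should be the clearest). Its code and the partial output from the dev and production machines appear further down.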
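While running this test on the dev machine, CPU usage barely gets off the ground: all cores stay below 5% in htop.
But when running it on the production machine, I see up to 100% CPU usage on ALL cores (on average the load rises to 50% on the 12-core machine), and it is all kernel time.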
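upd2: All machines have the same CentOS Linux 2.6 installed; I work with them over ssh.
upd3: A: It is unlikely to be swapping; I have not seen any disk activity during my test, and plenty of RAM is free. (Also, the description has been updated.)
upd4: htop shows HIGH CPU utilization by the kernel, up to 100% on all cores (on prod).
upd5: Does CPU utilization settle down after initialization completes? In my simple test, yes. For the real application, the only thing that helps is stopping everything else in order to start a new program (which is nonsense).
I have two questions:
1. Why is this happening?
2. How to fix it?
upd8: Improved test and explanation.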
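#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const int partition = 8;            /* number of parts each buffer is split into */
    clock_t last = clock();             /* clock() reports CPU time used by this process */
    for (int i = 0; i < 16; i++) {
        int size = 256 * 1024 * 1024;   /* 256 MB per iteration */
        int size4 = size / 4;           /* element count, assuming 4-byte int */
        /* Note: buffers are never freed, so the process grows to 16 x 256 MB = 4 GB. */
        int *buffer = malloc(size);
        if (buffer == NULL) {
            perror("malloc");
            return 1;
        }
        buffer[0] = 123;                /* first access: touches a single page only */
        printf("init %d, time %.2fs\n", i, (clock() - last) / (double)CLOCKS_PER_SEC);
        last = clock();
        for (int p = 0; p < partition; p++) {
            for (int k = 0; k < 2; k++) {   /* fill each part twice: first touch, then re-touch */
                for (int j = p * size4 / partition; j < (p + 1) * size4 / partition; j++) {
                    buffer[j] = k;
                }
                printf("x [try %d/part %d] time %.2fs\n", k + 1, p, (clock() - last) / (double)CLOCKS_PER_SEC);
                last = clock();
            }
        }
    }
    return 0;
}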

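Facts I learned from this test:
1. Memory allocation itself is fast.
2. The first access to allocated memory is fast (so it is not a lazy buffer allocation problem).
3. I split the allocated buffer into parts (8 in my test)
4. and fill each part with the value 0, then with the value 1, printing the time consumed.
5. The second fill of a buffer part is always fast.
6. BUT the first fill of a buffer part is always a bit slower than the second (I believe the kernel does some extra work on the first page access).
7. Sometimes it takes SIGNIFICANTLY longer to fill a buffer part the first time.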
The first test case:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t last = clock();                  /* remember the time */
    for (int i = 0; i < 16; i++) {           /* repeat the test several times */
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;                /* element count, assuming 4-byte int */
        int *buffer = malloc(size);          /* allocate 256 MB of memory (never freed) */
        if (buffer == NULL) {
            perror("malloc");
            return 1;
        }
        for (int k = 0; k < 2; k++) {        /* initialize the allocated memory twice */
            for (int j = 0; j < size4; j++) {
                /* memory initialization (if I skip this step my test ends in 0.000s) */
                buffer[j] = k;
            }
            /* print stats */
            printf("x [%d] %.2f\n", k + 1, (clock() - last) / (double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}

Output on the dev machine (partial):

x [1] 0.13 -- the first initialization takes a bit longer
x [2] 0.12 -- than the second one, but the difference is not significant
x [1] 0.13
x [2] 0.12
x [1] 0.15
x [2] 0.11
x [1] 0.14
x [2] 0.12
x [1] 0.14
x [2] 0.12
x [1] 0.13
x [2] 0.12
x [1] 0.14
x [2] 0.11
x [1] 0.14
x [2] 0.12 -- and the results are quite stable
...

Output on the production machine (partial):

x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- the first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...

Result of the improved (partitioned) test:

init 15, time 0.00s -- the malloc call takes no time.
x [try 1/part 0] time 0.07s -- usually the first fill of a buffer part is fast enough.
x [try 2/part 0] time 0.04s -- the second fill of a buffer part is always fast.
x [try 1/part 1] time 0.17s
x [try 2/part 1] time 0.05s -- second try...
x [try 1/part 2] time 0.07s
x [try 2/part 2] time 0.05s -- second try...
x [try 1/part 3] time 0.07s
x [try 2/part 3] time 0.04s -- second try...
x [try 1/part 4] time 0.08s
x [try 2/part 4] time 0.04s -- second try...
x [try 1/part 5] time 0.39s -- BUT sometimes it takes significantly longer than average to fill part of the allocated buffer.
x [try 2/part 5] time 0.05s -- second try...
x [try 1/part 6] time 0.35s
x [try 2/part 6] time 0.05s -- second try...
x [try 1/part 7] time 0.16s
x [try 2/part 7] time 0.04s -- second try...

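# Flush dirty pages to disk, then drop the page cache, dentries and inodes
sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'

# List processes whose command line mentions java, skipping the sed filter itself
ps axu | sed -ne '/ sed -ne /d; /java/p'

# Trace open() calls of COMMAND and its children, with timestamps and durations, into trace.log
strace -f -o trace.log -q -tt -T -e trace=open COMMAND...

# Same, but -ff writes one trace file per process (trace.PID)
strace -f -o trace -ff -q -tt -T -e trace=open COMMAND...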

# Extract the opened paths from the trace files and count files by number of extents
LANG=C LC_ALL=C sed -ne 's|^[^"]* open("\(.*\)", O[^"]*$|\1|p' trace.* \
| LANG=C LC_ALL=C xargs -r -d '\n' filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
  END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g

LANG=C LC_ALL=C strace -f -q -tt -T -e trace=open COMMAND... 2>&1 \
| LANG=C LC_ALL=C sed -ne 's|^[0-9:.]* open("\(.*\)", O[^"]*$|\1|p' \
| LANG=C LC_ALL=C xargs -r filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
  END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g

#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <stdio.h>

/* in work.c, adjust as needed */
void work_init(void);      /* Optional, allocations etc. */
void work(long iteration); /* Completely up to you, including parameters */
void work_done(void);      /* Optional, deallocations etc. */

#define PRIMING    0
#define REPEATS  100

int main(void)
{
    double          wall_seconds[REPEATS];
    struct timespec wall_start, wall_stop;
    long            iteration;

    work_init();

    /* Priming: do you want caches hot? */
    for (iteration = 0L; iteration < PRIMING; iteration++)
        work(iteration);

    /* Timed iterations */
    for (iteration = 0L; iteration < REPEATS; iteration++) {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        work(iteration);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
        wall_seconds[iteration] = (double)(wall_stop.tv_sec - wall_start.tv_sec)
                                + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    }

    work_done();

    /* TODO: wall_seconds[0] is the first iteration.
     *       Comparing to successive iterations (assuming REPEATS > 0)
     *       tells you about the initial latency.
    */

    /* TODO: Sort wall_seconds, for easier statistics.
     *       Most reliable value is the median, with half of the
     *       values larger and half smaller.
     *       Personally, I like to discard first and last 15.85%
     *       of the results, to get "one-sigma confidence" interval.
    */

    return 0;
}