Optimize and rewrite the following C code

c optimization

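This is a textbook-style question that involves rewriting some C code so that it performs best on a given processor architecture.

Given: the target is a single superscalar processor with 4 adder units and 2 multiplier units.

The input structure (initialized elsewhere):

Below is the routine that operates on this data. Obviously, correctness must be ensured, but the goal is to optimize the crap out of it.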

int compute(int x, int *r, int *q, int *p) {

    int i;
    for(i = 0; i < 100; i++) {

        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }

    return i;
}

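So this routine has 3 arithmetic statements that update the integers r, q, and p.

Here is my attempt, with comments explaining my thinking: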

//Use temp variables so we don't keep using loads and stores for mem accesses; 
//hopefully the temps will just be kept in the register file
int r_temp = *r;
int q_temp = *q;

for (i=0;i<99;i = i+2) {
    struct s data1 = input[i];   //input[i] is a struct, not an int
    struct s data2 = input[i+1]; //going to try partially unrolling loop
    int a1 = data1.a;
    int a2 = data2.a;
    int b1 = data1.b;
    int b2 = data2.b;
    int v1 = data1.v;
    int v2 = data2.v;

    //I will use brackets to make clear the order of operations I was planning
    //with respect to the functional (adder, mul) units available

    //This is calculating the next iteration's new q value 
    //from q += v1 + a1 + b1, or q(new)=q(old)+v1+a1+b1

    q_temp = ((v1+q_temp)+(a1+b1)) + ((a2+b2)+v2);
    //For the first step I am trying to use a max of 3 adders in parallel, 
    //saving one to start the next computation

    //This is calculating next iter's new r value 
    //from r *= v1 + x, or r(new) = r(old)*(v1+x)

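    //Two iterations folded together: r(new) = r(old)*(v1+x)*(v2+x), so the
    //second factor must be multiplied in, not added
    r_temp = ((r_temp*v1) + (r_temp*x)) * (v2+x);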
}
//Because i will end on i=98 and I only unrolled by 2, I don't need to 
//worry about final few values because there will be none

*p = input[99].v; //Why it's in the loop I don't understand, this should be correct
*r = r_temp;
*q = q_temp;
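
One common way to give the extra adders and multipliers independent work is to keep two separate accumulators for each result, so that consecutive multiplies and adds do not all wait on one another. The following is only a minimal sketch of that idea, not a definitive answer to the exercise. It assumes the field types from the reordered struct (unsigned v, short a, short b), passes input as a parameter instead of using the global array, and assumes r, q, and p do not alias each other or the input. The names compute_ref, compute_unrolled, and results_match are made up for illustration. The accumulators are unsigned so that wraparound is well defined; the final conversion back to int is implementation-defined, as it already is in the original loop.

```c
#include <assert.h>

#define N 100

/* Assumed layout: field types copied from the reordered struct. */
struct s {
    unsigned v;
    short a;
    short b;
};

/* Reference: the loop from the question, with input passed as a parameter. */
int compute_ref(const struct s *input, int x, int *r, int *q, int *p) {
    int i;
    for (i = 0; i < N; i++) {
        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }
    return i;
}

/* Hypothetical variant: unrolled by 2 with two accumulators per result.
   Assumes r, q, p do not alias each other or input. */
int compute_unrolled(const struct s *input, int x, int *r, int *q, int *p) {
    unsigned ux = (unsigned)x;
    unsigned r0 = (unsigned)*r, r1 = 1u; /* two independent product chains */
    unsigned q0 = (unsigned)*q, q1 = 0u; /* two independent sum chains */
    int i;

    assert(N % 2 == 0); /* there is no tail loop below */
    for (i = 0; i < N; i += 2) {
        unsigned v0 = input[i].v, v1 = input[i + 1].v;
        r0 *= v0 + ux;
        r1 *= v1 + ux;
        q0 += (v0 + (unsigned)input[i].a) + (unsigned)input[i].b;
        q1 += (v1 + (unsigned)input[i + 1].a) + (unsigned)input[i + 1].b;
    }
    *p = (int)input[N - 1].v; /* only the last store to *p is observable */
    *r = (int)(r0 * r1);      /* combine the two partial products */
    *q = (int)(q0 + q1);      /* combine the two partial sums */
    return i;
}

/* Returns 1 if both versions agree on some sample data. */
int results_match(int x) {
    struct s input[N];
    int r1 = 1, q1 = 5, p1 = 0, r2 = 1, q2 = 5, p2 = 0, i, n1, n2;
    for (i = 0; i < N; i++) {
        input[i].v = 2u * (unsigned)(i % 5);
        input[i].a = (short)(i - 50);
        input[i].b = (short)(3 * i);
    }
    n1 = compute_ref(input, x, &r1, &q1, &p1);
    n2 = compute_unrolled(input, x, &r2, &q2, &p2);
    return n1 == n2 && r1 == r2 && q1 == q2 && p1 == p2;
}
```

Calling results_match(1) from a test harness compares the two versions on the sample data. Whether the unrolled form is actually faster on a given machine is something to measure, since an optimizing compiler may already perform a similar transformation.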

Yes, it is possible to take advantage of the two shorts. Rearrange the structure:
struct s {
    unsigned v;
    short a;
    short b;
} input[100];

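Also, you may be able to align the fields better in memory on your architecture, which may allow more of these structs to fall within the same memory page, which in turn may mean fewer page faults.
This is all speculative, which is why it is so important to profile.
If you have the right architecture, rearranging will give you better data structure alignment, which leads to higher data density in memory, because fewer bits are lost to the padding needed to keep each type aligned to the data boundaries that common memory architectures impose.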
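
The padding effect can be checked directly with sizeof and offsetof. The sketch below compares the reordered struct with a hypothetical ordering that places the unsigned member between the two shorts. That interleaved ordering is an assumption made for illustration, not necessarily the layout from the question. The names struct interleaved, struct reordered, and print_layouts are likewise made up.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical ordering: the wider member sits between the two shorts,
   so the compiler may need padding before v and after b. */
struct interleaved {
    short a;
    unsigned v;
    short b;
};

/* Ordering from the answer: widest member first. */
struct reordered {
    unsigned v;
    short a;
    short b;
};

/* Prints the size and member offsets chosen by the current compiler. */
void print_layouts(void) {
    printf("interleaved: size=%zu a@%zu v@%zu b@%zu\n",
           sizeof(struct interleaved),
           offsetof(struct interleaved, a),
           offsetof(struct interleaved, v),
           offsetof(struct interleaved, b));
    printf("reordered:   size=%zu v@%zu a@%zu b@%zu\n",
           sizeof(struct reordered),
           offsetof(struct reordered, v),
           offsetof(struct reordered, a),
           offsetof(struct reordered, b));
}
```

On a typical ABI with a 4-byte unsigned and a 2-byte short, the interleaved form would be expected to come out larger than the reordered one, but the exact numbers are platform-dependent, which is why printing them on the actual target is worthwhile.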
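Are you allowed to change the data layout from an array of structs to a struct of arrays?

As far as I know, the input structure stays as it is. I don't know whether the fact that two of the fields are of type short can somehow be used to help the optimization.

I don't know, but switching to a struct of arrays might allow it to be vectorized.

If you can move the local declarations outside the loop, it makes the loop smaller and gives the CPU's loop cache a chance to work. For small loops this is usually a 32- or 64-byte cache. While execution stays in the loop, no instructions are fetched from memory.

I meant the µOp cache.

Double post... But, just for grins, I ran the code through gcc 4.5.3 on an x86_64 system. With no optimization (-O0) there are 40 arithmetic operations per loop. With -O3, 25 of them are removed, leaving 15.

short is guaranteed to be smaller than or equal in size to int.

Rearranging is definitely a good idea; it never hurts. A general rule of thumb, which may not be perfect but is better than no rule of thumb, is to list struct members from longest to shortest. It tends to give better packing than a random ordering.

This is a very subtle issue that I hadn't considered, thank you.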
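
To make the struct-of-arrays suggestion concrete, here is a minimal sketch of what that layout could look like for the q accumulation only. It assumes the layout is allowed to change, which the question does not confirm. The names struct s_soa and sum_q are hypothetical. Whether a given compiler actually vectorizes the loop has to be checked on the target, for example by inspecting the generated assembly.

```c
#include <stddef.h>

#define N 100

/* Hypothetical struct-of-arrays layout: each field is stored contiguously,
   so neighbouring iterations read neighbouring memory. */
struct s_soa {
    unsigned v[N];
    short a[N];
    short b[N];
};

/* Accumulates q over all elements. Unsigned arithmetic keeps wraparound
   well defined; the result is converted back to int at the end. */
int sum_q(const struct s_soa *in, int q) {
    unsigned acc = (unsigned)q;
    size_t i;
    for (i = 0; i < N; i++) {
        acc += in->v[i] + (unsigned)in->a[i] + (unsigned)in->b[i];
    }
    return (int)acc;
}
```

The product for r could be accumulated in a second loop of the same shape over v alone, which keeps each loop body small.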