
Delphi: How to rotate bitmaps efficiently in code

delphi image-processing


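Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply running a nested loop with inverted coordinates?

The bitmaps are 8bpp, typically 2048x2400x8bpp.

Currently I do this by simply copying with the coordinates swapped. Pseudocode: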

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

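(In reality I do it with pointers, which is slightly faster, but of roughly the same order of magnitude.)

For large images GDI is quite slow, and the GPU load/store times for textures (GF7 cards) are of the same order as the current CPU time.

Any tips or pointers? An in-place algorithm would be even better, but speed matters more than being in-place.

The target is Delphi, but this is more of an algorithmic question. SSE(2) vectorization is no problem; this is a big enough problem that I am willing to write it in assembler.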

Follow-up to Nils' answer

Image: 2048x2700 -> 2700x2048
Compiler: Turbo Explorer 2006 with optimization on
Windows: power scheme set to "Always On". (Important!!!!)
Machine: Core2 6600 (2.4 GHz)

Time with the old routine: 32 ms (step size 1)
Time with step size 8: 12 ms
Time with step size 16: 10 ms
Time with step size 32+: 9 ms

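Meanwhile I also tested on an Athlon 64 X2 (5200+ iirc), where the speed-up was slightly more than a factor of four (80 to 19 ms).

The speed-up is well worth it, thanks. Maybe during the summer months I will torture myself with an SSE(2) version. I have already thought about how to tackle it, though, and I think I will run out of SSE2 registers for a straightforward implementation: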

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>   
store r2, <target+1*rowsize>
..
store r8, <target+7*rowsize>

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;        rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

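Update 2: generics

I tried to update this code to a generic version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed that the bug also exists in XE2. Please vote for it :-)

Update 3: generics now work in XE10

Update 4

In 2017 I did some work on an assembler version and ran into shuffle bottlenecks, which Peter Cordes generously helped me resolve. That code still misses an opportunity: it needs another loop-tiling level to aggregate multiple 8x8 block iterations into pseudo-larger ones such as 64x64. Right now it works through whole rows again, and that is wasteful.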

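You might be able to improve it by copying in cache-aligned blocks rather than by rows, since at the moment the stride of either src or dest will cause a cache miss (depending on whether Delphi is row-major or column-major).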
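As a rough illustration of copying in cache-sized blocks, here is a minimal C sketch (C rather than Delphi, to match the C used in this thread). The names `rotate90_naive`, `rotate90_tiled`, `rotate90_selfcheck` and `TILE` are made up for this example, the rotation direction (clockwise) is an assumption, and the tile edge of 32 is simply the step size that timed best in the question. It is not the poster's actual routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { TILE = 32 };  /* assumed tile edge, taken from the timings in the question */

/* Reference: rotate a w x h 8bpp image 90 degrees clockwise, pixel by pixel.
   Source pixel (x, y) lands at destination (h - 1 - y, x); dst is h wide, w high.
   Pitches are in bytes and may be larger than the row width. */
static void rotate90_naive(const uint8_t *src, ptrdiff_t src_pitch,
                           uint8_t *dst, ptrdiff_t dst_pitch, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[x * dst_pitch + (h - 1 - y)] = src[y * src_pitch + x];
}

/* Same mapping, but visiting the image one TILE x TILE block at a time, so the
   source rows being read and the destination rows being written stay in cache. */
static void rotate90_tiled(const uint8_t *src, ptrdiff_t src_pitch,
                           uint8_t *dst, ptrdiff_t dst_pitch, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int by = 0; by < h; by += TILE) {
        int y_end = (by + TILE < h) ? by + TILE : h;      /* clip partial tiles */
        for (int bx = 0; bx < w; bx += TILE) {
            int x_end = (bx + TILE < w) ? bx + TILE : w;
            for (int y = by; y < y_end; y++) {
                const uint8_t *s = src + y * src_pitch + bx;
                uint8_t *d = dst + bx * dst_pitch + (h - 1 - y);
                for (int x = bx; x < x_end; x++) {
                    *d = *s++;
                    d += dst_pitch;   /* next destination row, same column */
                }
            }
        }
    }
}

/* Usage sketch: fill a w x h image with a pattern, rotate it both ways and
   compare. Returns 1 when both results are identical, 0 otherwise. */
int rotate90_selfcheck(int w, int h)
{
    size_t n = (size_t)w * (size_t)h;
    uint8_t *src = malloc(n), *a = malloc(n), *b = malloc(n);
    int same = 0;
    if (src && a && b) {
        for (size_t i = 0; i < n; i++)
            src[i] = (uint8_t)(i * 31u + (i >> 8));
        rotate90_naive(src, w, a, h, w, h);   /* pitch == width: no padding */
        rotate90_tiled(src, w, b, h, w, h);
        same = memcmp(a, b, n) == 0;
    }
    free(src);
    free(a);
    free(b);
    return same;
}
```

Clipping each tile at the image edge keeps the sketch short; handling the leftover strips in separate loops, as the Delphi routine in the question does, avoids the per-tile comparisons at the cost of more code.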

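Yes, there are faster ways to do this.

Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That is the size at which the cache performs worst.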
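To make the power-of-two remark concrete, the following C sketch counts how many distinct cache sets a column walk touches. The helper `distinct_sets` is hypothetical, and the cache geometry used in the note below (64 sets of 64-byte lines, as in a 32 KiB 8-way L1 data cache) is an assumption; the real figures depend on the CPU.

```c
#include <stddef.h>
#include <string.h>

/* Count how many distinct cache sets are touched when reading `count`
   addresses that are `stride` bytes apart, for a cache with `sets` sets of
   `line` bytes each. This sketch supports at most 4096 sets and returns -1
   for unsupported parameters. */
int distinct_sets(size_t stride, int count, int line, int sets)
{
    unsigned char seen[4096];
    int distinct = 0;
    if (sets <= 0 || sets > 4096 || line <= 0)
        return -1;
    memset(seen, 0, sizeof seen);
    for (int i = 0; i < count; i++) {
        /* set index = (address / line size) modulo number of sets */
        size_t set = ((size_t)i * stride / (size_t)line) % (size_t)sets;
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under those assumed numbers, `distinct_sets(2048, 64, 64, 64)` returns 2: walking down a column of an image with a 2048-byte row pitch keeps hitting the same two sets, so only 2 x 8 = 16 lines can be resident at once and each line is evicted long before its neighbouring pixels are used. With a pitch of 2112 bytes (one cache line more), `distinct_sets(2112, 64, 64, 64)` returns 64, meaning the same walk spreads over every set.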

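You can improve this rotation algorithm by improving the locality of your memory accesses.

A simple way to do this is to rotate each 8x8 pixel block on its own, using the same code you use for the whole bitmap, and wrap another loop around it that splits the image rotation into chunks of 8x8 pixels each.

For example, something like this (not checked, and sorry for the C code; my Delphi skills are not up to date):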

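// This is the outer loop that breaks your image rotation
// into chunks of 8x8 pixels each:
for (int block_x = 0; block_x < 2048; block_x += 8)
{
    for (int block_y = 0; block_y < 2048; block_y += 8)
    {
        // This is the inner loop that processes a block
        // of 8x8 pixels.
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                dest[x + block_x][y + block_y] = src[y + block_y][x + block_x];
    }
}

If the image is not square, you cannot do it in place. Even if you work with square images, the transform is not conducive to in-place work.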
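If you want to make things a little faster, you can try to take advantage of the row strides, but I think the best you would do is to read 4 bytes at a time from the source and then write them into four consecutive rows in the dest. That should cut some of the overhead, but I would not expect more than a 5% improvement.

If you can use C++, you may want to look at Eigen.

Eigen is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

Fast. (See the benchmarks.)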

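Expression templates allow temporaries to be removed intelligently and lazy evaluation to be enabled where appropriate. Eigen takes care of this automatically and handles aliasing too in most cases.

Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally for whole expressions.

With fixed-size objects, dynamic memory allocation is avoided and loops are unrolled when that makes sense.

For large matrices, special attention is paid to cache-friendliness.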
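
(Isn't the problem that one of the two sides of the = is always unaligned? I either walk src linearly or dst, but never both.)

No, you can copy a block and transpose it within the cache, as Nils demonstrates (assuming the machine he is on has 256-byte cache lines).

OK, sorry, then I misunderstood. Yes, Nils' answer is what I meant.

OK, thanks. This sounds promising. A 10x speed-up is enough for now, and I will not bother messing with SSE. At least not yet. I had a rough feeling that working in blocks was the way to go; this both confirms it and gives me an implementation.

This technique is called loop tiling.

Oh, and don't forget that you can scale this in parallel. Just start two threads and let each one process half of the image.

By the way, Marco, once you have implemented it, let us know how much of a speed-up you got. I am just curious how it performs in a real-world application.

Another update. This summer I did an assembler version. For smaller image sizes it is 4 times faster than the loop-tiled version, and equally correct, but I ...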