Assembly/_asm inline

I am learning assembly and writing some inline assembly in my Digital Mars C++ compiler. I searched for ways to improve a program and ended up with these points to tune it:

- use a better C++ compiler (thinking of GCC or the Intel compiler)
- use assembly only in the critical parts of the program
- find a better algorithm
- cache misses, cache contention
- loop-carried dependency chains
- instruction fetching time
- instruction decoding time
- instruction retirement
- register read stalls
- execution port throughput
- execution unit throughput
- suboptimal reordering and scheduling of micro-ops
- branch misprediction
- floating point exceptions


c++ optimization assembly inline-assembly


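I understand all of these except "register read stalls".

Question 1: Can someone explain how out-of-order execution happens in the CPU, and how its "superscalar" form works? Plain out-of-order execution seems logical to me, but I could not find a logical explanation of the "superscalar" form.

Question 2: Can someone point me to a good instruction list for SSE2 and newer CPUs, with micro-op tables, port throughput, execution units and latency tables, so that I can find the real bottleneck of a piece of code?

I would be happy with a small example like this: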

//breaking a loop-carried dependency chain:
__asm
{
loop_begin:
....
....
sub edx,05h //instead of computing i*5 in each iteration, subtract 5 each iteration
sub ecx,01h //i-- counter
...
...
jnz loop_begin //edit: sub ecx has to come after sub edx so that jnz tests the counter
}
//sub edx gets rid of a multiplication and also makes edx independent of ecx
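
For comparison, the same idea can be sketched in plain C++. This is only an illustration under assumptions: the function names are made up, and the loop body (summing `i * 5`) is a stand-in for the elided instructions above. The variable `scaled` plays the role of `edx` and `i` plays the role of `ecx`.

```cpp
#include <cstdint>

// Straightforward version: multiplies the counter by 5 in every iteration,
// so the value that is used depends on the counter.
std::uint32_t sum_with_multiply(std::uint32_t n) {
    std::uint32_t sum = 0;
    for (std::uint32_t i = n; i != 0; --i) {
        sum += i * 5;
    }
    return sum;
}

// Strength-reduced version: a second variable tracks i * 5 and is updated
// with a subtraction, independently of the counter.
std::uint32_t sum_with_subtract(std::uint32_t n) {
    std::uint32_t sum = 0;
    std::uint32_t scaled = n * 5;              // hypothetical stand-in for edx
    for (std::uint32_t i = n; i != 0; --i) {   // i is the stand-in for ecx
        sum += scaled;
        scaled -= 5;                           // "sub edx,05h"
    }
    return sum;
}
```

Both functions return the same value for every `n`. An optimizing compiler may well perform this transformation on its own, so it is worth checking the generated code before doing it by hand.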

Thanks, everyone.

Computer: Pentium M 2 GHz, Windows XP 32-bit

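You should take a look at Agner Fog's optimization manuals.

But to really beat a modern compiler, you need good background knowledge of the architecture you are optimizing for.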


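My two cents: very detailed, and it also covers all the SSE instructions, with opcodes, instruction latencies and throughputs, and all the gory details you might need :)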

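For instruction scheduling, "superscalar" stalls are an additional problem. Modern processors can not only execute instructions out of order, they can also execute 3-4 simple instructions at once using parallel execution units.

But to actually achieve that, the instructions must be sufficiently independent of each other. For example, if an instruction uses the result of the previous instruction, it has to wait until that result is available.

In practice, this makes it extremely hard to write an optimal assembly program by hand. You have to work out the best instruction order the way a computer (the compiler) does. And if you change one instruction, you have to do it all over again...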
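
The independence requirement described above can be sketched in C++ with a simple reduction. This is a minimal illustration, not a benchmark: the function names are hypothetical, and whether the second version is actually faster depends on the CPU, the compiler and the operation involved. In the first function every addition needs the result of the previous one, so the additions form a single dependency chain. In the second, four accumulators form four chains that do not depend on each other, which gives a superscalar core the chance to work on several additions at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: each addition depends on the result of the previous one.
std::uint64_t sum_single_chain(const std::vector<std::uint32_t>& v) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        acc += v[i];
    }
    return acc;
}

// Four accumulators: four dependency chains that are independent of each other.
std::uint64_t sum_four_chains(const std::vector<std::uint32_t>& v) {
    std::uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i) {   // leftover elements when n is not a multiple of 4
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Integers are used here so that both versions return exactly the same result. With floating-point values, reordering the additions like this can change the rounding.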


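For question 1, I strongly recommend [...]. It explains the concepts well in context, so you can see the big picture. The examples are also very useful for anyone interested in optimizing code, because they always focus on prioritizing and improving the bottleneck.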


I have already read some of Agner's articles. They are good, but for some small parts I need other sources. Thanks.