Hardware errors in memory during a large simulation on 128 cores

For an astrophysics project, I launched a large simulation (the Enzo code) with MPI on 128 cores, as follows:

mpirun -np 128 ./enzo.exe amr_cosmology.enzo
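During the run I got the errors shown below. They are flagged as [Hardware Error], so I concluded that one of the RAM modules (1 GB of RAM in total) is bad. As you can see, the code does not stop, but messages like these appear frequently throughout the run: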

TopGrid dt = 3.705042e-02     time = 1.2350099725762    cycle = 14    z = 834.55610989934
TopGrid dt = 3.816191e-02     time = 1.272060395839    cycle = 15    z = 818.25224654732
TopGrid dt = 3.930675e-02     time = 1.3102223091899    cycle = 16    z = 802.26651295398

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711318] [Hardware Error]: Corrected error, no action required.

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711377] [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2041000000011b

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711387] [Hardware Error]: Error Addr: 0x0000001c9f3d4ac0

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711388] [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x0f5940000a801001

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711399] [Hardware Error]: Unified Memory Controller Extended Error Code: 0

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711407] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711422] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711474] [Hardware Error]: Corrected error, no action required.

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711479] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711483] [Hardware Error]: Error Addr: 0x0000001ee2f9b140

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711484] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xda9020000a800d01

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711489] [Hardware Error]: Unified Memory Controller Extended Error Code: 0

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711492] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.

Message from syslogd@pablo at Sep 24 20:52:00 ...
 kernel:[2415943.711497] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
TopGrid dt = 4.048593e-02     time = 1.3495290567141    cycle = 17    z = 786.59270291163
TopGrid dt = 4.170048e-02     time = 1.3900149827028    cycle = 18    z = 771.22472945212
TopGrid dt = 4.295147e-02     time = 1.4317154617942    cycle = 19    z = 756.15662471201

What kind of error is this: is it corrected automatically, or is it a genuine hardware failure? Either way, something is not right.
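The two status words in the log can be decoded by hand to see what the kernel is reporting. The sketch below assumes the standard x86 machine-check status layout (Val = bit 63, Over = bit 62, UC = bit 61, En = bit 60, MiscV = bit 59, AddrV = bit 58, PCC = bit 57). The helper name `decode_mca_status` is made up for illustration, and vendor-specific bits such as SyndV and CECC are left out.

```python
# Flag bits shared by x86 machine-check status registers (MCi_STATUS).
MCA_STATUS_BITS = {
    "Val": 63,    # register holds a valid error record
    "Over": 62,   # another error arrived before this one was cleared
    "UC": 61,     # error was NOT corrected by hardware
    "En": 60,     # reporting was enabled for this error
    "MiscV": 59,  # MISC register holds extra information
    "AddrV": 58,  # ADDR register holds the failing address
    "PCC": 57,    # processor context may be corrupted
}


def decode_mca_status(status):
    """Return the common flag bits of a 64-bit status word as booleans."""
    flags = {name: bool((status >> bit) & 1) for name, bit in MCA_STATUS_BITS.items()}
    # A valid record with UC clear is a corrected error (the "CE" in the log).
    flags["corrected"] = flags["Val"] and not flags["UC"]
    return flags


if __name__ == "__main__":
    # The two status values printed in the kernel messages above.
    for raw in (0x9C2041000000011B, 0xDC2041000000011B):
        flags = decode_mca_status(raw)
        kind = "corrected" if flags["corrected"] else "uncorrected"
        overflow = "overflow" if flags["Over"] else "no overflow"
        print(f"{raw:#018x}: {kind}, {overflow}")
```

Both records have UC clear, which matches the "Corrected error, no action required" lines. The second one also has Over set, meaning further errors arrived before the previous record was read.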

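This is caused by faulty RAM. The messages report corrected errors (the CE flag and "Corrected error, no action required"), so ECC repaired the data on each read. ECC corrections occurring this frequently, as in your case, are nevertheless a sign of failing hardware. The fix is to identify the memory module responsible and replace it. If this is not a critical system, you may not need to fix it right away.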
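One way to narrow down the module is to read the per-DIMM counters that the Linux EDAC subsystem exposes in sysfs. The sketch below assumes that an EDAC driver is loaded for the memory controller and that the per-DIMM layout (`mc*/dimm*/dimm_ce_count`) is available. The function names are illustrative, and `dimm_label` may be empty or generic unless labels have been configured for the board.

```python
from pathlib import Path

# Standard EDAC sysfs location; present only when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def dimm_error_counts(root=EDAC_ROOT):
    """Collect corrected/uncorrected error counters for each DIMM under root."""
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        ce = read_text(dimm / "dimm_ce_count")
        if ce is None:
            continue  # not a per-DIMM directory, or counter not exposed
        ue = read_text(dimm / "dimm_ue_count")
        rows.append({
            "controller": dimm.parent.name,
            "dimm": dimm.name,
            "label": read_text(dimm / "dimm_label") or "unknown",
            "corrected": int(ce),
            "uncorrected": int(ue or 0),
        })
    # Highest corrected-error count first: the most likely culprit.
    return sorted(rows, key=lambda row: row["corrected"], reverse=True)


if __name__ == "__main__":
    for row in dimm_error_counts():
        print(f'{row["controller"]}/{row["dimm"]} ({row["label"]}): '
              f'CE={row["corrected"]} UE={row["uncorrected"]}')
```

A DIMM whose corrected count keeps climbing while the others stay at zero is the candidate for replacement.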

In some cases, RAM that is not running at its intended frequency can also cause this.
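To check the frequency, the rated and configured speed of each module can be compared in the output of `dmidecode -t memory` (which needs root). The sketch below only parses text that is passed in. It assumes the field names `Locator`, `Speed` and `Configured Memory Speed` (older dmidecode versions print `Configured Clock Speed`), and `configured_speeds` is a hypothetical helper name.

```python
import re

SPEED_RE = re.compile(r"(\d+)\s*(?:MT/s|MHz)")


def parse_speed(value):
    """Turn '3200 MT/s' into 3200; return None for 'Unknown' or a missing value."""
    match = SPEED_RE.search(value or "")
    return int(match.group(1)) if match else None


def configured_speeds(dmidecode_text):
    """Extract rated vs configured speed for each Memory Device record."""
    devices = []
    # dmidecode separates records with a blank line.
    for block in dmidecode_text.split("\n\n"):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip other record types
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        configured = fields.get("Configured Memory Speed",
                                fields.get("Configured Clock Speed"))
        devices.append({
            "locator": fields.get("Locator", "unknown"),
            "rated": parse_speed(fields.get("Speed")),
            "configured": parse_speed(configured),
        })
    return devices
```

With the real output captured, for example through `subprocess.run(["dmidecode", "-t", "memory"], capture_output=True, text=True)`, a module whose configured speed differs from its rated speed, or from that of its neighbours, is worth a closer look in the BIOS settings.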

For more information, see the reference.