Apache Spark / PySpark -- AWS EMR: Job aborted due to stage failure: ShuffleMapStage -- Failed to connect to ip-XXX-XX-XX-XX.ec2.internal/XXX.XX.XX.XX:XXXXX


I have a monthly data pipeline on AWS EMR that used to run fine. In the last run we received a much larger data load than usual. Now, when I submit the job, I get strange errors and HDFS also becomes corrupted.

Error:

An error occurred while calling o688.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 183 (showString at NativeMethodAccessorImpl.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-XXX-XX-XX-XX.ec2.internal/XXX.XX.XX.XX:XXXXX   at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)    
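The stage that fails is the shuffle behind a .show() call after a wide transformation (that is what showString refers to in the trace). A simplified sketch of the shape of the operation, with placeholder paths and column names rather than the real pipeline code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("monthly-pipeline").getOrCreate()

# Placeholder path and column names; the real pipeline reads the monthly load.
events = spark.read.parquet("hdfs:///data/monthly/events")

# A wide transformation like this produces the ShuffleMapStage that fails;
# the .show() call is what appears as showString in the stack trace.
agg = events.groupBy("customer_id").agg(F.sum("amount").alias("total"))
agg.show(20)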
Once this error occurs, I check the health of the cluster with

hdfs fsck /
and see:

Total size:    15157892117 B (Total open files size: 135183786 B)
 Total dirs:    64
 Total files:   1787
 Total symlinks:                0 (Files currently being written: 3)
 Total blocks (validated):      1847 (avg. block size 8206763 B) (Total open file blocks (not validated): 4)
  ********************************
  UNDER MIN REPL'D BLOCKS:      193 (10.449377 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        177
  MISSING BLOCKS:       193
  MISSING SIZE:         5932451768 B
  CORRUPT BLOCKS:       193
  ********************************
 Minimally replicated blocks:   1654 (89.55062 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       1413 (76.50243 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     1.6605306
 Corrupt blocks:                193
 Missing replicas:              1413 (30.237535 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Mon Oct 05 23:54:14 UTC 2020 in 21 milliseconds


The filesystem under path '/' is CORRUPT
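To find out which files those corrupt blocks belong to, fsck can also be run with the -list-corruptfileblocks flag; a minimal wrapper for that (nothing custom, it just shells out to the standard HDFS command):

import subprocess

# Prints the paths of files that currently have corrupt blocks.
result = subprocess.run(
    ["hdfs", "fsck", "/", "-list-corruptfileblocks"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)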
Also, I used to be able to run the pipeline with fewer than 15 task nodes, but to avoid out-of-memory errors I have had to scale it up to 45 task nodes, and I still run into this error.
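For context, the job is created with fairly standard Spark settings; a rough sketch of the configuration involved (the values below are placeholders rather than our exact numbers), mainly to show which shuffle- and memory-related knobs are still at their defaults:

from pyspark.sql import SparkSession

# Placeholder values, not the pipeline's real settings; shown only to indicate
# which knobs are in play (shuffle partitions, executor memory, shuffle retries).
spark = (
    SparkSession.builder
    .appName("monthly-pipeline")
    .config("spark.sql.shuffle.partitions", "200")   # default
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.shuffle.io.maxRetries", "3")      # default
    .config("spark.network.timeout", "120s")         # default
    .getOrCreate()
)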

Do you have any idea what is going on here and why it keeps failing?