Apache Spark: S3AbortableInputStream warning when reading large files from S3 in PySpark on AWS EMR
When reading large datasets from S3 in PySpark, I keep running into this error on AWS EMR:
INFO FileScanRDD: Reading File path: s3a://bucket/dir1/dir2/dir3/2018-01-31/part-XXX-YYYY-c000.snappy.parquet,
range: 0-11383, partition values: [empty row]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection.
This is likely an error and may result in sub-optimal behavior.
Request only the bytes you need via a ranged GET or drain the input stream after use.
The read itself is pretty standard:
df = spark.read.parquet(s3_path)
Has anyone run into this before? Any suggestions?

Thanks in advance.

This is a warning, not an error (it literally says WARN). You can safely ignore it, or try upgrading to Hadoop 2.9 or 3.0 to make it go away.
These warnings are raised by the AWS Java SDK because Hadoop deliberately aborts the read early. (It looks like you are using s3a://, so Spark talks to S3 through Hadoop.)
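If upgrading is not an option, the warning's own advice ("request only the bytes you need via a ranged GET") maps to an S3A setting: since Hadoop 2.8 the connector has an experimental fadvise policy that switches to ranged reads, which suits columnar formats like Parquet. A sketch of the configuration (verify the property against your Hadoop version, since it is marked experimental):

```properties
# spark-defaults.conf (or pass via --conf on spark-submit).
# Assumes Hadoop 2.8+. "random" makes S3A issue ranged GETs instead of
# streaming the whole object, so aborted reads waste far fewer bytes.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
```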
You can read more about this warning in the discussion between the Hadoop committer responsible for S3A and the maintainers of the AWS Java SDK. The suggestion there is the same: with Hadoop 2.9 or 3.0 (which ship a newer version of the AWS SDK), the warning goes away.
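In the meantime, if you just want to quiet the message on your current EMR release, you can raise the log level for that one class. A sketch using log4j.properties (the logger name is an assumption taken from the class shown in the warning itself):

```properties
# log4j.properties fragment, e.g. in $SPARK_HOME/conf/.
# Logger name assumed from the warning's emitting class:
log4j.logger.com.amazonaws.services.s3.internal.S3AbortableInputStream=ERROR
```

This only hides the message; the underlying early-abort behavior is unchanged, which, per the answer above, is harmless.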