Apache Spark: S3AbortableInputStream warning when reading large files from S3 in PySpark on AWS EMR



I keep hitting this on AWS EMR when reading a large dataset from S3 in PySpark:

INFO FileScanRDD: Reading File path: s3a://bucket/dir1/dir2/dir3/2018-01-31/part-XXX-YYYY-c000.snappy.parquet, 
range: 0-11383, partition values: [empty row]

WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. 
This is likely an error and may result in sub-optimal behavior. 
Request only the bytes you need via a ranged GET or drain the input stream after use.
The read itself is fairly standard:

df = spark.read.parquet(s3_path)
Has anyone run into this before? Any suggestions?

Thanks in advance.

This is a warning, not an error, as the WARN prefix indicates. You can safely ignore it, or try upgrading to Hadoop 2.9 or 3.0 to make it go away.
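If you just want to silence the message, one option is to raise the log threshold for the class that emits it. This is a sketch, assuming your cluster routes AWS SDK logging through log4j (the usual setup on EMR) and that the emitting class is the SDK v1 `com.amazonaws.services.s3.internal.S3AbortableInputStream`; add a line like this to the cluster's `log4j.properties`:

```
# Suppress the "Not all bytes were read" WARN messages from the AWS SDK;
# only ERROR-level messages from this class will still be logged.
log4j.logger.com.amazonaws.services.s3.internal.S3AbortableInputStream=ERROR
```

Note this only hides the message; the early-abort behavior described in the other answer is unchanged.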

These warnings are emitted by the AWS Java SDK because Hadoop deliberately aborts read operations early. (It looks like you are using s3a://, so Spark is talking to S3 through Hadoop.)

You can read more background on this warning in the discussion between the Hadoop committers responsible for S3A and the AWS Java SDK maintainers. The suggestion there is that the warning goes away if you use Hadoop 2.9 or 3.0, which bundle a newer version of the AWS SDK.
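If upgrading is not an option, Hadoop 2.8+ also lets you tell the s3a connector to use ranged GETs instead of sequential full-object reads, which suits columnar formats like Parquet and reduces half-read streams being aborted. A sketch, assuming you can set Spark configuration at submit time (the underlying Hadoop property is `fs.s3a.experimental.input.fadvise`):

```
# spark-defaults.conf (or pass via --conf on spark-submit)
# "random" fadvise makes s3a issue ranged GETs, which matches the
# seek-heavy access pattern of Parquet/ORC readers.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
```

As the property name suggests, this was marked experimental in those Hadoop releases, so test it against your workload before rolling it out.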