Amazon S3: partitions not pruned in a simple SparkSQL query


I'm trying to efficiently select individual partitions from a SparkSQL table (Parquet on S3). However, I'm seeing evidence that Spark opens all of the Parquet files in the table, not just the ones that pass the filter. This makes even small queries expensive for tables with a large number of partitions.

Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

import pandas

# Make some data (hiveContext is a pyspark.sql.HiveContext, e.g. the one from the PySpark shell)
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame
sdf = hiveContext.createDataFrame(df)
# And save it as a partitioned Parquet table on S3
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')
In a later session, I want to select a subset of this table:

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()
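
As a quick sanity check (an illustrative addition, reusing the same filtered_dataset as above), the query plan can be printed; if partition pruning works, the Parquet scan should only cover the pk=b directory:

# Illustrative check, not from the original question: print the logical and
# physical plans to see which partition directories the Parquet scan covers.
filtered_dataset.explain(True)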
In the logs printed while this runs, I can see that pruning is supposed to happen:

15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.
But then I see Parquet files being opened from every partition:

15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152

With only three partitions this isn't a problem, but with thousands of partitions it causes noticeable delays. Why are all these irrelevant files being opened?

Take a look at
spark.sql.parquet.filterPushdown
, which defaults to
false
because of some bugs in the Parquet version that Spark depends on. It may be usable in 1.3/1.4; check the official documentation.


I believe this has been resolved in Spark 1.5.

Could this be related to that issue? Yes, it turns out the performance of partitioned Parquet datasets is a common source of complaints. There is still work to be done, but Spark 1.5 makes some significant progress in this area.