Apache Spark: is there a way to use a wildcard in basePath when reading Parquet files?

apache-spark pyspark

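When using the basePath option with spark.read, is there a way to use a wildcard (*) to read multiple partitioned Parquet files that have different base paths in one go? For example: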

spark.read.option("basePath","s3://latest/data/*/").parquet(*dir)

This fails with the error:

error:   pyspark.sql.utils.IllegalArgumentException: u"Option 'basePath' must be a directory"

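No. You can combine multiple paths with a single base path to get the partition columns into the DataFrame schema, but you cannot specify multiple base paths, and you cannot use a wildcard as part of the base path string.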
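To make that rule concrete, here is a minimal PySpark sketch of one possible workaround: issue one read per literal base path and union the results. The helper names `read_under_base` and `read_many_bases`, and any sub-paths passed to them, are hypothetical and not from the original answer. The sketch also assumes every base path yields the same columns, since `unionByName` matches columns by name.

```python
from functools import reduce

# Characters that Hadoop-style path globs use; basePath must contain none of them.
GLOB_CHARS = set("*?[]{}")


def read_under_base(spark, base_path, paths=None):
    """Read Parquet data under one literal base path.

    `paths` may contain wildcards; `base_path` may not.
    """
    if GLOB_CHARS & set(base_path):
        raise ValueError(f"basePath must be a plain directory: {base_path!r}")
    base = base_path.rstrip("/") + "/"
    # With no explicit paths, read everything under the base directory.
    targets = list(paths) if paths else [base]
    return spark.read.option("basePath", base).parquet(*targets)


def read_many_bases(spark, base_paths):
    """Read each base path separately, then union the frames by column name."""
    frames = [read_under_base(spark, base) for base in base_paths]
    if not frames:
        raise ValueError("at least one base path is required")
    return reduce(lambda left, right: left.unionByName(right), frames)
```

With a real session this might be called as `read_many_bases(spark, ["s3://latest/data/a/", "s3://latest/data/b/"])`, where the `a` and `b` subdirectories are placeholders for whatever the wildcard was meant to match.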

You can simply pass the root path:

spark.read.parquet("s3://latest/data/")

with these options set:

spark.hive.mapred.supports.subdirectories    true
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive    true

Spark will then search recursively for Parquet files, starting at the /data/ folder and descending into its subdirectories.

The code below is an example:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("test")
    .set("spark.hive.mapred.supports.subdirectories","true")
    .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")

val spark = SparkSession.builder.config(conf).getOrCreate()

val df = spark.read.parquet("s3a://bucket/path/to/base/")

SCALA: I tested this with several CSV files of my own. The directory tree is:

.
|-- test=1
|   `-- test1.csv
`-- test=2
    `-- test2.csv

where the base path is s3://bucket/test/. The contents of each CSV file are:

test1.csv

x,y,z
tes,45,34
tes,43,67
tes,56,43
raj,45,43
raj,44,67

test2.csv

x,y,z
shd,43,34
adf,2,67

The command is:

val df = spark.read.option("header","true").csv("s3a://bucket/test/")

df.show(false)

The result is as follows:

+---+---+---+----+
|x  |y  |z  |test|
+---+---+---+----+
|tes|45 |34 |1   |
|tes|43 |67 |1   |
|tes|56 |43 |1   |
|raj|45 |43 |1   |
|raj|44 |67 |1   |
|shd|43 |34 |2   |
|adf|2  |67 |2   |
+---+---+---+----+

PYSPARK

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .master("yarn") \
  .appName("test") \
  .config("spark.hive.mapred.supports.subdirectories","true") \
  .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true") \
  .getOrCreate()

df = spark.read.option("header","true").csv("s3a://bucket/test/")
df.show(10, False)

+---+---+---+----+
|x  |y  |z  |test|
+---+---+---+----+
|tes|45 |34 |1   |
|tes|43 |67 |1   |
|tes|56 |43 |1   |
|raj|45 |43 |1   |
|raj|44 |67 |1   |
|shd|43 |34 |2   |
|adf|2  |67 |2   |
+---+---+---+----+

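I did not test the PySpark code, so please check that it is correct. Note that I used directory names such as test=x, which Spark treats as a partition structure, so the value appears as a column in the result.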
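The way `key=value` directory names turn into columns can be illustrated with a small pure-Python sketch. This is a simplified model, not Spark's actual implementation: the function name `partition_columns` is hypothetical, and values are kept as strings here, whereas Spark by default also infers a type for them.

```python
def partition_columns(file_path, base_path):
    """Return {column: value} for key=value directories between base_path and the file."""
    base = base_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError(f"{file_path!r} is not under {base!r}")
    relative = file_path[len(base):]
    directories = relative.split("/")[:-1]  # drop the file name itself
    columns = {}
    for directory in directories:
        key, sep, value = directory.partition("=")
        if sep and key:  # only key=value directories become columns
            columns[key] = value
    return columns


# Base path above the partition directory: the partition shows up as a column.
print(partition_columns("s3://bucket/test/test=1/test1.csv", "s3://bucket/test/"))
# Base path equal to the partition directory: nothing is left to become a column.
print(partition_columns("s3://bucket/test/test=1/test1.csv", "s3://bucket/test/test=1/"))
```

The second call shows why the base path matters: only directories below it can contribute partition columns, which is why it has to be a single concrete directory.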

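Can you explain that better?
Yes. Does this unclear answer need some explanation? Is it related to the original question?
I'm fairly sure I answered your question. May I ask why you can't accept the answer? I double-checked.
@thebluephantom I have accepted it. There is no way to do what I was hoping for.
To those who downvoted: I only posted this question after checking it myself. This site is sometimes harsh and sometimes unreasonable.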