PySpark: merge schemas by casting to string when reading Parquet?


python apache-spark pyspark


I am reading data from a Parquet file that has a Map type field, like this:

>>> df = spark.read.parquet('path/to/partiton')
>>> df.collect()
Row(field={'a': 'SomeString', 'b': '1234'})

>>> df.printSchema()
field: map (containsNull = true)
 |-- key: string
 |-- value: string(valueContainsNull = true)


The problem is that in other partitions, key 'a' is None, which causes key 'b' to be read as long type:

>>> df = spark.read.parquet('path/to/otherPartiton')
>>> df.collect()
Row(field={'a': None, 'b': 1234})

>>> df.printSchema()
field: map (containsNull = true)
 |-- key: string
 |-- value: long(valueContainsNull = true)

This produces conflicting schemas when all partitions are read at once:

>>> df = spark.read.parquet('path/to/')
>>> df.collect()
SparkException: ... java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

I have tried specifying the schema manually, as follows:


>>> from pyspark.sql.types import StructType, StructField, MapType, StringType
>>> struct = StructType([StructField('field', MapType(StringType(), StringType()))])
>>> df = spark.read.schema(struct).parquet('path/to/')
>>> df.collect()
# fails with the same error

Is there a way to solve this? Am I forced to rewrite the faulty partitions?

You can enforce your own schema on the DataFrame.

Define the schema in your application code and pass it to the reader, as shown below:

df = spark.read.csv(filename, header=True, nullValue='NA', schema=customschema)
df.show()


I have already tried using a custom schema. Read the last part of the question. Also, I am using Parquet, not CSV.

Why don't you use Datasets?

@IvanMilasevic Datasets are not available in PySpark.
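
One possible workaround, offered as a sketch and not a confirmed fix: the question shows that each partition reads fine on its own, so the partitions could be read separately, the map column cast to map<string,string> in each, and the results unioned. The function name read_partitions_as_string_map and its parameters are hypothetical; the column name 'field' and the paths in the usage comment come from the question. It assumes Spark's cast can convert a map with long values to a map with string values, and that the partition directories are listed explicitly.

```python
from functools import reduce


def read_partitions_as_string_map(spark, partition_paths, column="field"):
    """Read each partition separately, cast `column` to map<string,string>, then union.

    `spark` is an existing SparkSession. Reading one partition at a time lets
    Spark use that partition's own Parquet schema, so the string/long value
    conflict never reaches a single combined read.
    """
    if not partition_paths:
        raise ValueError("at least one partition path is required")

    frames = []
    for path in partition_paths:
        df = spark.read.parquet(path)
        # Assumption: casting the whole map column converts long values to strings.
        frames.append(df.withColumn(column, df[column].cast("map<string,string>")))

    # unionByName matches columns by name, not by position.
    return reduce(lambda left, right: left.unionByName(right), frames)


# Usage with the paths from the question (requires a running SparkSession):
# df = read_partitions_as_string_map(
#     spark, ["path/to/partiton", "path/to/otherPartiton"]
# )
# df.printSchema()
```

This avoids rewriting the faulty partitions, at the cost of one read per partition and of losing any partition columns that Spark would otherwise derive from the directory layout under 'path/to/'.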