Apache Spark: parsing highly nested JSON in PySpark

Tags: apache-spark, pyspark

I am trying to parse/read the following nested JSON into a PySpark DataFrame. It fails whether PySpark infers the schema or I pass the schema in myself.

I am running this on an AWS EMR cluster.

{ 
"coffee": {
    "region": [
        {"id":1,"name":"John Doe"},
        {"id":2,"name":"Don Joeh"}
    ],
    "country": {"id":2,"company":"ACME"}
}, 
"brewing": {
    "region": [
        {"id":1,"name":"John Doe"},
        {"id":2,"name":"Don Joeh"}
    ],
    "country": {"id":2,"company":"ACME"}
}
}
PySpark on its own cannot infer the schema and raises the following error:

    An error occurred while calling o745.json.
: java.lang.UnsupportedOperationException
    at org.apache.hadoop.fs.http.AbstractHttpFileSystem.listStatus(AbstractHttpFileSystem.java:91)
    at org.apache.hadoop.fs.http.HttpsFileSystem.listStatus(HttpsFileSystem.java:23)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:77)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:235)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
............
............
............

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
I also tried passing my own schema, as shown below.

Code:

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

c1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
region_schema = StructField('region', ArrayType(c1_schema))
country_schema = StructField('country', StructType([StructField("id", IntegerType()), StructField("company", StringType())]))
t_schema = StructType([StructField("coffee", StructType([region_schema, country_schema])), StructField("brewing", StructType([region_schema, country_schema]))])

df3 = spark.read.option("multiline", "true").json(path1, t_schema)
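
As an aside, the same schema can also be written as a DDL string, which spark.read.schema() accepts (Spark 2.3+). This is a sketch equivalent to the StructType above, not part of the original attempt:

ddl = """
    coffee  STRUCT<region: ARRAY<STRUCT<id: INT, name: STRING>>,
                   country: STRUCT<id: INT, company: STRING>>,
    brewing STRUCT<region: ARRAY<STRUCT<id: INT, name: STRING>>,
                   country: STRUCT<id: INT, company: STRING>>
"""
df3 = spark.read.option("multiline", "true").schema(ddl).json(path1)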

Unless there is something wrong with the JSON itself, your code is fine. Here is my version (I ran it locally on Spark 3.1.1):

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

# write the JSON from the question to a local file
obj = '''{"coffee": {"region": [{"id":1,"name":"John Doe"},{"id":2,"name":"Don Joeh"}], "country": {"id":2,"company":"ACME"}},
"brewing": {"region": [{"id":1,"name":"John Doe"},{"id":2,"name":"Don Joeh"}], "country": {"id":2,"company":"ACME"}}}'''
path1 = 'obj.json'
with open(path1, 'w') as f:
    f.write(obj)

# here is your exact schema
c1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
region_schema = StructField('region', ArrayType(c1_schema))
country_schema = StructField('country', StructType([StructField("id", IntegerType()), StructField("company", StringType())]))
t_schema = StructType([StructField("coffee", StructType([region_schema, country_schema])), StructField("brewing", StructType([region_schema, country_schema]))])

# here is your exact dataframe read
df3 = spark.read.option("multiline", "true").json(path1, t_schema)
And the result:

# df3.printSchema()
root
 |-- coffee: struct (nullable = true)
 |    |-- region: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: integer (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |-- country: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- company: string (nullable = true)
 |-- brewing: struct (nullable = true)
 |    |-- region: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: integer (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |-- country: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- company: string (nullable = true)

# df3.show(10, False)
+-------------------------------------------+-------------------------------------------+
|coffee                                     |brewing                                    |
+-------------------------------------------+-------------------------------------------+
|{[{1, John Doe}, {2, Don Joeh}], {2, ACME}}|{[{1, John Doe}, {2, Don Joeh}], {2, ACME}}|
+-------------------------------------------+-------------------------------------------+
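
If a flat, tabular shape is needed downstream, here is a minimal flattening sketch (my addition, assuming the df3 read above; explode turns each element of coffee.region into its own row):

from pyspark.sql.functions import col, explode

flat = (df3
        .select(explode("coffee.region").alias("r"),
                col("coffee.country.company").alias("company"))
        .select(col("r.id").alias("region_id"),
                col("r.name").alias("region_name"),
                "company"))
flat.show()

The same pattern applies to the brewing branch.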

Would you consider flattening the schema?

Right, even I can run this code. It seems to be something on the AWS EMR side.
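
For what it's worth, the stack trace fails inside Hadoop's AbstractHttpFileSystem.listStatus, i.e. while listing the input path and before any JSON parsing starts, which suggests the EMR job was pointed at an http(s):// URL rather than S3 or HDFS. A hedged workaround sketch (the requests fetch, the URL, and the RDD-based read are my assumptions, not from the original thread):

import requests

# fetch the document on the driver and parse it from an RDD of strings,
# bypassing Hadoop's http(s) filesystem entirely
url = "https://example.com/obj.json"  # hypothetical; substitute the real location
resp = requests.get(url)
rdd = spark.sparkContext.parallelize([resp.text])
df3 = spark.read.schema(t_schema).json(rdd)

Copying the file to S3 first and reading it back with an s3:// path should work just as well.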