Apache Spark: parsing highly nested JSON in PySpark

Tags: apache-spark, pyspark

I am trying to parse/read the following nested JSON into a PySpark DataFrame. It fails whether PySpark infers the schema or I pass the schema in myself.

I am running this on an AWS EMR cluster.

{ 
"coffee": {
    "region": [
        {"id":1,"name":"John Doe"},
        {"id":2,"name":"Don Joeh"}
    ],
    "country": {"id":2,"company":"ACME"}
}, 
"brewing": {
    "region": [
        {"id":1,"name":"John Doe"},
        {"id":2,"name":"Don Joeh"}
    ],
    "country": {"id":2,"company":"ACME"}
}
}
PySpark on its own cannot infer the schema and raises the following error:

    An error occurred while calling o745.json.
: java.lang.UnsupportedOperationException
    at org.apache.hadoop.fs.http.AbstractHttpFileSystem.listStatus(AbstractHttpFileSystem.java:91)
    at org.apache.hadoop.fs.http.HttpsFileSystem.listStatus(HttpsFileSystem.java:23)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:77)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:235)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
............
............
............

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
I also tried passing my own schema, as shown below.

Code:

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

c1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
region_schema = StructField('region', ArrayType(c1_schema))
country_schema = StructField('country', StructType([StructField("id", IntegerType()), StructField("company", StringType())]))
t_schema = StructType([StructField("coffee", StructType([region_schema, country_schema])), StructField("brewing", StructType([region_schema, country_schema]))])

df3 = spark.read.option("multiline", "true").json(path1, t_schema)
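
As an aside, the same schema can also be written as a DDL string, which spark.read.schema() accepts (Spark 2.3+). This is a sketch equivalent to the StructType above, not part of the original attempt:

ddl = """
    coffee  STRUCT<region: ARRAY<STRUCT<id: INT, name: STRING>>,
                   country: STRUCT<id: INT, company: STRING>>,
    brewing STRUCT<region: ARRAY<STRUCT<id: INT, name: STRING>>,
                   country: STRUCT<id: INT, company: STRING>>
"""
df3 = spark.read.option("multiline", "true").schema(ddl).json(path1)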

Unless there is something wrong with the JSON itself, your code is fine. Here is my version (I ran it locally on Spark 3.1.1):

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

# write the JSON from the question to a local file
obj = '''{"coffee": {"region": [{"id":1,"name":"John Doe"},{"id":2,"name":"Don Joeh"}], "country": {"id":2,"company":"ACME"}},
"brewing": {"region": [{"id":1,"name":"John Doe"},{"id":2,"name":"Don Joeh"}], "country": {"id":2,"company":"ACME"}}}'''
path1 = 'obj.json'
with open(path1, 'w') as f:
    f.write(obj)

# here is your exact schema
c1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
region_schema = StructField('region', ArrayType(c1_schema))
country_schema = StructField('country', StructType([StructField("id", IntegerType()), StructField("company", StringType())]))
t_schema = StructType([StructField("coffee", StructType([region_schema, country_schema])), StructField("brewing", StructType([region_schema, country_schema]))])

# here is your exact dataframe read
df3 = spark.read.option("multiline", "true").json(path1, t_schema)
And the result:

# df3.printSchema()
root
 |-- coffee: struct (nullable = true)
 |    |-- region: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: integer (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |-- country: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- company: string (nullable = true)
 |-- brewing: struct (nullable = true)
 |    |-- region: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: integer (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |-- country: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- company: string (nullable = true)

# df3.show(10, False)
+-------------------------------------------+-------------------------------------------+
|coffee                                     |brewing                                    |
+-------------------------------------------+-------------------------------------------+
|{[{1, John Doe}, {2, Don Joeh}], {2, ACME}}|{[{1, John Doe}, {2, Don Joeh}], {2, ACME}}|
+-------------------------------------------+-------------------------------------------+
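
If a flat, tabular shape is needed downstream, here is a minimal flattening sketch (my addition, assuming the df3 read above; explode turns each element of coffee.region into its own row):

from pyspark.sql.functions import col, explode

flat = (df3
        .select(explode("coffee.region").alias("r"),
                col("coffee.country.company").alias("company"))
        .select(col("r.id").alias("region_id"),
                col("r.name").alias("region_name"),
                "company"))
flat.show()

The same pattern applies to the brewing branch.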

Would you consider flattening the schema?

Right, even I can run this code. It seems to be something on the AWS EMR side.
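
For what it's worth, the stack trace fails inside Hadoop's AbstractHttpFileSystem.listStatus, i.e. while listing the input path and before any JSON parsing starts, which suggests the EMR job was pointed at an http(s):// URL rather than S3 or HDFS. A hedged workaround sketch (the requests fetch, the URL, and the RDD-based read are my assumptions, not from the original thread):

import requests

# fetch the document on the driver and parse it from an RDD of strings,
# bypassing Hadoop's http(s) filesystem entirely
url = "https://example.com/obj.json"  # hypothetical; substitute the real location
resp = requests.get(url)
rdd = spark.sparkContext.parallelize([resp.text])
df3 = spark.read.schema(t_schema).json(rdd)

Copying the file to S3 first and reading it back with an s3:// path should work just as well.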