
JSON: How to explode a struct using pyspark explode()

How do I convert the JSON below into the relational rows that follow it? The point I keep getting stuck on is that the pyspark explode() function throws an exception due to a type mismatch. I have not found a way to coerce the data into a suitable format so that I can create a row for each object under the source key of the sample_json object.

JSON input

sample_json = """
{
"dc_id": "dc-101",
"source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp":35,
      "c02_level": 1475,
      "geo": {"lat":38.00, "long":97.00}                        
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat":47.41, "long":-122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat":33.61, "long":-111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat":35.93, "long":-85.46}
    }
  }
}"""

Desired output

dc_id    source_name    id    description
-------------------------------------------------------------------------------
dc-101   sensor-igauge  10    Sensor attached to the container ceilings
dc-101   sensor-ipad    13    Sensor ipad attached to carbon cylinders
dc-101   sensor-inest    8    Sensor attached to the factory ceilings
dc-101   sensor-istick   5    Sensor embedded in exhaust pipes in the ceilings
PySpark code

from pyspark.sql.functions import *
df_sample_data = spark.read.json(sc.parallelize([sample_json]))
df_expanded = df_sample_data.withColumn("one_source",explode_outer(col("source")))
display(df_expanded)
Error

AnalysisException: cannot resolve 'explode(`source`)' due to data type mismatch: input to function explode should be array or map type, not struct


I put this notebook together to further demonstrate the challenge and to show the error clearly. I will be able to use it to test any suggestions offered here.
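As a quick way to see why explode() fails here (a minimal check, assuming the df_sample_data frame created above), printing the inferred schema shows that Spark reads source as a struct with one nested struct per sensor name, which is exactly the type explode() rejects:

df_sample_data.printSchema()
# root
#  |-- dc_id: string (nullable = true)
#  |-- source: struct (nullable = true)
#  |    |-- sensor-igauge: struct (nullable = true)
#  |    |-- ... (one nested struct per sensor)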

You cannot use explode on a struct, but you can get the column names inside the struct (using df.select("source.*").columns) and use a list comprehension to build an array of the required fields from each nested struct, then explode it to get the desired result:

from pyspark.sql import functions as F

df1 = df.select(
    "dc_id",
    F.explode(
        F.array(*[
            F.struct(
                F.lit(s).alias("source_name"),
                F.col(f"source.{s}.id").alias("id"),
                F.col(f"source.{s}.description").alias("description")
            )
            for s in df.select("source.*").columns
        ])
    ).alias("sources")
).select("dc_id", "sources.*")

df1.show(truncate=False)

#+------+-------------+---+------------------------------------------------+
#|dc_id |source_name  |id |description                                     |
#+------+-------------+---+------------------------------------------------+
#|dc-101|sensor-igauge|10 |Sensor attached to the container ceilings       |
#|dc-101|sensor-inest |8  |Sensor attached to the factory ceilings         |
#|dc-101|sensor-ipad  |13 |Sensor ipad attached to carbon cylinders        |
#|dc-101|sensor-istick|5  |Sensor embedded in exhaust pipes in the ceilings|
#+------+-------------+---+------------------------------------------------+
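As a complementary sketch (not part of the answer above): explode() does accept map columns, so another route is to read the JSON with an explicit schema that declares source as a MapType keyed by sensor name. This assumes every sensor entry shares the same fields, as they do in this sample:

from pyspark.sql import functions as F
from pyspark.sql.types import (
    DoubleType, IntegerType, MapType, StringType, StructField, StructType
)

# Shared shape of every sensor entry in this sample
geo_schema = StructType([
    StructField("lat", DoubleType()),
    StructField("long", DoubleType()),
])
sensor_schema = StructType([
    StructField("id", IntegerType()),
    StructField("ip", StringType()),
    StructField("description", StringType()),
    StructField("temp", IntegerType()),
    StructField("c02_level", IntegerType()),
    StructField("geo", geo_schema),
])
schema = StructType([
    StructField("dc_id", StringType()),
    # Declaring source as a map makes explode() directly applicable
    StructField("source", MapType(StringType(), sensor_schema)),
])

df_map = spark.read.schema(schema).json(sc.parallelize([sample_json]))

# explode() on a map yields one row per entry, with key/value columns
df_rows = (
    df_map
    .select("dc_id", F.explode("source").alias("source_name", "sensor"))
    .select("dc_id", "source_name", "sensor.id", "sensor.description")
)
df_rows.show(truncate=False)

This avoids hard-coding the field list in a comprehension, at the cost of maintaining an explicit schema.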

During testing I could not find a purpose for the * prefixing the struct list (F.array(*[...])), so I removed it and it had no effect. Could you say whether the * serves a purpose, or whether it might be a typo?

@BradHein, it is actually not a typo. It is there because the function takes variadic arguments (def array(*cols):). If you use an IDE like PyCharm, you should get the warning "Expected type 'Union[Column, str]', got 'List[Column]' instead" without it. I always prefer to add it.
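For illustration, both call styles build the same array column, since pyspark.sql.functions.array accepts either unpacked varargs or a single list (the column names a and b here are hypothetical):

from pyspark.sql import functions as F

cols = [F.col("a"), F.col("b")]
arr1 = F.array(*cols)  # unpacks the list into varargs, matching def array(*cols)
arr2 = F.array(cols)   # a single list also works, which is why removing * had no effect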