
Flatten a nested dataframe in PySpark into columns


Hi, I have JSON data that I am reading in PySpark; a sample is below:

{
    "data": [
        ["row-r9pv-p86t.ifsp", "00000000-0000-0000-0838-60C2FFCC43AE", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "KINGS", "F", "11"],
        ["row-7v2v~88z5-44se", "00000000-0000-0000-C8FC-DDD3F9A72DFF", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "SUFFOLK", "F", "6"],
        ["row-hzc9-4kvv~mbc9", "00000000-0000-0000-562E-D9A0792557FC", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "MONROE", "F", "6"]
    ]
}
I am trying to explode the array so that each record becomes its own row of the dataframe, which gives the following:

from pyspark.sql.functions import explode

df = spark.read.json('data/rows.json', multiLine=True)
temp_df = df.select(explode("data").alias("data"))
temp_df.show(n=3, truncate=False)
Result:

+-----------------------------------------------------------------------------------------------------------------------+
|data                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------+
|[row-r9pv-p86t.ifsp, 00000000-0000-0000-0838-60C2FFCC43AE, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, KINGS, F, 11] |
|[row-7v2v~88z5-44se, 00000000-0000-0000-C8FC-DDD3F9A72DFF, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, SUFFOLK, F, 6]|
|[row-hzc9-4kvv~mbc9, 00000000-0000-0000-562E-D9A0792557FC, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, MONROE, F, 6] |
+-----------------------------------------------------------------------------------------------------------------------+
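For reference, a minimal printSchema check of what Spark inferred for the data column (the commented output is an assumption, consistent with the error described next):

temp_df.printSchema()
# Assumed output: the exploded column is a plain array of strings
# root
#  |-- data: array (nullable = true)
#  |    |-- element: string (containsNull = true)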
So far so good, but when I try to use the flatten function to flatten the array inside each row of the dataframe, it fails with the error:

    cannot resolve 'flatten(`data`)' due to data type mismatch: The argument should be an array of arrays, but 'data' is of array type.

That makes sense, but I am not sure how to flatten the array.

Should I write a custom map function to map the row array into dataframe columns?
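For context, a minimal sketch with toy data (assuming Spark 2.4+, where flatten was added) of the distinction flatten enforces: it collapses an array of arrays, so a plain array of strings is rejected:

from pyspark.sql.functions import flatten

# array<array<bigint>> column: flatten works and returns [1, 2, 3]
nested = spark.createDataFrame([([[1, 2], [3]],)], ["data"])
nested.select(flatten("data")).show()

# array<string> column (like the exploded "data" above): calling flatten on it
# raises the same "argument should be an array of arrays" analysis error
flat = spark.createDataFrame([(["a", "b", "c"],)], ["data"])
# flat.select(flatten("data")).show()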

Answering my own question, so it can help anyone who needs it.

  • Read the source data from the file (the read and explode shown in the question)
  • Result:

    +-----------------------------------------------------------------------------------------------------------------------+
    |data                                                                                                                   |
    +-----------------------------------------------------------------------------------------------------------------------+
    |[row-r9pv-p86t.ifsp, 00000000-0000-0000-0838-60C2FFCC43AE, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, KINGS, F, 11] |
    |[row-7v2v~88z5-44se, 00000000-0000-0000-C8FC-DDD3F9A72DFF, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, SUFFOLK, F, 6]|
    |[row-hzc9-4kvv~mbc9, 00000000-0000-0000-562E-D9A0792557FC, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, MONROE, F, 6] |
    +-----------------------------------------------------------------------------------------------------------------------+
    
    In the dataframe above, each cell holds an array of strings; what I need is each element in its own column, with a specific data type:

    df = temp_df.withColumn("sid", temp_df["data"].getItem(0).cast(StringType())) \
           .withColumn("id", temp_df["data"].getItem(1).cast(IntegerType())) \
           .withColumn("position", temp_df["data"].getItem(2).cast(IntegerType())) \
           .withColumn("created_at", temp_df["data"].getItem(3).cast(TimestampType())) \
           .withColumn("created_meta", temp_df["data"].getItem(4).cast(StringType())) \
           .withColumn("updated_at", temp_df["data"].getItem(5).cast(TimestampType())) \
           .withColumn("updated_meta", temp_df["data"].getItem(6).cast(StringType())) \
           .withColumn("meta", temp_df["data"].getItem(7).cast(StringType())) \
           .withColumn("Year", (temp_df["data"].getItem(8)).cast(IntegerType())) \
           .withColumn("First Name", temp_df["data"].getItem(9).cast(StringType())) \
           .withColumn("County", temp_df["data"].getItem(10).cast(StringType())) \
           .withColumn("Sex", temp_df["data"].getItem(11).cast(StringType())) \
           .withColumn("Count", temp_df["data"].getItem(12).cast(IntegerType())) \
           .drop("data")
    df.show()
    df.printSchema()

    +------------------+----+--------+----------+------------+----------+------------+----+----+----------+-------+---+-----+
    |               sid|  id|position|created_at|created_meta|updated_at|updated_meta|meta|Year|First Name| County|Sex|Count|
    +------------------+----+--------+----------+------------+----------+------------+----+----+----------+-------+---+-----+
    |row-r9pv-p86t.ifsp|null|       0|      null|        null|      null|        null| { }|2007|      ZOEY|  KINGS|  F|   11|
    |row-7v2v~88z5-44se|null|       0|      null|        null|      null|        null| { }|2007|      ZOEY|SUFFOLK|  F|    6|
    |row-hzc9-4kvv~mbc9|null|       0|      null|        null|      null|        null| { }|2007|      ZOEY| MONROE|  F|    6|
    +------------------+----+--------+----------+------------+----------+------------+----+----+----------+-------+---+-----+

    ==================== SCHEMA ====================

    root
     |-- sid: string (nullable = true)
     |-- id: integer (nullable = true)
     |-- position: integer (nullable = true)
     |-- created_at: timestamp (nullable = true)
     |-- created_meta: string (nullable = true)
     |-- updated_at: timestamp (nullable = true)
     |-- updated_meta: string (nullable = true)
     |-- meta: string (nullable = true)
     |-- Year: integer (nullable = true)
     |-- First Name: string (nullable = true)
     |-- County: string (nullable = true)
     |-- Sex: string (nullable = true)
     |-- Count: integer (nullable = true)
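    The thirteen chained withColumn calls can also be generated from a list of (name, type) pairs in a single select; a sketch of that equivalent variant, with the names and types copied from the code above:

    from pyspark.sql.functions import col
    from pyspark.sql.types import StringType, IntegerType, TimestampType

    fields = [
        ("sid", StringType()), ("id", IntegerType()), ("position", IntegerType()),
        ("created_at", TimestampType()), ("created_meta", StringType()),
        ("updated_at", TimestampType()), ("updated_meta", StringType()),
        ("meta", StringType()), ("Year", IntegerType()), ("First Name", StringType()),
        ("County", StringType()), ("Sex", StringType()), ("Count", IntegerType()),
    ]

    # extract element i from the array, cast it, and rename it, all in one select
    df = temp_df.select([
        col("data").getItem(i).cast(dtype).alias(name)
        for i, (name, dtype) in enumerate(fields)
    ])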
    

    Thanks for your input. I tried the same approach, but the problem is that the item would have to be of struct type, while I have an array of strings. Below is the error: "Can't extract value from data#469[getItem(0)]: need struct type but got string". FYI, I am using PySpark; I have added the schema image at the end of the question.

    Sorry, it is not in PySpark; please see version 2.

    V 2 (with column and concat):

      // needed for .toDF on a local Seq and for the concat_ws / split calls below
      import spark.implicits._
      import org.apache.spark.sql.functions.{concat_ws, split}

      val sourceDF = Seq(
        Array("row-r9pv-p86t.ifsp", "00000000-0000-0000-0838-60C2FFCC43AE", "0", "1574264158", "", "1574264158", "", "{ }", "2007", "ZOEY", "KINGS", "F", "11"),
        Array("row-7v2v~88z5-44se", "00000000-0000-0000-C8FC-DDD3F9A72DFF", "0", "1574264158", "", "1574264158", "", "{ }", "2007", "ZOEY", "SUFFOLK", "F", "6"),
        Array("row-hzc9-4kvv~mbc9", "00000000-0000-0000-562E-D9A0792557FC", "0", "1574264158", "", "1574264158", "", "{ }", "2007", "ZOEY", "MONROE", "F", "6")
      ).toDF("dataColumn")
    
      sourceDF.show(false)
    
    //  +-------------------------------------------------------------------------------------------------------------------------+
    //  |dataColumn                                                                                                               |
    //  +-------------------------------------------------------------------------------------------------------------------------+
    //  |[row-r9pv-p86t.ifsp, 00000000-0000-0000-0838-60C2FFCC43AE, 0, 1574264158, , 1574264158, , { }, 2007, ZOEY, KINGS, F, 11] |
    //  |[row-7v2v~88z5-44se, 00000000-0000-0000-C8FC-DDD3F9A72DFF, 0, 1574264158, , 1574264158, , { }, 2007, ZOEY, SUFFOLK, F, 6]|
    //  |[row-hzc9-4kvv~mbc9, 00000000-0000-0000-562E-D9A0792557FC, 0, 1574264158, , 1574264158, , { }, 2007, ZOEY, MONROE, F, 6] |
    //  +-------------------------------------------------------------------------------------------------------------------------+
    
    
      val df1 = sourceDF
        .withColumn("dataString", concat_ws(", ", 'dataColumn))
        .select('dataString)
    
      df1.printSchema()
    
      df1.show(false)
    //  root
    //  |-- dataString: string (nullable = false)
    //
    //  +-----------------------------------------------------------------------------------------------------------------------+
    //  |dataString                                                                                                             |
    //  +-----------------------------------------------------------------------------------------------------------------------+
    //  |row-r9pv-p86t.ifsp, 00000000-0000-0000-0838-60C2FFCC43AE, 0, 1574264158, , 1574264158, , { }, 2007, ZOEY, KINGS, F, 11 |
    //  |row-7v2v~88z5-44se, 00000000-0000-0000-C8FC-DDD3F9A72DFF, 0, 1574264158, , 1574264158, , { }, 2007, ZOEY, SUFFOLK, F, 6|
    //  |row-hzc9-4kvv~mbc9, 00000000-0000-0000-562E-D9A0792557FC, 0, 1574264158, , 1574264158, , { }, 2007, ZOEY, MONROE, F, 6 |
    //  +-----------------------------------------------------------------------------------------------------------------------+
    
      val df2 = df1.select(
        split('dataString, ", ").getItem(0).alias("c0"),
        split('dataString, ", ").getItem(1).alias("c1"),
        split('dataString, ", ").getItem(2).alias("c2"),
        split('dataString, ", ").getItem(3).alias("c3"),
        split('dataString, ", ").getItem(4).alias("c4"),
        split('dataString, ", ").getItem(5).alias("c5"),
        split('dataString, ", ").getItem(6).alias("c6"),
        split('dataString, ", ").getItem(7).alias("c7"),
        split('dataString, ", ").getItem(8).alias("c8"),
        split('dataString, ", ").getItem(9).alias("c9"),
        split('dataString, ", ").getItem(10).alias("c10"),
        split('dataString, ", ").getItem(11).alias("c11"),
        split('dataString, ", ").getItem(12).alias("c12")
      )
      df2.printSchema()
    //  root
    //  |-- c0: string (nullable = true)
    //  |-- c1: string (nullable = true)
    //  |-- c2: string (nullable = true)
    //  |-- c3: string (nullable = true)
    //  |-- c4: string (nullable = true)
    //  |-- c5: string (nullable = true)
    //  |-- c6: string (nullable = true)
    //  |-- c7: string (nullable = true)
    //  |-- c8: string (nullable = true)
    //  |-- c9: string (nullable = true)
    //  |-- c10: string (nullable = true)
    //  |-- c11: string (nullable = true)
    //  |-- c12: string (nullable = true)
    
      df2.show(false)
    //  +------------------+------------------------------------+---+----------+---+----------+---+---+----+----+-------+---+---+
    //  |c0                |c1                                  |c2 |c3        |c4 |c5        |c6 |c7 |c8  |c9  |c10    |c11|c12|
    //  +------------------+------------------------------------+---+----------+---+----------+---+---+----+----+-------+---+---+
    //  |row-r9pv-p86t.ifsp|00000000-0000-0000-0838-60C2FFCC43AE|0  |1574264158|   |1574264158|   |{ }|2007|ZOEY|KINGS  |F  |11 |
    //  |row-7v2v~88z5-44se|00000000-0000-0000-C8FC-DDD3F9A72DFF|0  |1574264158|   |1574264158|   |{ }|2007|ZOEY|SUFFOLK|F  |6  |
    //  |row-hzc9-4kvv~mbc9|00000000-0000-0000-562E-D9A0792557FC|0  |1574264158|   |1574264158|   |{ }|2007|ZOEY|MONROE |F  |6  |
    //  +------------------+------------------------------------+---+----------+---+----------+---+---+----+----+-------+---+---+
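    Since the question is in PySpark, here is a rough PySpark sketch of the same V 2 idea (concatenate the array into one string, then split it back into columns); the column names follow the Scala example and this is not code from the original answer:

    from pyspark.sql.functions import concat_ws, split, col

    # temp_df is the exploded dataframe from the question; "data" is array<string>
    df1 = temp_df.withColumn("dataString", concat_ws(", ", col("data"))).select("dataString")
    df1.printSchema()

    # split the concatenated string and give each piece its own column c0..c12
    df2 = df1.select([
        split(col("dataString"), ", ").getItem(i).alias(f"c{i}") for i in range(13)
    ])
    df2.show(truncate=False)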
    