Apache Spark: combining two DataFrames as separate sub-columns of one DataFrame in PySpark

I want to put two DataFrames into a single DataFrame, so that each becomes a separate sub-column rather than the result of joining the two. I have two DataFrames, stat1_df and stat2_df, which look like this:

root
 |-- max_scenes: integer (nullable = true)
 |-- median_scenes: double (nullable = false)
 |-- avg_scenes: double (nullable = true)

+----------+-------------+------------------+
|max_scenes|median_scenes|avg_scenes        |
+----------+-------------+------------------+
|97        |7.0          |10.806451612903226|
|97        |7.0          |10.806451612903226|
|97        |7.0          |10.806451612903226|
|97        |7.0          |10.806451612903226|
+----------+-------------+------------------+


root
 |-- max: double (nullable = true)
 |-- type: string (nullable = true)

+-----+-----------+
|max  |type       |
+-----+-----------+
|10.0 |small      |
|25.0 |medium     |
|50.0 |large      |
|250.0|extra_large|
+-----+-----------+
I want the result to be:

root
 |-- some_statistics1: struct (nullable = true)
 |    |-- max_scenes: integer (nullable = true)
 |    |-- median_scenes: double (nullable = false)
 |    |-- avg_scenes: double (nullable = true)
 |-- some_statistics2: struct (nullable = true)
 |    |-- max: double (nullable = true)
 |    |-- type: string (nullable = true)
Is there a way to combine the two like this? stat1_df and stat2_df are plain DataFrames, with no arrays or nested columns. The final DataFrame is written to MongoDB. If there are any other questions, I'll answer them here.
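For the final write, a minimal sketch, assuming the MongoDB Spark Connector (versions 2.x/3.x register the short name "mongo") is on the classpath; final_df and the database/collection names are placeholders, not from the original post:

# Placeholder names: final_df is the combined DataFrame, "mydb"/"stats" are
# the target database and collection. Requires the MongoDB Spark Connector.
final_df.write.format("mongo") \
    .mode("append") \
    .option("database", "mydb") \
    .option("collection", "stats") \
    .save()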

Check the code below.

Add an id column to both DataFrames, move all of the columns into a struct, and then join the two DataFrames on that id. (The transcript below uses the Scala spark-shell; a PySpark sketch of the same approach follows after it.)

scala> val dfa = Seq(("10","8.9","7.9")).toDF("max_scenes","median_scenes","avg_scenes")
dfa: org.apache.spark.sql.DataFrame = [max_scenes: string, median_scenes: string ... 1 more field]

scala> dfa.show(false)
+----------+-------------+----------+
|max_scenes|median_scenes|avg_scenes|
+----------+-------------+----------+
|10        |8.9          |7.9       |
+----------+-------------+----------+


scala> dfa.printSchema
root
 |-- max_scenes: string (nullable = true)
 |-- median_scenes: string (nullable = true)
 |-- avg_scenes: string (nullable = true)


scala> val mdfa = dfa.select(struct($"*").as("some_statistics1")).withColumn("id",monotonically_increasing_id)
mdfa: org.apache.spark.sql.DataFrame = [some_statistics1: struct<max_scenes: string, median_scenes: string ... 1 more field>, id: bigint]

scala> mdfa.printSchema
root
 |-- some_statistics1: struct (nullable = false)
 |    |-- max_scenes: string (nullable = true)
 |    |-- median_scenes: string (nullable = true)
 |    |-- avg_scenes: string (nullable = true)
 |-- id: long (nullable = false)


scala> mdfa.show(false)
+----------------+---+
|some_statistics1|id |
+----------------+---+
|[10,8.9,7.9]    |0  |
+----------------+---+


scala> val dfb = Seq(("11.2","sample")).toDF("max","type")
dfb: org.apache.spark.sql.DataFrame = [max: string, type: string]

scala> dfb.printSchema
root
 |-- max: string (nullable = true)
 |-- type: string (nullable = true)


scala> dfb.show(false)
+----+------+
|max |type  |
+----+------+
|11.2|sample|
+----+------+


scala> val mdfb = dfb.select(struct($"*").as("some_statistics2")).withColumn("id",monotonically_increasing_id)
mdfb: org.apache.spark.sql.DataFrame = [some_statistics2: struct<max: string, type: string>, id: bigint]

scala> mdfb.printSchema
root
 |-- some_statistics2: struct (nullable = false)
 |    |-- max: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- id: long (nullable = false)


scala> mdfb.show(false)
+----------------+---+
|some_statistics2|id |
+----------------+---+
|[11.2,sample]   |0  |
+----------------+---+


scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").printSchema
root
 |-- some_statistics1: struct (nullable = false)
 |    |-- max_scenes: string (nullable = true)
 |    |-- median_scenes: string (nullable = true)
 |    |-- avg_scenes: string (nullable = true)
 |-- some_statistics2: struct (nullable = false)
 |    |-- max: string (nullable = true)
 |    |-- type: string (nullable = true)


scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").show(false)
+----------------+----------------+
|some_statistics1|some_statistics2|
+----------------+----------------+
|[10,8.9,7.9]    |[11.2,sample]   |
+----------------+----------------+
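Since the question is tagged pyspark, here is a minimal PySpark sketch of the same approach; the sample values are placeholders mirroring the Scala transcript, and the caveat discussed in the comments below still applies (monotonically_increasing_id is only guaranteed to line up across two DataFrames when each is a single partition):

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, struct

spark = SparkSession.builder.getOrCreate()

# Placeholder data mirroring the Scala transcript above.
dfa = spark.createDataFrame([("10", "8.9", "7.9")],
                            ["max_scenes", "median_scenes", "avg_scenes"])
dfb = spark.createDataFrame([("11.2", "sample")], ["max", "type"])

# Move all columns into a struct and add a join key.
mdfa = dfa.select(struct(*dfa.columns).alias("some_statistics1")) \
          .withColumn("id", monotonically_increasing_id())
mdfb = dfb.select(struct(*dfb.columns).alias("some_statistics2")) \
          .withColumn("id", monotonically_increasing_id())

result = mdfa.join(mdfb, ["id"], "inner").drop("id")
result.printSchema()
result.show(truncate=False)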

How do you know which rows of stat1_df and stat2_df belong to the same row?

They don't belong to the same row. These are two completely different DataFrames. If you look at the schema, you'll see that the two DataFrames should become separate sub-columns.

In that case it isn't a DataFrame. A DataFrame is a two-dimensional structure made of rows and columns. For example, when you show 5 rows of the outer DataFrame, you will need to show 5 rows of each sub-DataFrame. Maybe you want to cross join them, or create a wrapper class containing both DataFrames? Please explain your use case. How many rows do your DataFrames have?

I edited my question, so it should be clearer now.

It still doesn't answer my question. How do you determine which rows (not columns) belong together? How many rows do you have?

I edited the question and added the schemas of the DataFrames, so please take a look; they are not nested, they are plain DataFrames (talking about the difference between your example and mine).

Yes, that's closer, but the problem is that I got everything under some_statistics1.

I edited the question again to show sample data, so I don't have one row per table, but 4.

Are you using the solution above? If so, could you post the output of that solution in your question?

Yes, I used your solution and got the result I needed. Thanks! (I had a typo, which is what my earlier comment was referring to.)
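As the comments point out, with more than one row the pairing matters: monotonically_increasing_id only guarantees unique, increasing ids per partition, so the ids of two independently built DataFrames need not match. A sketch of a more deterministic pairing, assuming both frames are small enough that a single-partition window is acceptable (with_row_id is a hypothetical helper, not from the answer above):

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number, struct

def with_row_id(df, struct_name):
    # Wrap all columns into a single struct and add a consecutive positional id.
    # Window.orderBy without partitionBy pulls the data into one partition,
    # which is acceptable for small statistics frames like these.
    w = Window.orderBy(monotonically_increasing_id())
    return df.select(struct(*df.columns).alias(struct_name)) \
             .withColumn("id", row_number().over(w))

combined = with_row_id(stat1_df, "some_statistics1") \
    .join(with_row_id(stat2_df, "some_statistics2"), ["id"], "inner") \
    .drop("id")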