Scala - updating a DataFrame with nested fields in Spark
I have two dataframes, Df1 and Df2, as shown below. Here is the schema of dataframe Df1:
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
I want to join the two dataframes so that the output looks like this:
+------------------------------------------+---------+
|products |visitorId|
+------------------------------------------+---------+
|[[i1,0.68,Nike Shoes], [i2,0.42,Umbrella]]|v1 |
|[[i1,0.78,Nike Shoes], [i3,0.11,Jeans]] |v2 |
+------------------------------------------+---------+
This is the output schema I expect:
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
| | |-- name: string (nullable = true)
|-- visitorId: string (nullable = true)
How can I do this in Scala? I am using Spark 2.2.0.
Update
I exploded and joined the dataframes above and got the output below:
+---------+---+--------+----------+
|visitorId| id|interest| name|
+---------+---+--------+----------+
| v1| i1| 0.68|Nike Shoes|
| v1| i2| 0.42| Umbrella|
| v2| i1| 0.78|Nike Shoes|
| v2| i3| 0.11| Jeans|
+---------+---+--------+----------+
Now I just need the above dataframe in the following JSON format:
{
"visitorId": "v1",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.68
}, {
"id": "i2",
"name": "Umbrella",
"interest": 0.42
}]
},
{
"visitorId": "v2",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.78
}, {
"id": "i3",
"name": "Jeans",
"interest": 0.11
}]
}
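The regrouping needed to get from the flat table in the update back to this nested shape can be sketched in plain Scala, independent of Spark. The `Product` case class below is hypothetical (it is not part of the question's code), and the sample rows are copied from the flattened table above:

```scala
// Plain-Scala sketch of the nesting step: group the flattened
// (visitorId, id, interest, name) rows by visitor and collect the
// product fields into one list per visitor. `Product` is a
// hypothetical case class used only for this illustration.
case class Product(id: String, name: String, interest: Double)

val flat = Seq(
  ("v1", "i1", 0.68, "Nike Shoes"),
  ("v1", "i2", 0.42, "Umbrella"),
  ("v2", "i1", 0.78, "Nike Shoes"),
  ("v2", "i3", 0.11, "Jeans")
)

val nested: Map[String, Seq[Product]] =
  flat.groupBy(_._1).map { case (visitor, rows) =>
    visitor -> rows.map { case (_, id, interest, name) =>
      Product(id, name, interest)
    }
  }
// nested("v1") holds the two products of visitor v1
```

In Spark this same grouping is what `groupBy("visitorId")` plus `collect_list(struct(...))` expresses.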
Try this (explode, join, then collect_list):
scala> val df1 = Seq((Seq(("i1",0.68),("i2",0.42)), "v1"), (Seq(("i1",0.78),("i3",0.11)), "v2")).toDF("products", "visitorId" )
df1: org.apache.spark.sql.DataFrame = [products: array<struct<_1:string,_2:double>>, visitorId: string]
scala> df1.show(false)
+------------------------+---------+
|products |visitorId|
+------------------------+---------+
|[[i1, 0.68], [i2, 0.42]]|v1 |
|[[i1, 0.78], [i3, 0.11]]|v2 |
+------------------------+---------+
scala> val df2 = Seq(("i1", "Nike Shoes"),("i2", "Umbrella"), ("i3", "Jeans")).toDF("id", "name")
df2: org.apache.spark.sql.DataFrame = [id: string, name: string]
scala> df2.show(false)
+---+----------+
|id |name |
+---+----------+
|i1 |Nike Shoes|
|i2 |Umbrella |
|i3 |Jeans |
+---+----------+
scala> val withProductsDF = df1.withColumn("individualproducts", explode($"products")).select($"visitorId",$"products",$"individualproducts._1" as "id", $"individualproducts._2" as "interest")
withProductsDF: org.apache.spark.sql.DataFrame = [visitorId: string, products: array<struct<_1:string,_2:double>> ... 2 more fields]
scala> withProductsDF.show(false)
+---------+------------------------+---+--------+
|visitorId|products |id |interest|
+---------+------------------------+---+--------+
|v1 |[[i1, 0.68], [i2, 0.42]]|i1 |0.68 |
|v1 |[[i1, 0.68], [i2, 0.42]]|i2 |0.42 |
|v2 |[[i1, 0.78], [i3, 0.11]]|i1 |0.78 |
|v2 |[[i1, 0.78], [i3, 0.11]]|i3 |0.11 |
+---------+------------------------+---+--------+
scala> val withProductNamesDF = withProductsDF.join(df2, "id")
withProductNamesDF: org.apache.spark.sql.DataFrame = [id: string, visitorId: string ... 3 more fields]
scala> withProductNamesDF.show(false)
+---+---------+------------------------+--------+----------+
|id |visitorId|products |interest|name |
+---+---------+------------------------+--------+----------+
|i1 |v2 |[[i1, 0.78], [i3, 0.11]]|0.78 |Nike Shoes|
|i1 |v1 |[[i1, 0.68], [i2, 0.42]]|0.68 |Nike Shoes|
|i2 |v1 |[[i1, 0.68], [i2, 0.42]]|0.42 |Umbrella |
|i3 |v2 |[[i1, 0.78], [i3, 0.11]]|0.11 |Jeans |
+---+---------+------------------------+--------+----------+
scala> val outputDF = withProductNamesDF.groupBy("visitorId").agg(collect_list(struct($"id", $"name", $"interest")) as "products")
outputDF: org.apache.spark.sql.DataFrame = [visitorId: string, products: array<struct<id:string,name:string,interest:double>>]
scala> outputDF.toJSON.show(false)
+-----------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------+
|{"visitorId":"v2","products":[{"id":"i1","name":"Nike Shoes","interest":0.78},{"id":"i3","name":"Jeans","interest":0.11}]} |
|{"visitorId":"v1","products":[{"id":"i1","name":"Nike Shoes","interest":0.68},{"id":"i2","name":"Umbrella","interest":0.42}]}|
+-----------------------------------------------------------------------------------------------------------------------------+
It depends on your situation, but if the df2 lookup table is small enough, you can try collecting it as a Scala Map and using it inside a UDF. It then becomes quite simple:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Collect the small lookup table to the driver as a Map of id -> name
val m = df2.as[(String, String)].collect.toMap

// Rebuild each product struct with the looked-up name appended as a third field
val addName = udf((arr: Seq[Row]) => {
  arr.map(i => (i.getAs[String](0), i.getAs[Double](1), m(i.getAs[String](0))))
})

df1.withColumn("products", addName('products)).show(false)
+------------------------------------------+---------+
|products |visitorId|
+------------------------------------------+---------+
|[[i1,0.68,Nike Shoes], [i2,0.42,Umbrella]]|v1 |
|[[i1,0.78,Nike Shoes], [i3,0.11,Jeans]] |v2 |
+------------------------------------------+---------+
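One caveat with the map-in-a-UDF approach: `m(...)` throws `NoSuchElementException` for any product id that df2 does not contain. The lookup logic, with a safe default, can be sketched on plain Scala collections (the `"unknown"` fallback and the `i4` row are illustrative choices, not from the answer):

```scala
// The per-product enrichment the UDF performs, on plain collections.
// A direct m(id) lookup throws for an id absent from the lookup table;
// getOrElse supplies a fallback instead ("unknown" is illustrative).
val m = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")

val products = Seq(("i1", 0.68), ("i4", 0.10)) // i4 is absent from the lookup

val enriched = products.map { case (id, interest) =>
  (id, interest, m.getOrElse(id, "unknown"))
}
```

Inside the UDF the same `getOrElse` substitution would keep the job from failing on unmatched ids.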