Converting two DataFrames in Spark SQL
I have registered two DataFrames as tables in Spark Scala. Starting from these two tables:

Table 1:
+-----+--------+
| id  | values |
+-----+--------+
| 0   | v1     |
| 0   | v2     |
| 1   | v3     |
| 1   | v1     |
+-----+--------+
Table 2:
+-----+----+----+----+
| id  | v1 | v2 | v3 |
+-----+----+----+----+
| 0   | a1 | b1 | -  |
| 1   | a2 | -  | c2 |
+-----+----+----+----+
Using the two tables above, I want to generate a new table:

Table 3:
+-----+--------+--------+
| id  | values | field  |
+-----+--------+--------+
| 0   | v1     | a1     |
| 0   | v2     | b1     |
| 1   | v3     | c2     |
| 1   | v1     | a2     |
+-----+--------+--------+
Here v1 has the following form:
v1: struct (nullable = true)
| |-- level1: string (nullable = true)
| |-- level2: string (nullable = true)
| |-- level3: string (nullable = true)
| |-- level4: string (nullable = true)
| |-- level5: string (nullable = true)
I am using Spark SQL with Scala. Is it possible to achieve this by writing a SQL query or by using some Spark functions on the DataFrames?

Here is sample code you can use; it generates the desired output:
// Build the two example DataFrames (run in spark-shell; in a standalone
// app you also need import spark.implicits._ for toDF)
val df1 = sc.parallelize(Seq((0, "v1"), (0, "v2"), (1, "v3"), (1, "v1"))).toDF("id", "values")
val df2 = sc.parallelize(Seq((0, "a1", "b1", "-"), (1, "a2", "-", "b2"))).toDF("id", "v1", "v2", "v3")

// Join on id so every row carries the candidate columns v1, v2, v3
val joinedDF = df1.join(df2, "id")

// Treat the string in "values" as a column name and look that column up
val resultDF = joinedDF.rdd.map { row =>
  val id     = row.getAs[Int]("id")
  val values = row.getAs[String]("values")
  val fields = row.getAs[String](values) // dynamic lookup by column name
  (id, values, fields)
}.toDF("id", "values", "fields")
I tested this on the console; the full spark-shell transcript appears after the comments below.
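As an aside, the same dynamic lookup can be written without dropping to the RDD API. This is only a sketch, not part of the original answer; it assumes the candidate columns are exactly the non-id columns of df2:

import org.apache.spark.sql.functions.{coalesce, col, when}

// Candidate column names, taken from df2's schema
val valueCols = df2.columns.filter(_ != "id")

// For each candidate column c, yield its value only on rows where
// "values" equals c; coalesce then keeps the single non-null match
val fieldCol = coalesce(valueCols.map(c => when(col("values") === c, col(c))): _*)

val resultDF2 = joinedDF.select(col("id"), col("values"), fieldCol.as("field"))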
I hope this solves your problem. Thanks.

Comments:
I tried an insert into table1 (field, values) selecting the column name from table2 where table1.id = table2.id, but for that I need to pick the column name from table 1 dynamically.
Is the list of column names in table 2 fixed?
The number of column names is not known beforehand; there are as many of them as there are distinct values in column 2. @eliasah
I would like to see your own effort at solving this problem.
If my values are of struct type instead of string, what changes when extracting them?
Then use struct instead of string and convert it accordingly.
Thanks for the help. Here v1, v2 and v3 are of struct type, with subfields level1: string, level2: string, and so on. Now when I try getAs[Struct](values) it says the type Struct cannot be found... how do I do this?
Check the following:
scala> val df1=sc.parallelize(Seq((0,"v1"),(0,"v2"),(1,"v3"),(1,"v1"))).toDF("id","values")
df1: org.apache.spark.sql.DataFrame = [id: int, values: string]
scala> df1.show
+---+------+
| id|values|
+---+------+
| 0| v1|
| 0| v2|
| 1| v3|
| 1| v1|
+---+------+
scala> val df2=sc.parallelize(Seq((0,"a1","b1","-"),(1,"a2","-","b2"))).toDF("id","v1","v2","v3")
df2: org.apache.spark.sql.DataFrame = [id: int, v1: string ... 2 more fields]
scala> df2.show
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| 0| a1| b1| -|
| 1| a2| -| b2|
+---+---+---+---+
scala> val joinedDF=df1.join(df2,"id")
joinedDF: org.apache.spark.sql.DataFrame = [id: int, values: string ... 3 more fields]
scala> joinedDF.show
+---+------+---+---+---+
| id|values| v1| v2| v3|
+---+------+---+---+---+
| 1| v3| a2| -| b2|
| 1| v1| a2| -| b2|
| 0| v1| a1| b1| -|
| 0| v2| a1| b1| -|
+---+------+---+---+---+
scala> val resultDF=joinedDF.rdd.map{row=>
| val id=row.getAs[Int]("id")
| val values=row.getAs[String]("values")
| val fields=row.getAs[String](values)
| (id,values,fields)
| }.toDF("id","values","fields")
resultDF: org.apache.spark.sql.DataFrame = [id: int, values: string ... 1 more field]
scala>
scala> resultDF.show
+---+------+------+
| id|values|fields|
+---+------+------+
| 1| v3| b2|
| 1| v1| a2|
| 0| v1| a1|
| 0| v2| b1|
+---+------+------+
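For the struct case raised in the last comment: Spark's Row API has no getAs[Struct]; a struct column comes back as a nested Row, which you can read with getAs[Row] and then pull subfields from. A minimal sketch, assuming v1, v2 and v3 are structs with the level1..level5 schema shown in the question:

import org.apache.spark.sql.Row

val structResultDF = joinedDF.rdd.map { row =>
  val id     = row.getAs[Int]("id")
  val values = row.getAs[String]("values")
  // struct columns are returned as nested Rows
  val field  = row.getAs[Row](values)
  // read one subfield, guarding against a null struct
  val level1 = Option(field).map(_.getAs[String]("level1")).orNull
  (id, values, level1)
}.toDF("id", "values", "level1")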