Scala: how to replace a dataframe column with a column from another dataframe

I have two dataframes:

dataframe1
+----------+
|     DATE1|
+----------+
|2017-01-08|
|2017-10-10|
|2017-05-01|
+----------+
dataframe2
+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|  8/1/2017|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|10-10-2017|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|  1.5.2017|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+
Expected output:

+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|2017-01-08|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|2017-10-10|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|2017-05-01|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+
I want to replace the DATE1 column in dataframe2 with the DATE1 column from dataframe1. I need a generic solution. Any help would be appreciated.
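For context on what "generic" has to cover here: DATE1 mixes three layouts (`8/1/2017`, `10-10-2017`, `1.5.2017`). A minimal plain-Scala sketch of a tolerant parser, assuming day-first ordering (which is what maps `8/1/2017` to `2017-01-08` in the expected output); the name `normalizeDate` is illustrative:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.{Success, Try}

// Day-first patterns guessed from the sample rows; adjust if your
// data is actually month-first.
val patterns = Seq("d/M/yyyy", "d-M-yyyy", "d.M.yyyy", "yyyy-MM-dd")

// Try each pattern in turn; return the ISO yyyy-MM-dd form on success,
// None if no pattern matches.
def normalizeDate(s: String): Option[String] =
  patterns.view
    .map(p => Try(LocalDate.parse(s, DateTimeFormatter.ofPattern(p))))
    .collectFirst { case Success(d) => d.toString }
```

Wrapped in a UDF, something like this could rewrite DATE1 (or DATE2) in place without needing a second dataframe at all.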
I tried the following approach:
dataframe2.withColumn(newColumnTransformInfo._1, dataframe1.col("DATE1").cast(DateType))
But I got an error:

org.apache.spark.sql.AnalysisException: resolved attribute(s)

You cannot add a column from another dataframe that way.

What you can do is join the two dataframes and keep the columns you need. The two dataframes must share a common join column; if they don't, but the rows line up by order, you can assign an incrementing id to both dataframes and join on that.
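The common-column case is a single keyed join. A sketch, assuming (hypothetically) that dataframe1 also carried the SID key; the column names are assumptions for illustration:

```scala
// Hypothetical: dataframe1 has columns (SID, DATE1).
// Drop the stale DATE1 from dataframe2, then join the replacement in by key.
val replaced = dataframe2
  .drop("DATE1")
  .join(dataframe1.select("SID", "DATE1"), Seq("SID"), "left")
```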
Here is a simple example for your case:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._

// Dummy data
val df1 = Seq(
("2017-01-08"),
("2017-10-10"),
("2017-05-01")
).toDF("DATE1")
val df2 = Seq(
("Sayam", 22.0, "2017-01-08", "7 1 2017", 3223, "BHABHA"),
("ADARSH", 2.0, "2017-10-10", "10.03.2017", 222, "SUNSHINE"),
("SADIM", 1.0, "2017-05-01", "1/2/2017", 111, "DAV")
).toDF("NAME", "SID", "DATE1", "DATE2", "ROLL", "SCHOOL")
//create new Dataframe1 with new column id
val rows1 = df1.rdd.zipWithIndex().map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
val dataframe1 = spark.createDataFrame(rows1, StructType(StructField("id", LongType, false) +: df1.schema.fields))
//create new Dataframe2 with new column id
val rows2 = df2.rdd.zipWithIndex().map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
val dataframe2 = spark.createDataFrame(rows2, StructType(StructField("id", LongType, false) +: df2.schema.fields))
dataframe2.drop("DATE1")
.join(dataframe1, "id")
.drop("id").show()
Output:
+------+----+----------+----+--------+----------+
| NAME| SID| DATE2|ROLL| SCHOOL| DATE1|
+------+----+----------+----+--------+----------+
| Sayam|22.0| 7 1 2017|3223| BHABHA|2017-01-08|
|ADARSH| 2.0|10.03.2017| 222|SUNSHINE|2017-10-10|
| SADIM| 1.0| 1/2/2017| 111| DAV|2017-05-01|
+------+----+----------+----+--------+----------+
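An equivalent sketch that stays in the DataFrame API uses row_number over a window instead of zipWithIndex. Note that monotonically_increasing_id alone is not consecutive across partitions, so here it only serves as the ordering input to row_number, which in turn pulls the data through a single partition:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Tag both frames with a consecutive row number in their current order.
val w = Window.orderBy(monotonically_increasing_id())
val left  = df2.withColumn("id", row_number().over(w)).drop("DATE1")
val right = df1.withColumn("id", row_number().over(w))

left.join(right, "id").drop("id").show()
```

Both versions rely on the two dataframes having the same row order, which Spark does not guarantee in general; treat this as a last resort when no real join key exists.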
Hope this helps!
It's working now. I'll let you know if there are any issues with the other test cases.