Scala: joining two DataFrames by ID
I have two DataFrames in Scala:
df1 =
ID  start_date_time      field1  field2
1   2016-10-12 11:55:23  AAA     xxx1
2   2016-10-12 12:25:00  BBB     xxx2
3   2016-10-12 16:20:00  CCC     xxx3
and df2, which has the columns PK and start_date.
I need to add a new column to df1 whose value is 1 if the following condition holds and 0 if it fails:
If ID == PK and start_date_time refers to the same year, month and day as start_date.
The result should look like this:
df1 =
ID  start_date_time      check  field1  field2
1   2016-10-12-11-55-23  1      AAA     xxx1
2   2016-10-12-12-25-00  0      BBB     xxx2
3   2016-10-12-16-20-00  0      CCC     xxx3
I am using this solution:
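import org.apache.spark.sql.functions.{lit, to_date}
// The $"..." syntax needs spark.implicits._ (already in scope in spark-shell).
// Truncate both timestamps to a calendar date so they can be compared by day.
val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
  withColumn("check", lit(1)).
  select($"PK".as("ID"), $"date", $"check", $"field1", $"field2"))
// Unmatched df1 rows get null in "check"; na.fill(0) turns those into 0.
df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show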
However, in select($"PK".as("ID"), $"date", $"check", $"field1", $"field2"), is it possible to avoid explicitly listing all the column names from df1?
Could it be done like this: select($"PK".as("ID"), $"date", $"check", *)?
If you do not need to drop any extra columns, you can simply rename the PK column:
val df2_date = df2.withColumn("date", to_date(df2("start_date"))).withColumn("check", lit(1)).withColumnRenamed("PK", "ID")
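The rename-based suggestion could be wired into the full join roughly as follows. This is only a sketch under assumptions: the object name JoinCheckSketch and the helpers addCheck and renamePk are hypothetical, df2 is assumed to contain PK and start_date, and the dropDuplicates call and the column-restricted na.fill are additions that do not appear in the original snippets.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, to_date}

// Hypothetical helper object; names are illustrative only.
object JoinCheckSketch {

  // Adds a 0/1 "check" column to df1: 1 when df2 has a row with the same
  // key (PK == ID) and the same calendar day, 0 otherwise.
  def addCheck(df1: DataFrame, df2: DataFrame): DataFrame = {
    val df1Date = df1.withColumn("date", to_date(col("start_date_time")))

    val flags = df2
      .withColumn("date", to_date(col("start_date")))
      .withColumn("check", lit(1))
      .withColumnRenamed("PK", "ID")  // no select, other columns are kept as-is
      .drop("start_date")             // assumption: not wanted in the result
      .dropDuplicates("ID", "date")   // assumption: avoid multiplying df1 rows

    df1Date
      .join(flags, Seq("ID", "date"), "left")
      .drop("date")
      .na.fill(0, Seq("check"))       // only "check" is filled, not every numeric column
  }

  // Alternative that stays close to the select(..., *) idea from the question:
  // build the column list programmatically and alias only PK.
  def renamePk(df2: DataFrame): DataFrame =
    df2.select(df2.columns.map(c => if (c == "PK") col(c).as("ID") else col(c)): _*)
}
```

Calling JoinCheckSketch.addCheck(df1, df2).show would then print df1 with the extra check column appended as the last column. Restricting na.fill to "check" is a deliberate choice here: a bare na.fill(0) would also overwrite nulls in any other numeric column of the result.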