Not enough arguments for method when: Spark/Scala DataFrame
scala, apache-spark, apache-spark-sql
I have two DataFrames in Spark, df1 and df2. I join them on a common column (id), then add an extra column, Result, that checks several columns. If any of those columns match between the two DataFrames, the new column should contain "Matched"; if none of the conditions match, it should contain "Not Matched". I wrote the code below:
df1.join(df1,df2("id") === df2("id"))
.withColumn("Result",
when(
df1("adhar_no") === df2("adhar_no")" ||
df1("pan_no") === df2("pan_no") ||
df1("Voter_id") === df2("Voter_id") ||
df1("DL_no") === df2("DL_no"),"Matched"
).otherwise("Not Matched"))
But getting error
<console>:60: error: not enough arguments for method when: (condition: org.apache.spark.sql.Column, value: Any)org.apache.spark.sql.Column. Unspecified value parameter value.
I have also tried below code
df1.join(df2,df1("id") === df2("id"))
.withColumn("Result",when(df1("adhar_no") === df2("adhar_no") ||
when(df1("pan_no") === df2("pan_no") ||
when(df1("Voter_id") === df2("Voter_id") ||
when(df1("DL_no") === df2("DL_no"),"Matched"))))
.otherwise("Not Matched"))
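In both cases I get an error. Can anyone help me with how to do this?

The first attempt fails because there is an extra double quote (") on line 4 of the first snippet.
This will work fine: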
df1.join(df2, df1("id") === df2("id"))
.withColumn("Result",
when(
df1("adhar_no") === df2("adhar_no") ||
df1("pan_no") === df2("pan_no") ||
df1("Voter_id") === df2("Voter_id") ||
df1("DL_no") === df2("DL_no"),"Matched"
).otherwise("Not Matched"))
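The second attempt fails because every when must have an output value, so that example makes no sense to me. The first one is fine, but you need to remove the extra " (I assume it is a typo).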
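To illustrate the point that every when needs its own output value, here is a minimal sketch of chaining several when calls. The sample data, the column names (adhar_l, adhar_r, pan_l, pan_r) and the per-branch labels are hypothetical and not from the question; the local SparkSession is only there so the snippet can run as a standalone script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[1]").appName("when-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: left/right values side by side.
val pairs = Seq(
  (1, 10, 10, 100, 100), // both columns match
  (2, 20, 20, 200, 999), // only adhar matches
  (3, 30, 99, 300, 300), // only pan matches
  (4, 40, 99, 400, 999)  // nothing matches
).toDF("id", "adhar_l", "adhar_r", "pan_l", "pan_r")

// Each when(...) takes a condition AND a value; branches are tried in order.
val labelled = pairs.withColumn("Result",
  when(col("adhar_l") === col("adhar_r"), "Matched on adhar")
    .when(col("pan_l") === col("pan_r"), "Matched on pan")
    .otherwise("Not Matched"))
```

If a single label is enough, combining the conditions with || inside one when, as in the first attempt, is the simpler choice; chaining is only useful when each branch should produce a different value.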
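Also, as a personal preference or suggestion, I prefer to refer to columns with the dollar syntax. It is clearer to me and helps me avoid this kind of typo.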
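For readers who have not used the dollar syntax, the sketch below shows that it needs the session's implicits in scope, and that it refers to the same column as the other two common forms. The DataFrame and its column names are hypothetical, and the local SparkSession is an assumption made so the snippet is self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[1]").appName("dollar-sketch").getOrCreate()
import spark.implicits._ // brings the $"..." interpolator and toDF into scope

// Hypothetical sample data.
val people = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Three ways to reference the same column.
val viaApply  = people.select(people("id")) // bound to this DataFrame
val viaCol    = people.select(col("id"))    // resolved by name
val viaDollar = people.select($"id")        // resolved by name, shorter to type
```

After a join where both sides share column names, the name-based forms need an alias prefix such as $"df1.id" to stay unambiguous.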
Edit with an example
Some dummy test DataFrames:
val df1 = List((1, 10, 100, 1000, 10000), (2, 20, 200, 2000, 20000), (3, 30, 300, 3000, 30000)).toDF("id","adhar_no", "pan_no", "Voter_id", "DL_no")
val df2 = List((1, 10, 100, 1000, 10000), (2, 20, 200, 2000, 20000), (4, 40, 400, 4000, 40000)).toDF("id","adhar_no", "pan_no", "Voter_id", "DL_no")
Then, fixing the ambiguity in your code:
df1.as("df1").join(df2.as("df2"), df1("id") === df2("id"))
.withColumn("Result", when(
$"df1.adhar_no" === $"df2.adhar_no" ||
$"df1.pan_no" === $"df2.pan_no" ||
$"df1.Voter_id" === $"df2.Voter_id" ||
$"df1.DL_no" === $"df2.DL_no"
, "Matched"
).otherwise("Not Matched")
)
+---+--------+------+--------+-----+---+--------+------+--------+-----+-------+
| id|adhar_no|pan_no|Voter_id|DL_no| id|adhar_no|pan_no|Voter_id|DL_no| Result|
+---+--------+------+--------+-----+---+--------+------+--------+-----+-------+
|  1|      10|   100|    1000|10000|  1|      10|   100|    1000|10000|Matched|
|  2|      20|   200|    2000|20000|  2|      20|   200|    2000|20000|Matched|
+---+--------+------+--------+-----+---+--------+------+--------+-----+-------+
You are joining the same DataFrame, df1.join(df1). How are you accessing df2?
Yes, that was a typo. I have updated the code, but I still get the same error.
After fixing some other errors it runs fine for me. I will update my answer with an example.
@SCouto the answer works, thank you very much. I have one more doubt: I am trying to handle more cases, for example when the column is null in both DataFrames it should give "Not Matched". I wrote .withColumn("result", when(df1("name") === df2("id") || (df1("name") is null && df2("name") is null, "notmatched").otherwise("matched")), but it reports that & is not found.
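The thread does not include a reply to the follow-up about nulls. As a hedged sketch only: Column has no "is null" operator, but it does provide isNull, and conditions can be combined with && once each side is a Column. The data, the aliases l and r, and the name column below are hypothetical, and the local SparkSession is an assumption made so the snippet is self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[1]").appName("null-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; Option[String] produces a nullable column.
val left  = Seq[(Int, Option[String])]((1, Some("ann")), (2, None), (3, Some("bob"))).toDF("id", "name")
val right = Seq[(Int, Option[String])]((1, Some("ann")), (2, None), (3, Some("rob"))).toDF("id", "name")

val joined = left.as("l").join(right.as("r"), $"l.id" === $"r.id")

// Both sides null is labelled explicitly, then equality is checked.
val result = joined.withColumn("result",
  when($"l.name".isNull && $"r.name".isNull, "Not Matched")
    .when($"l.name" === $"r.name", "Matched")
    .otherwise("Not Matched"))
```

Because === evaluates to null when either side is null, and when treats a null condition as not satisfied, rows with nulls would fall through to otherwise even without the first branch; spelling it out mainly documents the intent.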