
Scala: How to compare a Spark dataframe column with another dataframe's column values

Tags: scala, apache-spark, apache-spark-sql

I have two dataframes, as below:

df1: This one has a few values, as below. It is dynamic.

+--------------+
|tags          |
+--------------+
|first_name    |
|last_name     |
|primary_email |
|other_email   |
+--------------+
df2: The second dataframe has a few predefined combinations, as below:

+---------------------------------------------------------------------------------------------+
|combinations                                                                                 |
+---------------------------------------------------------------------------------------------+
|last_name, first_name, primary_email                                                         |
|last_name, first_name, other_email                                                           |
|last_name, primary_email, primary_phone                                                      |
|last_name, primary_email, secondary_phone                                                    |
|last_name, address_line1, address_line2,city_name, state_name,postal_code, country_code, guid|
+---------------------------------------------------------------------------------------------+
Expected result: Now I want to find, from my dataframe, every valid combination I can make. If a combination matches any combination in the combinations dataframe, the result should contain all such valid combinations.

resultDF:

+---------------------------------------------------------------------------------------------+
|combinations                                                                                 |
+---------------------------------------------------------------------------------------------+
|last_name, first_name, primary_email                                                         |
|last_name, first_name, other_email                                                           |
+---------------------------------------------------------------------------------------------+
I tried an approach of converting both dataframes to lists and comparing them, but I always get 0 combinations.

The Scala code I tried:

val combinationList = combinations.map(r => r.getString(0)).collect.toList

var combList: Seq[Seq[String]] = Seq.empty

for (comb <- combinationList) {
  var tmp: Seq[String] = Seq.empty
  tmp = tmp :+ comb
  combList = combList :+ tmp
}

val result = combList.filter(
  list => df1.filter(df1.col("tags").isin(list: _*)).count == list.size
)

println(result.size)
This always returns 0, but the answer should be 2.

Can someone tell me the best way to do this?

Try this: collect df1 and create a new array column in df2 holding df1's values. If you are on Spark 2.4, use array_except to compare the two arrays; it returns the elements of the first array that are not present in the second. Then filter the rows where the size of that difference is 0.

scala> val df1 = Seq(
     |   "first_name",
     |   "last_name",
     |   "primary_email",
     |   "other_email" 
     | ).toDF("tags")
df1: org.apache.spark.sql.DataFrame = [tags: string]

scala> 

scala> val df2 = Seq(
     | Seq("last_name", "first_name", "primary_email"),                                                         
     | Seq("last_name", "first_name", "other_email"),
     | Seq("last_name", "primary_email", "primary_phone"),                                                      
     | Seq("last_name", "primary_email", "secondary_phone"),
     | Seq("last_name", "address_line1", "address_line2", "city_name", "state_name", "postal_code", "country_code", "guid")
     | ).toDF("combinations")
df2: org.apache.spark.sql.DataFrame = [combinations: array<string>]

scala> 

scala> df1.show(false)
+-------------+
|tags         |
+-------------+
|first_name   |
|last_name    |
|primary_email|
|other_email  |
+-------------+


scala> 

scala> df2.show(false)
+-------------------------------------------------------------------------------------------------+
|combinations                                                                                     |
+-------------------------------------------------------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |
|[last_name, first_name, other_email]                                                             |
|[last_name, primary_email, primary_phone]                                                        |
|[last_name, primary_email, secondary_phone]                                                      |
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|
+-------------------------------------------------------------------------------------------------+


scala> 

scala> val df1tags = df1.collect.map(r => r.getString(0))
df1tags: Array[String] = Array(first_name, last_name, primary_email, other_email)

scala> 

scala> val df3 = df2.withColumn("tags", lit(df1tags))
df3: org.apache.spark.sql.DataFrame = [combinations: array<string>, tags: array<string>]

scala> df3.show(false)
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+
|combinations                                                                                     |tags                                               |
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |[first_name, last_name, primary_email, other_email]|
|[last_name, first_name, other_email]                                                             |[first_name, last_name, primary_email, other_email]|
|[last_name, primary_email, primary_phone]                                                        |[first_name, last_name, primary_email, other_email]|
|[last_name, primary_email, secondary_phone]                                                      |[first_name, last_name, primary_email, other_email]|
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|[first_name, last_name, primary_email, other_email]|
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+


scala> 

scala> val df4 = df3.withColumn("combMinusTags", array_except($"combinations", $"tags"))
df4: org.apache.spark.sql.DataFrame = [combinations: array<string>, tags: array<string> ... 1 more field]

scala> df4.show(false)
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+
|combinations                                                                                     |tags                                               |combMinusTags                                                                         |
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |[first_name, last_name, primary_email, other_email]|[]                                                                                    |
|[last_name, first_name, other_email]                                                             |[first_name, last_name, primary_email, other_email]|[]                                                                                    |
|[last_name, primary_email, primary_phone]                                                        |[first_name, last_name, primary_email, other_email]|[primary_phone]                                                                       |
|[last_name, primary_email, secondary_phone]                                                      |[first_name, last_name, primary_email, other_email]|[secondary_phone]                                                                     |
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|[first_name, last_name, primary_email, other_email]|[address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+


scala> 

scala> 

scala> df4.filter(size($"combMinusTags") === 0).show(false)
+--------------------------------------+---------------------------------------------------+-------------+
|combinations                          |tags                                               |combMinusTags|
+--------------------------------------+---------------------------------------------------+-------------+
|[last_name, first_name, primary_email]|[first_name, last_name, primary_email, other_email]|[]           |
|[last_name, first_name, other_email]  |[first_name, last_name, primary_email, other_email]|[]           |
+--------------------------------------+---------------------------------------------------+-------------+
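Putting the session above together, the whole approach condenses to a few lines. This is just a restatement of the steps already shown, assuming a spark-shell session (so spark.implicits._ is in scope) and the df1/df2 defined above:

import org.apache.spark.sql.functions.{array_except, col, lit, size}

// Collect df1's tags and attach them as an array column to every df2 row,
// then keep only the combinations that leave nothing over once the tags
// are removed.
val df1tags = df1.collect.map(_.getString(0))

val resultDF = df2
  .withColumn("tags", lit(df1tags))
  .withColumn("combMinusTags", array_except(col("combinations"), col("tags")))
  .filter(size(col("combMinusTags")) === 0)
  .select("combinations")

resultDF.show(false)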
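Note that array_except is only available from Spark 2.4 on. On an earlier version, a small UDF can compute the same difference; this is a sketch reusing the df3 built above, and arrayExceptUdf is just an illustrative name:

import org.apache.spark.sql.functions.{size, udf}

// Returns the elements of comb that do not appear in tags,
// mirroring what array_except does in Spark 2.4+.
val arrayExceptUdf = udf((comb: Seq[String], tags: Seq[String]) => comb.diff(tags))

val df4 = df3.withColumn("combMinusTags", arrayExceptUdf($"combinations", $"tags"))
df4.filter(size($"combMinusTags") === 0).select("combinations").show(false)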
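As for why the list-based attempt in the question always returns 0: each combination string is appended whole as a single-element Seq, so isin ends up looking for the entire string "last_name, first_name, primary_email" among the tags, which never matches. A minimal sketch of that fix, assuming the combinations dataframe holds one comma-separated string per row as in the question:

// Split each comma-separated combination string into individual tags
// before checking them against df1's tags column.
val combList: Seq[Seq[String]] = combinations
  .collect
  .map(_.getString(0).split(",").map(_.trim).toSeq)
  .toSeq

val result = combList.filter { list =>
  // a combination is valid when every one of its tags exists in df1
  df1.filter(df1.col("tags").isin(list: _*)).count == list.size
}

println(result.size) // 2 for the sample data above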

