Apache Spark: generating disjoint sets from a Spark DataFrame

I want to group a Spark DataFrame whenever any one of several columns has equal values. For example, for the following df:

  val df = Seq(
    ("a1", "b1", "c1"),
    ("a1", "b2", "c2"),
    ("a3", "b2", "c3"),
    ("a4", "b4", "c3"),
    ("a5", "b5", "c5")
  ).toDF("a", "b", "c")
I want to group rows whenever the values of the columns a, b, or c match. In the example DataFrame, the first row matches the second row on field a, the second row matches the third row on field b, and the third row matches the fourth row on field c, so they all end up in the same group (think union find). The fifth row is a singleton set:

val grouped = Seq(
  ("a1", "b1", "c1", "1"),
  ("a1", "b2", "c2", "1"),
  ("a3", "b2", "c3", "1"),
  ("a4", "b4", "c3", "1"),
  ("a5", "b5", "c5", "2")
).toDF("a", "b", "c", "group")

I added the group column to convey the intuition of a possible disjoint-set result.
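
To make the grouping rule concrete, here is a minimal driver-side union-find sketch in plain Python (not Spark) over the example rows; the data is hard-coded and the helpers find/union are only for illustration.

rows = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a3", "b2", "c3"),
        ("a4", "b4", "c3"), ("a5", "b5", "c5")]

parent = list(range(len(rows)))          # one set per row to start with

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]    # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Union two rows whenever they share a value in the same column
for col in range(3):
    seen = {}
    for i, row in enumerate(rows):
        if row[col] in seen:
            union(i, seen[row[col]])
        else:
            seen[row[col]] = i

print([find(i) for i in range(len(rows))])   # rows 0-3 share one root, row 4 is alone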

Try this and let me know. Basically, we replace the values with their occurrence counts and keep the rows where every count is 1. Warning: computationally heavy because of the collect() involved.
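
The code block of this answer did not survive in this copy of the post, so below is only a hedged sketch of the described idea, not necessarily the answerer's original code: join every column against its per-value occurrence count, sum the counts per row, and keep the rows whose values all occur exactly once. The names res and counts are illustrative, and the same sqlContext test data as in the second answer is assumed.

import pyspark.sql.functions as F

# Hedged sketch, not the original answer code. Same test data as the second answer below.
tst = sqlContext.createDataFrame([('a1','b1','c1','d1'),('a1','b2','c2','d2'),('a3','b2','c3','d6'),('a4','b4','c3','d7'),('a5','b5','c5','d7'),('a6','b6','c6','d27'),('a9','b88','c54','d71')],schema=['a','b','c','d'])

# Attach a row id, then join every column against its per-value occurrence count
res = tst.withColumn('id', F.monotonically_increasing_id())
for c in tst.columns:
    counts = tst.groupBy(c).count().withColumnRenamed('count', c + '_cnt')
    res = res.join(counts, on=c, how='left')

# A row is disjoint from all others when each of its values occurs exactly once,
# i.e. the summed counts equal the number of columns
res = res.withColumn('sum', sum(F.col(c + '_cnt') for c in tst.columns))
res.filter(F.col('sum') == len(tst.columns)).select('id', 'sum', *tst.columns).show()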

Result:

+------------+---+---+---+---+---+
|          id|sum|  a|  b|  c|  d|
+------------+---+---+---+---+---+
|146028888064|4.0| a6| b6| c6|d27|
|171798691840|4.0| a9|b88|c54|d71|
+------------+---+---+---+---+---+
+---+---+---+---+----------------+-------------------+----+
|  a|  b|  c|  d|  elements_array|         main_array|flag|
+---+---+---+---+----------------+-------------------+----+
| a6| b6| c6|d27|[b2, c3, a1, d7]|  [a6, b6, c6, d27]|   0|
| a9|b88|c54|d71|[b2, c3, a1, d7]|[a9, b88, c54, d71]|   0|
+---+---+---+---+----------------+-------------------+----+

If you need a group column, you can change the final join to a right join and derive the group column from the sum value:
F.when(F.col('sum')==len(tst.columns),1).otherwise(0)
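
As an illustration only, reusing the hypothetical res, sum and tst names from the sketch above, the group column could be derived like this:

# Hedged illustration; 'res', 'sum' and 'tst' refer to the hypothetical sketch above
grouped = res.withColumn('group', F.when(F.col('sum') == len(tst.columns), 1).otherwise(0))
grouped.select(tst.columns + ['group']).show()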

I just want to add this as a new answer because I am not too sure about the performance of cube versus collect(). But I feel it is better than my previous answer. Try this:

import pyspark.sql.functions as F
#Test data
tst = sqlContext.createDataFrame([('a1','b1','c1','d1'),('a1','b2','c2','d2'),('a3','b2','c3','d6'),('a4','b4','c3','d7'),('a5','b5','c5','d7'),('a6','b6','c6','d27'),('a9','b88','c54','d71')],schema=['a','b','c','d'])
#%% cube the columns and count

tst_res1 = tst.cube('a','b','c','d').count()
# We need the occurrence count of individual values per column, so we count how many nulls each cube row has
tst_nc = tst_res1.withColumn("null_count", sum([F.when(F.col(x).isNull(), 1).otherwise(0) for x in tst_res1.columns]))
# Keep only the single-column aggregates (3 nulls, since we have 4 columns) whose value occurs more than once
tst_flt = tst_nc.filter((F.col('null_count') == len(tst.columns) - 1) & (F.col('count') > 1))
# coalesce to get the element that occurs more than once in each remaining row
tst_coala = tst_flt.withColumn("elements", F.coalesce(*tst.columns))
# collect all elements that occur more than once into a single list
tst_array = tst_coala.groupby(F.lit(1)).agg(F.collect_list('elements').alias('elements')).collect()
#%% convert elements to strings; can be skipped for numericals
elements = [str(x) for x in tst_array[0]['elements']]
#%% introduce the values that occur more than once as an array column in the main df
tst_cmp = tst.withColumn("elements_array", F.array(*[F.lit(x) for x in elements]))
# put each row's own values into an array as well
tst_cmp = tst_cmp.withColumn("main_array", F.array(*tst.columns))
#%% flag whether any value in the row occurs more than once in the entire data
tst_result = tst_cmp.withColumn("flag", F.size(F.array_intersect(F.col('main_array'), F.col('elements_array'))))
#%% select the disjoint rows
tst_final = tst_result.where('flag=0')
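
The disjoint rows can then be displayed with, for example:

tst_final.show()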
Result:

+------------+---+---+---+---+---+
|          id|sum|  a|  b|  c|  d|
+------------+---+---+---+---+---+
|146028888064|4.0| a6| b6| c6|d27|
|171798691840|4.0| a9|b88|c54|d71|
+------------+---+---+---+---+---+
+---+---+---+---+----------------+-------------------+----+
|  a|  b|  c|  d|  elements_array|         main_array|flag|
+---+---+---+---+----------------+-------------------+----+
| a6| b6| c6|d27|[b2, c3, a1, d7]|  [a6, b6, c6, d27]|   0|
| a9|b88|c54|d71|[b2, c3, a1, d7]|[a9, b88, c54, d71]|   0|
+---+---+---+---+----------------+-------------------+----+

A bit confused by your expected output. Why is 2 assigned to the last row? Can you explain the logic? @Raghu It does indeed. I updated my question, it would then be... Did you get a chance to look at the answer? Curious to know what you think. I still need to look at it, thanks! I have generated the union-find data structure using Spark GraphX. I may come back to a more ad hoc solution, but to be fair the results look good...