Python: comparing two dataframes


I am trying to compare two dataframes that have the same number of columns, i.e. 4 columns in each, with id as the key column in both dataframes.

df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to DF2, column_names, which is the list of columns whose values differ from df1.

df2.withColumn("column_names",udf())
DF1:

+---+-----+----+-------+
| id| name| sal|Address|
+---+-----+----+-------+
|  1|  ABC|5000|     US|
|  2|  DEF|4000|     UK|
|  3|  GHI|3000|    JPN|
|  4|  JKL|4500|    CHN|
+---+-----+----+-------+

DF2:

+---+-----+----+-------+
| id| name| sal|Address|
+---+-----+----+-------+
|  1|  ABC|5000|     US|
|  2|  DEF|4000|    CAN|
|  3|  GHI|3500|    JPN|
|  4|JKL_M|4800|    CHN|
+---+-----+----+-------+

Now I want DF3:

DF3:

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+

I have seen a similar question and tried it, but the result was different.

I am thinking of using a UDF: pass a row from each dataframe to the UDF, compare column by column, and return the list of columns. But for that, both dataframes should be sorted so that rows with the same id are sent to the UDF together, and sorting is a costly operation here. Is there any solution?

Python: PySpark version of my Scala code below

import pyspark.sql.functions as f

df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")

columns = df1.columns

# left join on id, keeping both inputs available under the aliases d1 and d2
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")

# for each column, add a <name>_temp column that holds the column name when the
# values differ between d1 and d2, and null otherwise
for name in columns:
    df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))

# concat_ws skips nulls, so only the names of the differing columns remain
df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))) \
   .select("d1.*", "column_names") \
   .show()
Scala: this is my best approach to your problem

val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")

val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")

columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
  .withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
  .show(false)
First, I join the two dataframes into df3 on id and keep the columns from df1. Then, folding left over df3, I add a temp column for each column that holds the column name whenever df1 and df2 have the same id but a different value in that column, and null otherwise.

After that, concat_ws over those temp columns drops the nulls, so only the names of the differing columns remain.

+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1  |ABC |5000|US     |            |
|2  |DEF |4000|UK     |Address     |
|3  |GHI |3000|JPN    |sal         |
|4  |JKL |4500|CHN    |name,sal    |
+---+----+----+-------+------------+
The only difference from your expected result is that the output is a string, not a list.
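If you need a list rather than the comma-separated string, one possible sketch (an addition, not part of the original answer) is to collect the *_temp columns from the PySpark version above into an array and drop the nulls with a SQL higher-order function (Spark 2.4+), assuming the same df3 and columns variables:

temp_cols = ", ".join(name + "_temp" for name in columns)

# gather the *_temp columns into an array and drop the nulls, yielding an
# array of differing column names instead of a comma-separated string
df3.withColumn(
    "column_names",
    f.expr("filter(array({}), x -> x is not null)".format(temp_cols))
).select("d1.*", "column_names").show()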


P.S. I forgot you were using PySpark, but this is plain Spark (Scala), sorry.

Here is your solution using a UDF. I have renamed the columns of the first dataframe dynamically so there is no ambiguity during the check. Please go through the code below and let me know if you have any questions.

>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
|  1| ABC|5000|     US|
|  2| DEF|4000|     UK|
|  3| GHI|3000|    JPN|
|  4| JKL|4500|    CHN|
+---+----+----+-------+

>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
|  1| ABC|5000|     US|
|  2| DEF|4000|    CAN|
|  3| GHI|3500|    JPN|
|  4|JKLM|4800|    CHN|
+---+----+----+-------+

>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")

# UDF declaration

>>> def CheckMatch(Column,r):
...     check=''
...     ColList=Column.split(",")
...     for cc in ColList:
...             if(r[cc] != r["x_" + cc]):
...                     check=check + "," + cc
...     return check.replace(',','',1).split(",")

>>> CheckMatchUDF = udf(CheckMatch)

# final columns required for the select
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")

>>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)), struct([df3[x] for x in df3.columns]))) \
...        .select(finalCol) \
...        .show()
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
|  1| ABC|5000|     US|          []|
|  2| DEF|4000|    CAN|   [Address]|
|  3| GHI|3500|    JPN|       [sal]|
|  4|JKLM|4800|    CHN| [name, sal]|
+---+----+----+-------+------------+
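Note that udf(CheckMatch) uses the default StringType return type, so column_names ends up as a string rather than a real array. A minimal sketch (an addition, assuming the same CheckMatch function) that declares an explicit array return type so array functions can be applied downstream:

from pyspark.sql.types import ArrayType, StringType

# same UDF, but with an explicit array-of-strings return type so the resulting
# column_names column is a true array instead of a plain string
CheckMatchUDF = udf(CheckMatch, ArrayType(StringType()))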

Assuming that we can use id to join these two datasets, I don't think there is a need for a UDF. This can be solved just by using an inner join together with functions such as array and array_remove.

First, let's create the two datasets:

df1 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "UK"],
  [3, "GHI", 3000, "JPN"],
  [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "CAN"],
  [3, "GHI", 3500, "JPN"],
  [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])

Next, we do an inner join between the two datasets and generate the condition df1[col] != df2[col] for each column except id. When the columns aren't equal we return the column name, otherwise an empty string. The list of conditions becomes the items of an array, from which we finally remove the empty items:

from pyspark.sql.functions import col, lit, array, when, array_remove

# conditions for all columns except id
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]

df1.join(df2, "id").select(*select_expr).show()

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  3|  GHI|3500|    JPN|       [sal]|
|  2|  DEF|4000|    CAN|   [Address]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+
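One caveat worth noting (an addition, not covered in the answer above): df1[c] != df2[c] evaluates to null when either side is null, so a null-versus-value difference is not reported. A minimal sketch of a null-safe variant, assuming the same df1, df2 and imports as above and using Column.eqNullSafe (Spark 2.3+):

# null-safe variant of the conditions: a null on one side and a value on the
# other side is also reported as a differing column
conditions_ns = [
    when(~df1[c].eqNullSafe(df2[c]), lit(c)).otherwise("")
    for c in df1.columns if c != 'id'
]

df1.join(df2, "id").select(
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_ns), "").alias("column_names")
).show()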
You can have that query built for you in PySpark and Scala by the spark-extension package (the gresearch.spark.diff module used below). It provides a diff transformation that does exactly that.

from gresearch.spark.diff import *

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()

+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff|    changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|   N|         []|  1|      ABC|       ABC|    5000|     5000|          US|           US|
|   C|  [Address]|  2|      DEF|       DEF|    4000|     4000|          UK|          CAN|
|   C|      [sal]|  3|      GHI|       GHI|    3000|     3500|         JPN|          JPN|
|   C|[name, sal]|  4|      JKL|     JKL_M|    4500|     4800|         CHN|          CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
While this is a simple example, diffing dataframes can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting the query right yourself.
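If you want the result shaped like the DF3 from the question (the second dataframe's values plus a column_names array), one possible follow-up sketch, assuming only the diff output columns shown above, is to keep the right-hand-side values and rename the changes column:

from pyspark.sql.functions import col

# keep the right-hand (df2) values and expose the list of changed columns
# under the name column_names, matching the DF3 shape from the question
diff_df = df1.diff_with_options(df2, options, 'id')
diff_df.select(
    col('id'),
    col('right_name').alias('name'),
    col('right_sal').alias('sal'),
    col('right_Address').alias('Address'),
    col('changes').alias('column_names'),
).show()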

Comments:

Do you want the solution in PySpark or Spark?
Do you mean Scala or Python? I am looking for a solution in Python.
Sorting is a costly operation. – In that case I don't think there is a better way than sorting.
There is no need for a UDF here. Use a left join on id, then compare the column values and create the new column_names. You don't need to do anything else here: leave them as null, and then a filter or array_except will remove the null values.
array_except only works as array_except(array(*conditions_), array(lit(None))), which brings the extra overhead of creating a new array where it isn't needed. As for filter, I think PySpark only exposes it through expr or selectExpr, or at least Databricks refuses to include it in from pyspark.sql.functions import filter.
Actually, I have a similar scenario going on (link removed), thanks in advance @Nikk.