Python: comparing two DataFrames
I am trying to compare two DataFrames that have the same number of columns, i.e. 4 columns, with id as the key column in both DataFrames.
df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to df2, column_names, which lists the columns whose values differ from df1:
df2.withColumn("column_names", udf())  # pseudocode: the udf is what I am asking about
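To make the goal concrete, the per-row comparison that such a udf would perform can be sketched in plain Python (dicts stand in for Spark Rows; the function name is made up for illustration):

```python
def changed_columns(row1, row2, key="id"):
    """Return the names of columns whose values differ between two rows."""
    return [c for c in row1 if c != key and row1[c] != row2[c]]

r1 = {"id": 4, "name": "JKL", "sal": 4500, "Address": "CHN"}
r2 = {"id": 4, "name": "JKLM", "sal": 4800, "Address": "CHN"}
print(changed_columns(r1, r2))  # ['name', 'sal']
```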
DF1:
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
|  1| ABC|5000|     US|
|  2| DEF|4000|     UK|
|  3| GHI|3000|    JPN|
|  4| JKL|4500|    CHN|
+---+----+----+-------+
DF2:
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
|  1| ABC|5000|     US|
|  2| DEF|4000|    CAN|
|  3| GHI|3500|    JPN|
|  4|JKLM|4800|    CHN|
+---+----+----+-------+
Now I want DF3 to be:
DF3:
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
|  1| ABC|5000|     US|          []|
|  2| DEF|4000|    CAN|   [Address]|
|  3| GHI|3500|    JPN|       [sal]|
|  4|JKLM|4800|    CHN| [name, sal]|
+---+----+----+-------+------------+
I have seen similar questions and tried those approaches, but the result was different.
I am considering a UDF: pass a row from each DataFrame to the UDF, compare column by column, and return the list of differing columns. But then both DataFrames would have to be sorted so that rows with the same id reach the UDF together, and that sort is expensive. Is there any better solution?

Python: a PySpark version of my Scala code shown further below:
import pyspark.sql.functions as f
df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")
columns = df1.columns
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")
for name in columns:
    df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))
df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))).select("d1.*", "column_names").show()
Scala: this is my best approach to your problem:
val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")
val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")
columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
.withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
.show(false)
First, I join the two DataFrames into df3, keeping the columns of df1. Folding left over those columns, I add one temporary column per source column that holds the column name when df1 and df2 disagree for the same id, and null when they agree.
After that, concat_ws over these temporary columns drops the nulls, leaving only the names of the differing columns:
+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1 |ABC |5000|US | |
|2 |DEF |4000|UK |Address |
|3 |GHI |3000|JPN |sal |
|4 |JKL |4500|CHN |name,sal |
+---+----+----+-------+------------+
Unlike the expected result, the output is a comma-joined string rather than a list.
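If an actual list is needed rather than the concat_ws string, the string can be split back apart; the empty-string case needs special handling, as the plain-Python sketch below shows (the helper name is illustrative):

```python
def to_column_list(column_names):
    # concat_ws skips nulls, so a fully matching row yields "";
    # "".split(",") would give [""], hence the explicit empty case.
    return column_names.split(",") if column_names else []

print(to_column_list("name,sal"))  # ['name', 'sal']
print(to_column_list(""))          # []
```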
P.S. I forgot you are using PySpark, but the above is plain (Scala) Spark, sorry.

Here is your solution using a UDF. I dynamically prefixed the columns of the first DataFrame with x_ so there is no ambiguity during the check. Please go through the code below and let me know if you have any questions.
>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| UK|
| 3| GHI|3000| JPN|
| 4| JKL|4500| CHN|
+---+----+----+-------+
>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| CAN|
| 3| GHI|3500| JPN|
| 4|JKLM|4800| CHN|
+---+----+----+-------+
>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")
# UDF declaration
>>> def CheckMatch(Column,r):
... check=''
... ColList=Column.split(",")
... for cc in ColList:
... if(r[cc] != r["x_" + cc]):
... check=check + "," + cc
... return check.replace(',','',1).split(",")
>>> CheckMatchUDF = udf(CheckMatch)
# final columns required for the select
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")
>>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)), struct([df3[x] for x in df3.columns]))) \
...     .select(finalCol) \
...     .show()
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKLM|4800| CHN| [name, sal]|
+---+----+----+-------+------------+
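Since the body of CheckMatch is plain Python, its logic can be sanity-checked without a Spark session; a dict stands in for the joined Row below. Note the quirk that a fully matching row returns [''] (an array holding one empty string) rather than []:

```python
def CheckMatch(Column, r):
    check = ''
    ColList = Column.split(",")
    for cc in ColList:
        if r[cc] != r["x_" + cc]:
            check = check + "," + cc
    # strip the leading comma, then split into a list of names
    return check.replace(',', '', 1).split(",")

row = {"id": "3", "name": "GHI", "sal": "3500", "Address": "JPN",
       "x_id": "3", "x_name": "GHI", "x_sal": "3000", "x_Address": "JPN"}
print(CheckMatch("id,name,sal,Address", row))  # ['sal']
```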
Assuming we can join the two datasets on id, I think no UDF is needed. This can be solved with an inner join and the array and array_remove functions.

First, let's create the two datasets:

df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])

First we do an inner join between the two datasets, then we generate the condition df1[col] != df2[col] for every column except id. When the columns are not equal we return the column name, otherwise an empty string. The list of conditions makes up the items of an array, from which we finally remove the empty items:

from pyspark.sql.functions import col, array, when, array_remove, lit

# get conditions for all columns except id
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]

df1.join(df2, "id").select(*select_expr).show()

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  3|  GHI|3500|    JPN|       [sal]|
|  2|  DEF|4000|    CAN|   [Address]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+
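The otherwise("") / array_remove pairing in the answer above can be mirrored in plain Python to see why empty strings, rather than nulls, are filtered out (dicts stand in for the joined rows; the helper name is illustrative):

```python
def column_names(row1, row2, key="id"):
    # when(df1[c] != df2[c], lit(c)).otherwise("") for each non-key column ...
    conditions = [c if row1[c] != row2[c] else "" for c in row1 if c != key]
    # ... then array_remove(array(*conditions), "") drops the placeholders
    return [c for c in conditions if c != ""]

a = {"id": 4, "name": "JKL", "sal": 4500, "Address": "CHN"}
b = {"id": 4, "name": "JKL_M", "sal": 4800, "Address": "CHN"}
print(column_names(a, b))  # ['name', 'sal']
```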
You can have that query built for you, in both PySpark and Scala, by the spark-extension package. It provides a diff transformation that does exactly this:

from gresearch.spark.diff import *

options = DiffOptions().with_change_column('change')
df1.diff_with_options(df2, options, 'id').show()

+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff|     change| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|   N|         []|  1|      ABC|       ABC|    5000|     5000|          US|           US|
|   C|  [Address]|  2|      DEF|       DEF|    4000|     4000|          UK|          CAN|
|   C|      [sal]|  3|      GHI|       GHI|    3000|     3500|         JPN|          JPN|
|   C|[name, sal]|  4|      JKL|     JKL_M|    4500|     4800|         CHN|          CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
While this is a simple example, diffing DataFrames can become complicated once wide schemas, insertions, deletions and null values are involved. That package is well tested, so you don't have to worry about getting the query right yourself.

Comments:
- Do you want the solution in PySpark or Spark? Do you mean Scala or Python? I am looking for a solution in Python.
- Sorting is a costly operation. In that case, I don't think there is a better way than sorting.
- No UDF is needed here. Use a left join on id, then compare the column values and create the new column_names. You don't need to do anything else here: leave the non-matches as null, and a filter or array_except will remove the null values.
- array_except only works as array_except(array(*conditions_), array(lit(None))), which adds the overhead of creating a new array when it isn't needed. As for filter, I think PySpark only offers it through expr or selectExpr, or at least Databricks refuses to include it in `from pyspark.sql.functions import filter`.
- Actually, I have a similar scenario going on; link removed. Thanks in advance @Nikk