
Python: difference between two DataFrames based on only one column in PySpark


I'm looking for a way to find the difference between two DataFrames based on a single column. For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])

df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])
DataFrame A:

+----------+---+
|first name| id|
+----------+---+
|        fa|  3|
|        fb|  5|
|        fc|  7|
+----------+---+
DataFrame B:

+---------+---+
|last name| id|
+---------+---+
|       la|  3|
|       lb| 10|
|       lc| 13|
+---------+---+
My goal is to find the difference between DataFrame A and DataFrame B with respect to the column id; the output would be the following DataFrame:

    +---------+---+
    |last name| id|
    +---------+---+
    |       lb| 10|
    |       lc| 13|
    +---------+---+
I don't want to use the following approach:

from pyspark.sql.functions import col

a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))
I'm looking for an efficient approach (in terms of both memory and speed) that doesn't require collecting the ids (there could be billions of them), something like the RDDs' subtractByKey, but for DataFrames.


PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.
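For reference, a minimal sketch of the RDD-level subtractByKey approach mentioned above, assuming both frames are keyed by id; converting df_b to an RDD and back is exactly the part this question wants to avoid:

# Hypothetical sketch for comparison only: it maps df_b to a pair RDD,
# which the question explicitly wants to avoid.
a_keyed = df_a.rdd.map(lambda r: (r["id"], r["first name"]))
b_keyed = df_b.rdd.map(lambda r: (r["id"], r["last name"]))

diff_rdd = b_keyed.subtractByKey(a_keyed)  # keep ids present in b but not in a
df_c = diff_rdd.map(lambda kv: (kv[1], kv[0])).toDF(["last name", "id"])
df_c.show()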

You can do a left anti join on the column id:

df_b.join(df_a.select('id'), how='left_anti', on=['id']).show()
+---+---------+
| id|last name|
+---+---------+
| 10|       lb|
| 13|       lc|
+---+---------+
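The left anti join keeps only the rows of df_b whose id has no match in df_a, and Spark executes it as a distributed join, so no ids are ever collected to the driver. If df_a's set of ids happens to be small enough to fit on each executor, a broadcast hint may avoid shuffling df_b; this is only a sketch, assuming pyspark.sql.functions.broadcast suits your data sizes:

from pyspark.sql.functions import broadcast

# Sketch only: ship df_a's ids to every executor so the anti join
# does not need to shuffle df_b. Sensible only when df_a is small.
df_c = df_b.join(broadcast(df_a.select('id')), on=['id'], how='left_anti')
df_c.show()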