Python: Finding rows missing after a PySpark dataframe join

Given two Spark dataframes, csv_df and other_df, I need to join them and then determine which rows from csv_df were lost in the join.

Here is what I have tried:

    from pyspark.sql.functions import monotonically_increasing_id

    csv_df = self.sqlContext.read.load('csv_table.parquet')
    # tag each row with a unique id so it can be traced after the join
    csv_df = csv_df.withColumn(
        "mid", monotonically_increasing_id()
    )
    other_df = self.sqlContext.read.load('other_table.parquet')
    joined = csv_df.join(other_df, ['col1', 'col2'])
    found_rows = joined.select('mid').distinct()
    not_found_ids = csv_df.where(~csv_df.mid.isin(found_rows))
This gives me the following error:

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

What am I doing wrong, and how can I fix the code so that I get the rows that were not joined?

Although I would still like to know why the code I posted above does not work, I have realized that I can answer my own question by performing another join:

    from pyspark.sql.functions import monotonically_increasing_id

    csv_df = self.sqlContext.read.load('csv_table.parquet')
    # tag each row with a unique id so it can be traced after the join
    csv_df = csv_df.withColumn(
        "mid", monotonically_increasing_id()
    )
    other_df = self.sqlContext.read.load('other_table.parquet')
    joined = csv_df.join(other_df, ['col1', 'col2'])
    # ids that survived the join
    found_rows = joined.select('mid').distinct()
    # ids present in csv_df but absent from the join result
    not_found_rows = csv_df.selectExpr('mid').subtract(found_rows)
    # join back on "mid" to recover the full rows that were lost
    not_found_ids = csv_df.join(not_found_rows, 'mid')
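
For reference, the AttributeError above most likely occurs because Column.isin expects literal values or Column objects, not a whole DataFrame, so passing found_rows to it fails. A more direct alternative is a left anti-join, which keeps only the rows of the left table that have no match in the right table on the join keys. The following is a minimal sketch, assuming the same parquet files, SQL context, and join columns as above (the left_anti join type is available in Spark 2.0+):

    # Alternative sketch (assumes the same files and join keys as above):
    # a left anti-join returns the rows of csv_df with no match in other_df
    # on col1/col2, so no "mid" column is needed at all.
    csv_df = self.sqlContext.read.load('csv_table.parquet')
    other_df = self.sqlContext.read.load('other_table.parquet')
    not_found_rows = csv_df.join(other_df, ['col1', 'col2'], 'left_anti')

If the isin approach is preferred, it can be made to work by first collecting the ids from found_rows into a plain Python list and passing that list to isin instead of the DataFrame.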