Python: finding rows lost after a PySpark DataFrame join
Given two Spark DataFrames, csv_df and other_df, I need to join them and then determine which rows of csv_df were lost during the join.
Here is what I tried:
csv_df = self.sqlContext.read.load('csv_table.parquet')
csv_df = csv_df.withColumn(
"mid", monotonically_increasing_id()
)
other_df = self.sqlContext.read.load('other_table.parquet')
joined = csv_df.join(other_df, ['col1', 'col2'])
found_rows = joined.select('mid').distinct()
not_found_ids = csv_df.where(~csv_df.mid.isin(found_rows))
This gives me the following error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
What am I doing wrong, and how can I fix the code to get the rows that were not joined?

Although I would still like to know why the code I posted above does not work, I have since realized that I can answer my own question by doing another join:
csv_df = self.sqlContext.read.load('csv_table.parquet')
csv_df = csv_df.withColumn(
"mid", monotonically_increasing_id()
)
other_df = self.sqlContext.read.load('other_table.parquet')
joined = csv_df.join(other_df, ['col1', 'col2'])
found_rows = joined.select('mid').distinct()
not_found_rows = csv_df.selectExpr('mid').subtract(found_rows)
not_found_ids = csv_df.join(not_found_rows, 'mid')