Python: number of entries lost when merging DataFrames
In an exercise, I was asked to merge 3 DataFrames with an inner join (df1+df2+df3=mergedDf). Then, in another question, I was asked how many entries were lost when performing this three-way merge.
import pandas as pd
#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
Goals Medals
Argentina 5 2
Angola 1 0
Bolivia 3 1
#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
Dates Medals
Venezuela 1 0
Africa 2 1
Argentina 2 2
#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
Players Goals
Argentina 11 5
Australia 11 1
Belgica 10 0
#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
Goals_x Medals_x Dates Medals_y Players Goals_y
Argentina 5 2 2 2 11 5
#Calculate number of lost entries by code
I tried merging everything with an outer join and then subtracting mergedDf, but I don't know how to do it. Can anyone help?
You can pass True to the indicator parameter of merge:
df1=pd.DataFrame({'A':[1,2,3],'B':[1,1,1]})
df2=pd.DataFrame({'A':[2,3],'B':[1,1]})
df1.merge(df2,on='A',how='inner')
Out[257]:
A B_x B_y
0 2 1 1
1 3 1 1
df1.merge(df2,on='A',how='outer',indicator =True)
Out[258]:
A B_x B_y _merge
0 1 1 NaN left_only
1 2 1 1.0 both
2 3 1 1.0 both
mergedf=df1.merge(df2,on='A',how='outer',indicator =True)
Then use value_counts. It tells you how many rows an inner join loses, because with how='inner' only the rows marked both are kept:
mergedf['_merge'].value_counts()
Out[260]:
both 2
left_only 1
right_only 0
Name: _merge, dtype: int64
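To turn those counts into a single number, one option is to count the rows whose indicator is not both. This is a minimal sketch on the same two example frames; the variable name lost is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
df2 = pd.DataFrame({'A': [2, 3], 'B': [1, 1]})

# The outer join keeps every key; _merge records which side each row came from.
mergedf = df1.merge(df2, on='A', how='outer', indicator=True)

# Rows not marked 'both' are the ones an inner join would drop.
lost = int((mergedf['_merge'] != 'both').sum())
print(lost)  # 1
```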
For 3 DataFrames, chain two merges so that there are two indicator columns, then filter on the value both in each:
df1.merge(df2, on='A',how='outer',indicator =True).rename(columns={'_merge':'merge'}).merge(df3, on='A',how='outer',indicator =True)
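The filtering step can be sketched as follows. The third frame df3 is not defined in the answer, so the one below is a hypothetical example, and the names out, in_all and lost are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
df2 = pd.DataFrame({'A': [2, 3], 'B': [1, 1]})
# Hypothetical third frame, assumed only for this illustration.
df3 = pd.DataFrame({'A': [3, 4], 'C': [1, 1]})

# The first indicator column is renamed so the second merge can create its own _merge.
out = (df1.merge(df2, on='A', how='outer', indicator=True)
          .rename(columns={'_merge': 'merge'})
          .merge(df3, on='A', how='outer', indicator=True))

# A key survives the three-way inner join only if both indicators say 'both'.
in_all = out['merge'].eq('both') & out['_merge'].eq('both')
lost = int((~in_all).sum())
print(lost)  # 3
```

Rows that come only from df3 have a missing value in the first indicator column; eq('both') treats that as False, so they are counted as lost.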
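A solution using outer joins with the indicator parameter: finally, count the rows that are not both in both indicator columns a and b by summing the True values (each True counts as 1).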
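A minimal sketch of that count, assuming the question's three frames (with df2 indexed by Venezuela, Africa and Argentina) and using a and b as the indicator column names mentioned above. The names outer1, outer2 and in_all are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Goals': [5, 1, 3], 'Medals': [2, 0, 1]},
                   index=['Argentina', 'Angola', 'Bolivia'])
df2 = pd.DataFrame({'Dates': [1, 2, 2], 'Medals': [0, 1, 2]},
                   index=['Venezuela', 'Africa', 'Argentina'])
df3 = pd.DataFrame({'Players': [11, 11, 10], 'Goals': [5, 1, 0]},
                   index=['Argentina', 'Australia', 'Belgica'])

# Passing a string to indicator names the indicator column.
outer1 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True, indicator='a')
outer2 = pd.merge(outer1, df3, how='outer', left_index=True, right_index=True, indicator='b')

# A label is kept by the inner joins only if it is 'both' in a and in b.
in_all = outer2['a'].eq('both') & outer2['b'].eq('both')
missing = int((~in_all).sum())
print(missing)  # 6
```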
Another solution is to use the inner join and, for each DataFrame, sum the index values that do not appear in mergedDf.index:
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
vals = mergedDf.index
print (vals)
Index(['Argentina'], dtype='object')
dfs = [df1, df2, df3]
missing = sum((~x.index.isin(vals)).sum() for x in dfs)
print (missing)
6
Another solution, if the values in each index are unique:
dfs = [df1, df2, df3]
L = [set(x.index) for x in dfs]
#https://stackoverflow.com/a/25324329/2901002
missing = len(set.union(*L) - set.intersection(*L))
print (missing)
6
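The isin count and the set count agree on the question's data because every label other than Argentina appears in exactly one frame. They can differ when a label is shared by only some of the frames, as this sketch with hypothetical frames a, b and c suggests: the first counts dropped rows per frame, the second counts distinct dropped labels.

```python
import pandas as pd

# Hypothetical frames: label 'y' appears in two of the three indexes.
a = pd.DataFrame({'v': [1, 2]}, index=['x', 'y'])
b = pd.DataFrame({'w': [3, 4]}, index=['x', 'y'])
c = pd.DataFrame({'z': [5]}, index=['x'])
dfs = [a, b, c]

index_sets = [set(d.index) for d in dfs]
common = set.intersection(*index_sets)

# Rows dropped, summed over the frames ('y' is counted once for a and once for b).
rows_dropped = sum(int((~d.index.isin(list(common))).sum()) for d in dfs)
# Distinct labels dropped ('y' is counted once).
labels_dropped = len(set.union(*index_sets) - common)
print(rows_dropped, labels_dropped)  # 2 1
```

Which number counts as "lost entries" depends on how the exercise defines an entry.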
I found a simple but effective solution:
Merge the 3 DataFrames (inner and outer), then subtract the lengths:
Comment: Please post a [...]. The OP needs: "I was asked to merge 3 DataFrames with an inner join (df1+df2+df3=mergedDf)".
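import pandas as pd

def count_lost(df1, df2, df3, key):
    # count_lost and key are illustrative names; key is the column shared by all three DataFrames
    inner = pd.merge(pd.merge(df1, df2, on=key, how='inner'), df3, on=key, how='inner')
    outer = pd.merge(pd.merge(df1, df2, on=key, how='outer'), df3, on=key, how='outer')
    # Rows found by the outer join but missing from the inner join are the lost entries
    return len(outer) - len(inner)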
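The question's frames are joined on their index rather than on a shared column, so the same outer-minus-inner idea can be sketched with left_index/right_index. The helper name merge_all is hypothetical, and the count assumes each index holds unique labels.

```python
import pandas as pd

df1 = pd.DataFrame({'Goals': [5, 1, 3], 'Medals': [2, 0, 1]},
                   index=['Argentina', 'Angola', 'Bolivia'])
df2 = pd.DataFrame({'Dates': [1, 2, 2], 'Medals': [0, 1, 2]},
                   index=['Venezuela', 'Africa', 'Argentina'])
df3 = pd.DataFrame({'Players': [11, 11, 10], 'Goals': [5, 1, 0]},
                   index=['Argentina', 'Australia', 'Belgica'])

def merge_all(frames, how):
    """Merge the frames one after another on their index (illustrative helper)."""
    out = frames[0]
    for frame in frames[1:]:
        out = pd.merge(out, frame, how=how, left_index=True, right_index=True)
    return out

inner = merge_all([df1, df2, df3], 'inner')
outer = merge_all([df1, df2, df3], 'outer')

# With unique index labels, the difference is the number of labels the inner join drops.
lost = len(outer) - len(inner)
print(lost)  # 6
```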