Python: How to split and compare DataFrames in pandas
I have two different DataFrames in Python, as shown below:
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})
I want to compare the values of "AAA" and count how many values the two frames share, grouped by "BBB".
For example, the similarity between c1 and c11 is 1 (a1), and the similarity between c2 and c21 is 2 (a2, a4).
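The similarity described above can be read as the size of a set intersection: the "AAA" values of one "BBB" group in df, intersected with those of one "BBB" group in df2. The following is a minimal sketch of that reading, not the asker's own code; the helper name shared_values is hypothetical and introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})


def shared_values(group_df, group_df2):
    """Hypothetical helper: AAA values common to one BBB group of df and one of df2."""
    left = set(df.loc[df['BBB'] == group_df, 'AAA'])
    right = set(df2.loc[df2['BBB'] == group_df2, 'AAA'])
    return left & right


c1_c11 = shared_values('c1', 'c11')
c2_c21 = shared_values('c2', 'c21')
print(len(c1_c11), sorted(c1_c11))  # similarity of c1 vs c11
print(len(c2_c21), sorted(c2_c21))  # similarity of c2 vs c21
```

This checks one pair of groups at a time; it does not decide which groups should be paired with each other.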
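In other words, you want to match all pairs of rows from df and df2 where the string df2['BBB'] starts with the string df['BBB'], and, among those matching pairs, keep the df['AAA'] values for which df['AAA'] equals df2['AAA'].
The following code computes the similarity you want (it does not use the CCC column):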
# merge both dataframes on column 'AAA', since
# in the end only the rows for which AAA is
# equal in both frames are of interest
merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# define a function that checks whether
# the BBB string of df2 starts with
# the BBB string of df
def check(o):
    return o['BBB_df2'].startswith(o['BBB_df'])

# apply it row by row to get a boolean mask
matches = merged.apply(check, axis='columns')

# now aggregate only the rows that meet
# both criteria
result = merged[matches].groupby(['BBB_df', 'BBB_df2']).agg({'AAA': ['nunique', set]})
result.columns = ['similarity', 'AAA_values']
result.reset_index()
The output is:
Out[111]:
  BBB_df BBB_df2  similarity AAA_values
0     c1     c11           1       {a1}
1     c1     c13           1       {a7}
2     c2     c21           2   {a2, a4}
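As a possible variation on the approach above, the row-wise prefix test can be written without DataFrame.apply by zipping the two string columns, and the result columns can be named directly with named aggregation. This is only a sketch under the same input data, not part of the original answer; the variable name mask is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# Keep only rows whose AAA value appears in both frames.
merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Prefix test per row: does df2's BBB start with df's BBB?
mask = [b2.startswith(b1)
        for b1, b2 in zip(merged['BBB_df'], merged['BBB_df2'])]

# Count distinct AAA values and collect them per (BBB_df, BBB_df2) pair.
result = (merged.loc[mask]
          .groupby(['BBB_df', 'BBB_df2'])['AAA']
          .agg(similarity='nunique', AAA_values=set)
          .reset_index())
print(result)
```

On small frames the difference is mostly stylistic; the list comprehension avoids building a row object for every call to the check function.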
Sorry, I don't understand the question at all. Could you restate how you are using CCC? If you aren't using it, why show it to us?
I have removed CCC.