Python: How to split and compare DataFrames in pandas


I have two different DataFrames in Python, as shown below:

import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"], 
                   'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"], 
                    'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})
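I want to compare the values of "AAA" and count the shared values per "BBB" group. For example, the similarity between c1 and c11 is 1 (a1), and the similarity between c2 and c21 is 2 (a2, a4).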


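In other words, I want to match all pairs of rows from df and df2 where the string df2['BBB'] starts with the string df['BBB'], and among those matched row pairs collect all values of df['AAA'] where df['AAA'] equals df2['AAA'].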

The following code computes the similarity you want (it does not use the CCC column):


# merge both dataframes on column 'AAA' since
# in the end only the rows are of interest
# for which AAA is equal in both frames
merged= df.merge(df2, on='AAA', suffixes=['_df', '_df2'])

# define a function that can be used
# to check the BBB-string of df2 starts
# with the BBB-string of df
def check(o):
    return o['BBB_df2'].startswith(o['BBB_df'])

# apply it to the dataframe to filter the rows    
matches= merged.apply(check , axis='columns')
# now aggregate only the rows to which both
# criteria apply
result= merged[matches].groupby(['BBB_df', 'BBB_df2']).agg({'AAA': ['nunique', set]})
result.columns= ['similarity', 'AAA_values']
result.reset_index()
The output is:

Out[111]: 
  BBB_df BBB_df2  similarity AAA_values
0     c1     c11           1       {a1}
1     c1     c13           1       {a7}
2     c2     c21           2   {a2, a4}
Input data:

import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"], 
                   'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"], 
                    'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})
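
For reference, here is a self-contained sketch of the same merge, filter and group steps. It is a variant rather than the exact code above: it assumes the prefix test can be done with a plain list comprehension instead of DataFrame.apply, and it uses named aggregation (available in pandas 0.25 and later), so the column names similarity and AAA_values are set directly in the agg call.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# inner join: keep only rows whose AAA value occurs in both frames
merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# row-wise prefix test: does BBB of df2 start with BBB of df?
mask = [b2.startswith(b1)
        for b1, b2 in zip(merged['BBB_df'], merged['BBB_df2'])]

# count and collect the shared AAA values per (BBB_df, BBB_df2) pair
result = (merged[mask]
          .groupby(['BBB_df', 'BBB_df2'])['AAA']
          .agg(similarity='nunique', AAA_values=lambda s: set(s))
          .reset_index())
print(result)
```

Note that a prefix test is loose: c1 also matches c13, which is why the pair (c1, c13) shows up in the result alongside (c1, c11).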

Sorry, I don't understand the question at all. Can you restate how you are using CCC? If you aren't using it, why show it to us?
I removed CCC.