Python: How to split and compare DataFrames in pandas

python pandas

I have two different DataFrames in Python, as shown below:

import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"], 
                   'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"], 
                    'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})

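I want to compare the values of "AAA" and count the matching values per "BBB" group. For example, the similarity between c1 and c11 is 1 (a1), and the similarity between c2 and c21 is 2 (a2, a4).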

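In other words, I want to match all row pairs of df and df2 where the string df2['BBB'] starts with the string df['BBB'] and, for those matching row pairs, collect all values of df['AAA'] where df['AAA'] equals df2['AAA'].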
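The definition of similarity used in the question can be checked by hand for a single pair of groups. The sketch below is only an illustration: the helper name group_similarity is hypothetical and not part of the original post, and it simply intersects the "AAA" values of one "BBB" group from each frame.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# Hypothetical helper (not from the original post): returns the AAA
# values shared by one BBB group of each frame.
def group_similarity(left, right, left_group, right_group):
    left_values = set(left.loc[left['BBB'] == left_group, 'AAA'])
    right_values = set(right.loc[right['BBB'] == right_group, 'AAA'])
    return left_values & right_values

shared_c1 = group_similarity(df, df2, 'c1', 'c11')
shared_c2 = group_similarity(df, df2, 'c2', 'c21')
print(len(shared_c1), sorted(shared_c1))  # 1 ['a1']
print(len(shared_c2), sorted(shared_c2))  # 2 ['a2', 'a4']
```

This reproduces the two examples from the question, but it handles one pair of groups at a time; it does not find the matching group pairs itself.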

The following code computes the similarity you are after (it does not use the CCC column):

# Merge both DataFrames on column 'AAA': only rows whose
# AAA value appears in both frames are of interest.
merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Check whether the BBB string from df2 starts
# with the BBB string from df.
def check(row):
    return row['BBB_df2'].startswith(row['BBB_df'])

# Apply the check to every row to build a boolean mask.
matches = merged.apply(check, axis='columns')

# Aggregate only the rows that satisfy both criteria.
result = merged[matches].groupby(['BBB_df', 'BBB_df2']).agg({'AAA': ['nunique', set]})
result.columns = ['similarity', 'AAA_values']
result = result.reset_index()
result

The output is:

Out[111]: 
  BBB_df BBB_df2  similarity AAA_values
0     c1     c11           1       {a1}
1     c1     c13           1       {a7}
2     c2     c21           2   {a2, a4}
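For readers who want to run the whole thing in one go, here is a self-contained variant of the same idea. It is a sketch, not the answer's exact code: it swaps DataFrame.apply for a plain list comprehension and uses pandas named aggregation (available since pandas 0.25) instead of renaming the columns afterwards. The names is_prefix and summary are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# Inner join on AAA keeps only the AAA values present in both frames.
merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Boolean mask: True where the df2 group name starts with the df group name.
is_prefix = [b2.startswith(b1)
             for b1, b2 in zip(merged['BBB_df'], merged['BBB_df2'])]

# Count and collect the shared AAA values per matching group pair.
summary = (
    merged.loc[is_prefix]
    .groupby(['BBB_df', 'BBB_df2'])
    .agg(similarity=('AAA', 'nunique'), AAA_values=('AAA', set))
    .reset_index()
)
print(summary)
```

Note that the prefix test also pairs c1 with c13 (via a7), exactly as in the output above, because "c13" starts with "c1".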

Sorry, I do not understand the question at all. Could you restate how you are using CCC? If you are not using it, why show it to us?

I removed CCC.
