Python: how to count the number of times any two elements in a column occur together?


I need code that counts how many times any two titles appear together under the same Document_Source.

Here is the data:

import pandas as pd
from itertools import combinations
from collections import Counter
df = pd.DataFrame({'Title': ['Dead poet society',
'Before sunrise',
'Finding Dory',
'Blood diamond',
'A beautiful mind',
'Blood diamond',
'Before sunrise',
'The longest ride',
'Marley and me',
'The longest ride',
'Blood diamond',
'Dead poet society',
'Remember me',
'Inception',
'The longest ride',
'Gone with the wind',
'Dead poet society',
'Before sunrise',
'Midnight in Paris',
'Mean girls'],'1Name': ['Julia Roberts',
'Sandra Bullock',
'Emma Stone',
'Anne Hathaway',
'Amanda Seyfried',
'Anne Hathaway',
'Sandra Bullock',
'Reese Witherspoon',
'Jennifer Aniston',
'Reese Witherspoon',
'Anne Hathaway',
'Julia Roberts',
'Natalie Portman',
'Kate Winslet',
'Reese Witherspoon',
'Scarlett Johansson',
'Julia Roberts',
'Sandra Bullock',
'Meg Ryan',
'Lindsay Lohan'
], '2Place':['London',
'Paris',
'Rome',
'Canada',
'Scotland',
'Canada',
'Paris',
'Denmark',
'Germany',
'Denmark',
'Canada',
'London',
'Bulgaria',
'Sweden',
'Denmark',
'Brazil',
'London',
'Paris',
'Queensland',
'Qatar'], 'Document_Source': ['A','A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'E', 'E', 'E', 'E', 'E']   })
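As an example of the expected output:

Dead poet society and Before sunrise: 2 means that Dead poet society and Before sunrise appear together in two document sources. "Dead poet society" and "Before sunrise" are two titles.

The code I used: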

import xlrd
import pandas as pd
sample_df = pd.read_excel('sample_docu1.xlsx')
k=sample_df.groupby(['Document_Source','Title']).count()
print( '{}'.format(k))
The output I got:

                                                       Name  \
Title                                              A beautiful mind   
Document_Source                                                       
Agha-Hossein, M. M., El-Jouzi, S., Elmualim, A....              NaN   
Al Horr, Y., Arif, M., Kaushik, A., Mazroei, A....              1.0   
Altomonte, S., & Schiavon, S. (2013). Occupant ...              NaN   
Andelin, M., Sarasoja, A. L., Ventovuori, T., &...              NaN   
Armitage, L., & Murugan, A. (2013). The human g...              NaN   
Armitage, L., Murugan, A., & Kato, H. (2011). G...              NaN   
Azar, E., Nikolopoulou, C., & Papadopoulos, S. ...              1.0   
Baharum, M. R., & Pitt, M. (2009). Determining ...              NaN   
Baird, G. (2011). Did that building feel good f...              NaN   
Baird, G., & Penwell, J. (2012). Designers’ int...              NaN   
Baird, G., & Thompson, J. (2012). Lighting cond...              NaN  
.
.
.
.
.
. 
Expected output:

Dead poet society   Before sunrise  2
Dead poet society   Finding Dory    0
Dead poet society   Blood diamond   2
Dead poet society   A beautiful mind    0
Dead poet society   The longest ride    1
Dead poet society   Marley and me   1
Dead poet society   Remember me 0
Dead poet society   Inception   0
Dead poet society   Gone with the wind  0
Dead poet society   Midnight in Paris   1
Dead poet society   Mean girls  1
Dead poet society   Butterfly effect    0
Dead poet society   Letters to Juliet   0
Dead poet society   Pretty woman    0
Dead poet society   My Best Friend's Wedding    0
Dead poet society   The pursuit of happiness    0
Dead poet society   Dear john   0
Dead poet society   There's Something About Mary    0
Before sunrise  Finding Dory    0
Before sunrise  Blood diamond   2
Before sunrise  A beautiful mind    1
Before sunrise  The longest ride    1
Before sunrise  Marley and me   0
Before sunrise  Remember me 0
Before sunrise  Inception   0
Before sunrise  Gone with the wind  1
Before sunrise  Midnight in Paris   1
Before sunrise  Mean girls  1
Before sunrise  Butterfly effect    0
Before sunrise  Letters to Juliet   0
Before sunrise  Pretty woman    0
.
.
.
.
You can try:

from itertools import combinations
from collections import Counter

comb = df.groupby(['Document_Source'])["Title"].apply(
           lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]
       ).sum()
result = Counter(comb)
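We use combinations to build the pairs of movies and Counter to count them.

groupby(['Document_Source'])["Title"] groups the data by the Document_Source column and selects the Title series.

Then we use the apply function on each group. For each group, combinations(x, 2) gives every pair of values. Note that we sort each pair returned by combinations(x, 2) and turn it into a tuple, so that the same two titles always produce the same key: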

f = lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]
# b = ["A", "B", "C"]
# f(b)
# [('A', 'B'), ('A', 'C'), ('B', 'C')]    
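To see why the sort matters, here is a small sketch with two hypothetical groups that list the same placeholder titles in a different order. The names group_1 and group_2 are made up for illustration and are not part of the question's data.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical groups containing the same titles in a different order.
group_1 = ["B", "A"]
group_2 = ["A", "B"]

unsorted_pairs = list(combinations(group_1, 2)) + list(combinations(group_2, 2))
sorted_pairs = [tuple(sorted(pair)) for pair in unsorted_pairs]

# Without sorting, ('B', 'A') and ('A', 'B') are counted as different keys.
unsorted_counts = Counter(unsorted_pairs)
# After sorting, both occurrences collapse onto the single key ('A', 'B').
sorted_counts = Counter(sorted_pairs)

print(unsorted_counts)
print(sorted_counts)
```

Without the sort, the count for one pair of titles could be split across two keys depending on the row order inside each document.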
After the apply function, each group has a list of tuples:

3Docu_Source
A    [(Before sunrise, Dead poet society), (Dead po...
B    [(A beautiful mind, Blood diamond), (A beautif...
C    [(Marley and me, The longest ride), (Blood dia...
D    [(Inception, Remember me), (Remember me, The l...
E    [(Dead poet society, Gone with the wind), (Bef...
Name: 0Title, dtype: object
We use sum at the end because we want to concatenate the lists of tuples from all groups. With the OP's data, we get a single list of tuples:

[('Before sunrise', 'Dead poet society'),
 ('Dead poet society', 'Finding Dory'),
 ('Blood diamond', 'Dead poet society'),
 ('Before sunrise', 'Finding Dory'),
 ('Before sunrise', 'Blood diamond'),
 ('Blood diamond', 'Finding Dory'),
 ('A beautiful mind', 'Blood diamond'),
 ('A beautiful mind', 'Before sunrise'),
 ('A beautiful mind', 'The longest ride'),
 ('Before sunrise', 'Blood diamond'),
 ('Blood diamond', 'The longest ride'),
 ('Before sunrise', 'The longest ride'),
 ('Marley and me', 'The longest ride'),
 ('Blood diamond', 'Marley and me'),
 ('Dead poet society', 'Marley and me'),
 ('Blood diamond', 'The longest ride'),
 ('Dead poet society', 'The longest ride'),
 ('Blood diamond', 'Dead poet society'),
 ('Inception', 'Remember me'),
 ('Remember me', 'The longest ride'),
 ('Inception', 'The longest ride'),
 ('Dead poet society', 'Gone with the wind'),
 ('Before sunrise', 'Gone with the wind'),
 ('Gone with the wind', 'Midnight in Paris'),
 ('Gone with the wind', 'Mean girls'),
 ('Before sunrise', 'Dead poet society'),
 ('Dead poet society', 'Midnight in Paris'),
 ('Dead poet society', 'Mean girls'),
 ('Before sunrise', 'Midnight in Paris'),
 ('Before sunrise', 'Mean girls'),
 ('Mean girls', 'Midnight in Paris')]

Counter counts the number of occurrences of each pair.
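The Counter only holds pairs that co-occur at least once. A minimal sketch of one way to get a row per pair, including pairs with a count of 0, is shown here. It rebuilds only the Title and Document_Source columns from the question. The names pair_table, Title_1, Title_2 and Count are chosen for illustration, and the set() call assumes that a title repeated inside one document should be counted once.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Title and Document_Source columns copied from the question's data.
df = pd.DataFrame({
    "Title": [
        "Dead poet society", "Before sunrise", "Finding Dory", "Blood diamond",
        "A beautiful mind", "Blood diamond", "Before sunrise", "The longest ride",
        "Marley and me", "The longest ride", "Blood diamond", "Dead poet society",
        "Remember me", "Inception", "The longest ride",
        "Gone with the wind", "Dead poet society", "Before sunrise",
        "Midnight in Paris", "Mean girls",
    ],
    "Document_Source": list("AAAABBBBCCCCDDDEEEEE"),
})

# Count pairs document by document.
result = Counter()
for _, titles in df.groupby("Document_Source")["Title"]:
    # sorted(set(...)) removes repeats (assumption) and fixes the pair order.
    result.update(combinations(sorted(set(titles)), 2))

# Every pair of distinct titles; a Counter returns 0 for keys it has not seen.
all_titles = sorted(df["Title"].unique())
rows = [(a, b, result[(a, b)]) for a, b in combinations(all_titles, 2)]
pair_table = pd.DataFrame(rows, columns=["Title_1", "Title_2", "Count"])

print(pair_table.head())
```

Because both the keys and the lookup pairs come from sorted input, each pair is always addressed in the same order.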

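Here is another solution that I find more intuitive. One major difference is that my result dictionary uses frozenset keys, so the keys do not depend on order, i.e. result[frozenset({'A', 'B'})] == result[frozenset({'B', 'A'})].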
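A minimal sketch of the frozenset property this relies on, using placeholder labels 'A' and 'B' and a made-up count:

```python
# Two frozensets built from the same elements in a different order.
pair_ab = frozenset({"A", "B"})
pair_ba = frozenset({"B", "A"})

# A dictionary keyed by one of them (the value 2 is an arbitrary example).
counts = {pair_ab: 2}

print(pair_ab == pair_ba)  # the two keys compare equal
print(counts[pair_ba])     # so either one finds the same entry
```

One caveat: a frozenset cannot represent a pair of identical elements, since frozenset({'A', 'A'}) collapses to a single element.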


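Use to_dict to create a reproducible df. Give the expected output to help people understand your question.
Can you provide some data using df.head(30).to_dict()?
I have provided the data. Thanks.
Please explain this line of code: comb = df.groupby(['Document_Source'])["Title"].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]).sum()
@SanchariGhosh If you add the data as I mentioned in my comment, then I can use it to explain this to you and to future viewers.
What data do you need?
What should I change if I want the result in a list format? I also want to include all pairs, i.e. a pair with no common document source should show 0.
@SanchariGhosh Sorry, I have been busy lately and could not help.
I could not get the desired output. Please explain what you did here.
paircalc computes the number of common sources for a single pair. Then I call paircalc for each combination. My result above agrees with yours. Why is the output not what you want?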
import pandas as pd
from itertools import combinations
from collections import Counter, defaultdict

def paircalc(a, b):
    a_sources = set(df.loc[df.Title == a, 'Document_Source'])
    b_sources = set(df.loc[df.Title == b, 'Document_Source'])
    return len(a_sources & b_sources)

result = defaultdict(int)

for comb in combinations(set(df.Title), 2):
    result[frozenset(comb)] = paircalc(*comb)

# defaultdict(int,
#             {frozenset({'A beautiful mind', 'Marley and me'}): 0,
#              frozenset({'A beautiful mind', 'Finding Dory'}): 0,
#              frozenset({'A beautiful mind', 'The longest ride'}): 1,
#              frozenset({'A beautiful mind', 'Remember me'}): 0,
#              frozenset({'A beautiful mind', 'Gone with the wind'}): 0,
#              frozenset({'A beautiful mind', 'Mean girls'}): 0,
#              frozenset({'A beautiful mind', 'Dead poet society'}): 0,
# ...
#              frozenset({'Before sunrise', 'Dead poet society'}): 2,
#              frozenset({'Blood diamond', 'Dead poet society'}): 2,
#              frozenset({'Inception', 'Midnight in Paris'}): 0,
#              frozenset({'Before sunrise', 'Midnight in Paris'}): 1,
#              frozenset({'Blood diamond', 'Midnight in Paris'}): 0,
#              frozenset({'Before sunrise', 'Inception'}): 0,
#              frozenset({'Blood diamond', 'Inception'}): 0,
#              frozenset({'Before sunrise', 'Blood diamond'}): 2})
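As a cross-check on either answer, the same counts can be obtained with a different technique: a matrix product over a document-by-title incidence table. This is only a sketch and is not taken from either answer. It rebuilds the Title and Document_Source columns from the question, and the names incidence and cooccurrence are chosen for illustration.

```python
import pandas as pd

# Title and Document_Source columns copied from the question's data.
df = pd.DataFrame({
    "Title": [
        "Dead poet society", "Before sunrise", "Finding Dory", "Blood diamond",
        "A beautiful mind", "Blood diamond", "Before sunrise", "The longest ride",
        "Marley and me", "The longest ride", "Blood diamond", "Dead poet society",
        "Remember me", "Inception", "The longest ride",
        "Gone with the wind", "Dead poet society", "Before sunrise",
        "Midnight in Paris", "Mean girls",
    ],
    "Document_Source": list("AAAABBBBCCCCDDDEEEEE"),
})

# Rows are documents, columns are titles; 1 if the title occurs in the document.
# clip(upper=1) makes a title repeated within one document count only once.
incidence = pd.crosstab(df["Document_Source"], df["Title"]).clip(upper=1)

# (titles x documents) times (documents x titles): entry [a, b] is the number
# of documents containing both a and b. The diagonal holds, for each title,
# the number of documents it appears in.
cooccurrence = incidence.T.dot(incidence)

print(cooccurrence.loc["Dead poet society", "Before sunrise"])
```

The result is a square table that already contains the zero entries, at the cost of storing every pair twice (once as [a, b] and once as [b, a]).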