Python: how do I count how many times any two elements of a column occur together?
I need code that counts how many times any two titles appear together under the same document source. This is the data:
import pandas as pd
from itertools import combinations
from collections import Counter
df = pd.DataFrame({'Title': ['Dead poet society',
'Before sunrise',
'Finding Dory',
'Blood diamond',
'A beautiful mind',
'Blood diamond',
'Before sunrise',
'The longest ride',
'Marley and me',
'The longest ride',
'Blood diamond',
'Dead poet society',
'Remember me',
'Inception',
'The longest ride',
'Gone with the wind',
'Dead poet society',
'Before sunrise',
'Midnight in Paris',
'Mean girls'], 'Name': ['Julia Roberts',
'Sandra Bullock',
'Emma Stone',
'Anne Hathaway',
'Amanda Seyfried',
'Anne Hathaway',
'Sandra Bullock',
'Reese Witherspoon',
'Jennifer Aniston',
'Reese Witherspoon',
'Anne Hathaway',
'Julia Roberts',
'Natalie Portman',
'Kate Winslet',
'Reese Witherspoon',
'Scarlett Johansson',
'Julia Roberts',
'Sandra Bullock',
'Meg Ryan',
'Lindsay Lohan'
], 'Place': ['London',
'Paris',
'Rome',
'Canada',
'Scotland',
'Canada',
'Paris',
'Denmark',
'Germany',
'Denmark',
'Canada',
'London',
'Bulgaria',
'Sweden',
'Denmark',
'Brazil',
'London',
'Paris',
'Queensland',
'Qatar'], 'Document_Source': ['A','A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'E', 'E', 'E', 'E', 'E'] })
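Taking the expected output as an example:
"Dead poet society" and "Before sunrise": 2 means that Dead poet society and Before sunrise appear together in two document sources. "Dead poet society" and "Before sunrise" are two titles.
The code I used: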
import xlrd
import pandas as pd
sample_df = pd.read_excel('sample_docu1.xlsx')
k=sample_df.groupby(['Document_Source','Title']).count()
print( '{}'.format(k))
The output I got:
Name \
Title A beautiful mind
Document_Source
Agha-Hossein, M. M., El-Jouzi, S., Elmualim, A.... NaN
Al Horr, Y., Arif, M., Kaushik, A., Mazroei, A.... 1.0
Altomonte, S., & Schiavon, S. (2013). Occupant ... NaN
Andelin, M., Sarasoja, A. L., Ventovuori, T., &... NaN
Armitage, L., & Murugan, A. (2013). The human g... NaN
Armitage, L., Murugan, A., & Kato, H. (2011). G... NaN
Azar, E., Nikolopoulou, C., & Papadopoulos, S. ... 1.0
Baharum, M. R., & Pitt, M. (2009). Determining ... NaN
Baird, G. (2011). Did that building feel good f... NaN
Baird, G., & Penwell, J. (2012). Designers’ int... NaN
Baird, G., & Thompson, J. (2012). Lighting cond... NaN
.
.
.
.
.
.
Expected output:
Dead poet society Before sunrise 2
Dead poet society Finding Dory 0
Dead poet society Blood diamond 2
Dead poet society A beautiful mind 0
Dead poet society The longest ride 1
Dead poet society Marley and me 1
Dead poet society Remember me 0
Dead poet society Inception 0
Dead poet society Gone with the wind 0
Dead poet society Midnight in Paris 1
Dead poet society Mean girls 1
Dead poet society Butterfly effect 0
Dead poet society Letters to Juliet 0
Dead poet society Pretty woman 0
Dead poet society My Best Friend's Wedding 0
Dead poet society The pursuit of happiness 0
Dead poet society Dear john 0
Dead poet society There's Something About Mary 0
Before sunrise Finding Dory 0
Before sunrise Blood diamond 2
Before sunrise A beautiful mind 1
Before sunrise The longest ride 1
Before sunrise Marley and me 0
Before sunrise Remember me 0
Before sunrise Inception 0
Before sunrise Gone with the wind 1
Before sunrise Midnight in Paris 1
Before sunrise Mean girls 1
Before sunrise Butterfly effect 0
Before sunrise Letters to Juliet 0
Before sunrise Pretty woman 0
.
.
.
.
You can try:
from itertools import combinations
from collections import Counter
comb = df.groupby(['Document_Source'])["Title"].apply(
    lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]
).sum()
result = Counter(comb)
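We use combinations to build the pairs of movies and Counter to count them.
groupby(['Document_Source'])["Title"] groups the data by the Document_Source column and selects the Title series.
Then we use the apply function on each group. For each group, we call combinations(x, 2) to get the pairs of values. Note that we sort each pair returned by combinations(x, 2) and turn it into a tuple, so the same two titles always produce the same key regardless of their order in the data.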
f = lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]
# b = ["A", "B", "C"]
# f(b)
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
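To see how the sorted pairs and Counter work together, here is a minimal sketch on toy data. The `groups` dictionary and the name `pair_counts` are made up for illustration and are not part of the original answer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: document source -> titles found in that source.
groups = {
    "A": ["Dead poet society", "Before sunrise", "Finding Dory"],
    "B": ["Before sunrise", "Dead poet society"],
}

pair_counts = Counter()
for titles in groups.values():
    # Sort each pair so ('x', 'y') and ('y', 'x') end up under one key.
    pair_counts.update(tuple(sorted(pair)) for pair in combinations(titles, 2))

print(pair_counts[("Before sunrise", "Dead poet society")])  # 2
print(pair_counts[("Finding Dory", "Gone with the wind")])   # 0, pair never seen
```

A Counter returns 0 for a key it has never seen, so looking up a pair that shares no source does not raise a KeyError.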
After the apply function, each group holds a list of tuples:
Document_Source
A    [(Before sunrise, Dead poet society), (Dead po...
B    [(A beautiful mind, Blood diamond), (A beautif...
C    [(Marley and me, The longest ride), (Blood dia...
D    [(Inception, Remember me), (Remember me, The l...
E    [(Dead poet society, Gone with the wind), (Bef...
Name: Title, dtype: object
We use sum at the end because we want to aggregate the lists of tuples from all groups into one. With the OP's data, we get a single list of tuples:
[('Before sunrise', 'Dead poet society'),
('Dead poet society', 'Finding Dory'),
('Blood diamond', 'Dead poet society'),
('Before sunrise', 'Finding Dory'),
('Before sunrise', 'Blood diamond'),
('Blood diamond', 'Finding Dory'),
('A beautiful mind', 'Blood diamond'),
('A beautiful mind', 'Before sunrise'),
('A beautiful mind', 'The longest ride'),
('Before sunrise', 'Blood diamond'),
('Blood diamond', 'The longest ride'),
('Before sunrise', 'The longest ride'),
('Marley and me', 'The longest ride'),
('Blood diamond', 'Marley and me'),
('Dead poet society', 'Marley and me'),
('Blood diamond', 'The longest ride'),
('Dead poet society', 'The longest ride'),
('Blood diamond', 'Dead poet society'),
('Inception', 'Remember me'),
('Remember me', 'The longest ride'),
('Inception', 'The longest ride'),
('Dead poet society', 'Gone with the wind'),
('Before sunrise', 'Gone with the wind'),
('Gone with the wind', 'Midnight in Paris'),
('Gone with the wind', 'Mean girls'),
('Before sunrise', 'Dead poet society'),
('Dead poet society', 'Midnight in Paris'),
('Dead poet society', 'Mean girls'),
('Before sunrise', 'Midnight in Paris'),
('Before sunrise', 'Mean girls'),
('Mean girls', 'Midnight in Paris')]
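Concatenating many lists with .sum() copies the growing list again and again, which can become slow when there are many groups. As an alternative sketch (not the answerer's code), the pairs can be fed into a Counter group by group. The small `df_demo` frame below is a made-up stand-in for the question's data, and the `set()` call assumes a title repeated inside one source should count only once.

```python
import pandas as pd
from collections import Counter
from itertools import combinations

# Small illustrative stand-in for the question's DataFrame (assumed rows).
df_demo = pd.DataFrame({
    "Title": ["Dead poet society", "Before sunrise", "Blood diamond",
              "Blood diamond", "Before sunrise", "Before sunrise"],
    "Document_Source": ["A", "A", "A", "B", "B", "B"],
})

pair_counts = Counter()
for _, titles in df_demo.groupby("Document_Source")["Title"]:
    # set() drops a title repeated within one source;
    # sorted() makes every generated pair come out in a fixed order.
    unique_titles = sorted(set(titles))
    pair_counts.update(combinations(unique_titles, 2))

print(pair_counts[("Before sunrise", "Blood diamond")])  # 2
```

Without the `set()` step, the duplicated "Before sunrise" in source B would add an extra count for that pair and also create a pair of the title with itself.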
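Finally, Counter counts the occurrences of each pair in the combined list.
Create a reproducible df using to_dict. Give the expected output to help people understand your problem.
Can you provide some data using df.head(30).to_dict()?
I have provided the data. Thanks.
Please explain this line of code: comb = df.groupby(['Document_Source'])["Title"].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]).sum()
@SanchariGhosh If you add the data, as I mentioned in my comment, then I can use that data to explain it to you and to future viewers.
What data do you need?
If I want the result in a list format, what should I change? I also want to include all pairs, i.e. for pairs with no common document source it should show 0.
@SanchariGhosh Sorry, I have been busy lately and could not help.
Unable to get the desired output. Please explain what you are doing here.
paircalc computes the number of common sources for a single pair. Then I call paircalc for each combination. My result above agrees with yours. Why is the output not what you want?
Here is another solution that I find more intuitive. One main difference is that my result dictionary uses frozenset keys, so a key does not depend on order, i.e. result[frozenset({'A', 'B'})] == result[frozenset({'B', 'A'})].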
import pandas as pd
from itertools import combinations
from collections import Counter, defaultdict
def paircalc(a, b):
    a_sources = set(df.loc[df.Title == a, 'Document_Source'])
    b_sources = set(df.loc[df.Title == b, 'Document_Source'])
    return len(a_sources & b_sources)
result = defaultdict(int)
for comb in combinations(set(df.Title), 2):
    result[frozenset(comb)] = paircalc(*comb)
# defaultdict(int,
# {frozenset({'A beautiful mind', 'Marley and me'}): 0,
# frozenset({'A beautiful mind', 'Finding Dory'}): 0,
# frozenset({'A beautiful mind', 'The longest ride'}): 1,
# frozenset({'A beautiful mind', 'Remember me'}): 0,
# frozenset({'A beautiful mind', 'Gone with the wind'}): 0,
# frozenset({'A beautiful mind', 'Mean girls'}): 0,
# frozenset({'A beautiful mind', 'Dead poet society'}): 0,
# ...
# frozenset({'Before sunrise', 'Dead poet society'}): 2,
# frozenset({'Blood diamond', 'Dead poet society'}): 2,
# frozenset({'Inception', 'Midnight in Paris'}): 0,
# frozenset({'Before sunrise', 'Midnight in Paris'}): 1,
# frozenset({'Blood diamond', 'Midnight in Paris'}): 0,
# frozenset({'Before sunrise', 'Inception'}): 0,
# frozenset({'Blood diamond', 'Inception'}): 0,
# frozenset({'Before sunrise', 'Blood diamond'}): 2})
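The comments ask for the result in a list format that also includes pairs with a count of 0. One possible sketch, which is a different technique from either answer, builds a source-by-title presence table with pd.crosstab and multiplies it by its own transpose. The frame `df_demo` and the column names `Title_1`, `Title_2` and `Count` are assumptions chosen for illustration.

```python
import pandas as pd
from itertools import combinations

# Small illustrative stand-in for the question's DataFrame (assumed rows).
df_demo = pd.DataFrame({
    "Title": ["Dead poet society", "Before sunrise", "Finding Dory",
              "Dead poet society", "Before sunrise", "Inception"],
    "Document_Source": ["A", "A", "A", "B", "B", "C"],
})

# Rows are sources, columns are titles; 1 if the title occurs in the source.
presence = pd.crosstab(df_demo["Document_Source"], df_demo["Title"]).clip(upper=1)

# Entry (i, j) is the number of sources containing both title i and title j.
cooc = presence.T.dot(presence)

# One row per unordered pair of distinct titles, zeros included.
rows = [(a, b, int(cooc.loc[a, b])) for a, b in combinations(cooc.index, 2)]
pairs_table = pd.DataFrame(rows, columns=["Title_1", "Title_2", "Count"])
print(pairs_table)
```

Because every pair of distinct titles is enumerated from the matrix index, pairs that never share a source appear with a count of 0 instead of being left out. The matrix has one row and one column per title, so this approach suits a moderate number of titles.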