Python: count duplicate list items per group in a DataFrame

python pandas

I have a DataFrame that currently looks like this:

image         source                               label
bookshelf     A                      [flora, jar, plant]
bookshelf     B                    [indoor, shelf, wall]
bookshelf     C             [furniture, shelf, shelving]
cactus        A                     [flora, plant, vine]
cactus        B                [building, outdoor, tree]
cactus        C                  [home, house, property]
cars          A          [parking, parking lot, vehicle]
cars          B                     [car, outdoor, tree]
cars          C            [car, motor vehicle, vehicle]

What I want is, for each source, the count of labels it shares with other sources of the same image, i.e.:

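1. For image bookshelf, sources B and C share the label "shelf" (B += 1; C += 1).
2. For image cactus, no sources share a label.
3. For image cars, sources B and C share the label "car" (B += 1; C += 1), and sources A and C share the label "vehicle" (A += 1; C += 1).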

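The result object would be the number of times each source shares a label. In the example above, (1) increments the counts of B and C by 1 each, and (3) increments the counts of B and C by 1 each and the counts of A and C by 1 each:

{'A': 1, 'B': 2, 'C': 3}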
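To make the counting rule concrete, here is a minimal pure-Python sketch applied to the cars image only. The helper name `shared_label_counts` and the dict layout are assumptions made for illustration, not part of the original data model; it compares every pair of sources and credits both sides for each label they have in common.

```python
from collections import defaultdict
from itertools import combinations

# Labels per source for the "cars" image, copied from the table above.
cars = {
    "A": ["parking", "parking lot", "vehicle"],
    "B": ["car", "outdoor", "tree"],
    "C": ["car", "motor vehicle", "vehicle"],
}


def shared_label_counts(labels_by_source):
    """Hypothetical helper: per source, count labels shared with other sources."""
    counts = defaultdict(int)
    # Look at every unordered pair of sources exactly once.
    for (src_a, labels_a), (src_b, labels_b) in combinations(labels_by_source.items(), 2):
        shared = set(labels_a) & set(labels_b)
        # Each shared label counts once for both sources of the pair.
        counts[src_a] += len(shared)
        counts[src_b] += len(shared)
    return dict(counts)


cars_counts = shared_label_counts(cars)
print(cars_counts)  # {'A': 1, 'B': 1, 'C': 2}
```

Adding the bookshelf contribution (B += 1; C += 1) to these per-image counts gives the overall result shown above.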

Reproducible example:

from pandas import DataFrame

df = DataFrame({
    'image': ['bookshelf', 'bookshelf', 'bookshelf',
              'cactus', 'cactus', 'cactus',
              'cars', 'cars', 'cars'],
    'source': ['A', 'B', 'C',
               'A', 'B', 'C',
               'A', 'B', 'C'],
    'label': [
        ['flora', 'jar', 'plant'],
        ['indoor', 'shelf', 'wall'],
        ['furniture', 'shelf', 'shelving'],
        ['flora', 'plant', 'vine'],
        ['building', 'outdoor', 'tree'],
        ['home', 'house', 'property'],
        ['parking', 'parking lot', 'vehicle'],
        ['car', 'outdoor', 'tree'],
        ['car', 'motor vehicle', 'vehicle']]
    },
    columns=['image', 'source', 'label']
)

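Although each source/image pair usually has 3 labels, this is not guaranteed.

Any ideas on how to achieve this with good performance? I have a few million of these records to process…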

This should do the job:

from collections import Counter

sources = df['source'].unique()
output = {source: 0 for source in sources}
for image, sub_df in df.groupby('image'):
    # Count how often each label occurs across all sources of this image.
    counts = Counter(sub_df['label'].sum())
    for _, source, labels in sub_df.itertuples(index=False):
        for label in labels:
            # Every other occurrence of this label is one shared match.
            output[source] += counts[label] - 1
print(output)
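Since the question mentions millions of records, a vectorized variant may be worth trying. The following is only a sketch of one alternative, not the answerer's method: it assumes pandas 0.25 or later (for `DataFrame.explode`), rebuilds the question's sample data so that it runs on its own, and introduces the names `long_df`, `shared`, and `result` purely for illustration. Whether it is actually faster on real data would need to be measured.

```python
import pandas as pd

# Sample data with the same content as the question's reproducible example.
df = pd.DataFrame({
    "image": ["bookshelf"] * 3 + ["cactus"] * 3 + ["cars"] * 3,
    "source": ["A", "B", "C"] * 3,
    "label": [
        ["flora", "jar", "plant"],
        ["indoor", "shelf", "wall"],
        ["furniture", "shelf", "shelving"],
        ["flora", "plant", "vine"],
        ["building", "outdoor", "tree"],
        ["home", "house", "property"],
        ["parking", "parking lot", "vehicle"],
        ["car", "outdoor", "tree"],
        ["car", "motor vehicle", "vehicle"],
    ],
})

# One row per (image, source, label) instead of one list per row.
long_df = df.explode("label").reset_index(drop=True)

# For each row, how many rows of the same image carry the same label.
per_label = long_df.groupby(["image", "label"])["source"].transform("count")

# A row shares its label with (per_label - 1) other rows of that image.
long_df["shared"] = per_label - 1

# Sum the shared matches per source and convert to plain Python ints.
result = {src: int(n) for src, n in long_df.groupby("source")["shared"].sum().items()}
print(result)  # {'A': 1, 'B': 2, 'C': 3}
```

Like the loop above, this counts a label that a single source lists twice for the same image as a match with itself; if that can occur in the data, calling `drop_duplicates()` on the exploded frame before counting is one way to avoid it.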

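How are the labels represented? Strings? Lists?

label is a list of strings.

Can you add code to generate the data?

Concise, and a nice showcase of Counter and itertuples.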