Count how many DataFrame rows contain value A, value B, and both A and B


string pandas dataframe


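I have a DataFrame "dfTags" with 140,000 rows (all lowercase). The number of comma-separated values in the "tags" column ranges from 71 down to 1. However, the tags column is a single string; pandas does not treat it as an array or list: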

index tags
0     a, b, c, aa, bb, 2019
1     a, d, 18, gb
2     aa, a, dd, fb, la
3     aa, d, ddaa, b, k, l
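Because each row holds one string, a first step could be to split it into exact tokens. The following is a minimal sketch using only the four sample rows; the name `tag_sets` is introduced here for illustration and is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame(
    {"tags": ["a, b, c, aa, bb, 2019", "a, d, 18, gb", "aa, a, dd, fb, la", "aa, d, ddaa, b, k, l"]}
)

# Split each string on commas and strip the surrounding spaces,
# so that every row becomes a set of exact tag tokens.
tag_sets = df["tags"].apply(lambda s: {tag.strip() for tag in s.split(",")})

print(tag_sets.iloc[3])  # contains "aa" and "ddaa", but not "a"
```

Sets make the later membership checks exact: `"a" in tag_sets.iloc[3]` is False even though the raw string contains the letter "a".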

and a set "tagTuples" containing 850,000 sorted tuples (all lowercase), generated from the tags in each row, such as:

(a, b), (b, c), (aa, c), (aa, bb), (2019, bb), (a, d), (18, d), (18, gb), (a, aa), (a, dd), (dd, fb), (fb, la), (aa, d), (d, ddaa), ...

I use a set because I removed every tag that occurs only once and then added every tuple I created, which drops duplicates automatically.
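Judging from the example tuples, each one appears to be a pair of neighbouring tags within a row, sorted alphabetically. That reading is an assumption, not something the question states. Under that assumption, the set could be built roughly like this (the removal of tags that occur only once is not reproduced here):

```python
rows = [
    "a, b, c, aa, bb, 2019",
    "a, d, 18, gb",
    "aa, a, dd, fb, la",
    "aa, d, ddaa, b, k, l",
]

tag_tuples = set()
for row in rows:
    tags = [tag.strip() for tag in row.split(",")]
    # zip(tags, tags[1:]) yields every pair of neighbouring tags in the row.
    for left, right in zip(tags, tags[1:]):
        # Sorting makes (b, a) and (a, b) the same tuple; the set drops repeats.
        tag_tuples.add(tuple(sorted((left, right))))

print(sorted(tag_tuples))
```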

For each tuple in "tagTuples", I need the following:

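e.g. (a, b)
How many rows in column "tags" contain "a"? (3)
How many rows in column "tags" that contain "a" also contain "b"? (1)
= 1/3 => 0.33
How many rows in column "tags" contain "b"? (2)
How many rows in column "tags" that contain "b" also contain "a"? (1)
= 1/2 => 0.5
This gives an edge weight between a and b of (0.33 + 0.5) * 100 = 83% (a modified Jaccard index).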
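Written as a formula, the four counts above collapse into three quantities. The symbols are notation introduced here, not taken from the question: R_a is the set of rows whose tags contain a, and R_b is the set of rows whose tags contain b.

```latex
w(a, b) = \left( \frac{|R_a \cap R_b|}{|R_a|} + \frac{|R_a \cap R_b|}{|R_b|} \right) \cdot 100
\qquad \text{for the sample:} \qquad
w(a, b) = \left( \frac{1}{3} + \frac{1}{2} \right) \cdot 100 \approx 83.3
```

The second and fourth questions ask for the same number, the size of the intersection, so only three counts are needed per tuple.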

Each result should then be pushed into the DataFrame dfTagTuple:

dfTagTuple = pd.DataFrame(columns=["Source", "Target", "Weight"])

where Source = tuple[0], Target = tuple[1], and Weight = the edge weight.

That gives me the edge connection and the edge weight between each pair of tags, so I can visualize them in Gephi as a tag network.

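But the tags column has dtype "object", because pandas does not know arrays. So how can I evaluate this formula for each tuple so that checking whether row["tags"] contains "a" does not also count "aa" / "ddaa" / "la"?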
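The over-counting described here can be reproduced on the four sample rows. This is only an illustration of the pitfall; the variable names `substring_hits` and `token_hits` are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {"tags": ["a, b, c, aa, bb, 2019", "a, d, 18, gb", "aa, a, dd, fb, la", "aa, d, ddaa, b, k, l"]}
)

# Plain substring search: the last row matches too, because "aa" and "ddaa" contain "a".
substring_hits = int(df["tags"].str.contains("a", regex=False).sum())

# Exact token search: split each string first, then test list membership.
token_hits = int(
    df["tags"].apply(lambda s: "a" in [tag.strip() for tag in s.split(",")]).sum()
)

print(substring_hits, token_hits)  # 4 3
```

Only the token-based count matches the expected value of 3 from the example.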

How can I perform these 4 checks and get the final result (0.833...) for each tuple in a performant way?

def calc_distance(tagLeft, tagRight):
    # how many times does "a" appear in tags per row?
    onlyTagLeft = ??
    # how many times does "b" appear in tags per row?
    onlyTagRight = ??
    # how many times does "a" and "b" appear together in tags per row?
    bothTags = ??
    edgeWeight = ((bothTags / onlyTagLeft) + (bothTags / onlyTagRight)) * 100
    # print(tagLeft, "#", tagRight, edgeWeight)
    print("{}: {}, {}: {}, bothTags: {}, weight: {}".format(tagLeft, onlyTagLeft, tagRight, onlyTagRight, bothTags,
                                                            edgeWeight))

df = pd.DataFrame([["a, b, c, aa, bb, 2019"], ["a, d, 18, gb"], ["aa, a, dd, fb, la"], ["aa, d, ddaa, b, k, l"]], columns=["tags"])
tagSet = {('aa', 'd'), ('a', 'aa'), ('a', 'd'), ('a', 'b')}

for tagTuple in tagSet:
    calc_distance(tagTuple[0], tagTuple[1])

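This is not a complete answer, but for each tag tuple (tt) it tells you how many times the first element of tt occurs and how many times both elements occur together. You can then do the calculation from there.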

import pandas as pd
df = pd.DataFrame({'tags': [['a', 'b', 'c', 'aa'], ['a', 'd'], ['aa', 'a', 'dd']]})
tt = [('a', 'b'), ('aa', 'c')]

for t in tt:
    el_1 = t[0]
    el_2 = t[1]
    only_el_1 = df.tags.apply(lambda x: el_1 in x).sum()
    both_el = df.tags.apply(lambda x: (el_1 in x) and (el_2 in x)).sum()

    print("First element of tuple {} is contained {} times and both elements are contained {} times".format(
        t, only_el_1, both_el))

Hope that helps.

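The "tags" column is a single string; pandas does not know lists or arrays here. In your case, counting "a" would also count "aa". I have extended my question to remove the ambiguity.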
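One possible way to combine the pieces of this thread, offered as a sketch rather than a confirmed solution: build an inverted index from each exact tag to the rows that contain it, then answer every tuple with a set intersection. The names `rows_by_tag` and `calc_weight` are hypothetical; `df`, `tagSet` and `dfTagTuple` follow the question. Performance on 140,000 rows and 850,000 tuples has not been measured here.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame(
    [["a, b, c, aa, bb, 2019"], ["a, d, 18, gb"], ["aa, a, dd, fb, la"], ["aa, d, ddaa, b, k, l"]],
    columns=["tags"],
)
tagSet = {("aa", "d"), ("a", "aa"), ("a", "d"), ("a", "b")}

# Inverted index: exact tag -> set of row labels whose tag string contains it.
# Built once, so each tuple later costs one set intersection instead of a full scan.
rows_by_tag = defaultdict(set)
for row_label, tag_string in df["tags"].items():
    for tag in tag_string.split(","):
        rows_by_tag[tag.strip()].add(row_label)


def calc_weight(tag_left, tag_right):
    """Return the edge weight from the question for one pair of tags."""
    left_rows = rows_by_tag.get(tag_left, set())
    right_rows = rows_by_tag.get(tag_right, set())
    if not left_rows or not right_rows:
        return 0.0  # assumption: an unknown tag yields weight 0 instead of an error
    both = len(left_rows & right_rows)
    return (both / len(left_rows) + both / len(right_rows)) * 100


# Collect all rows first and create the DataFrame in one go.
records = [(left, right, calc_weight(left, right)) for left, right in sorted(tagSet)]
dfTagTuple = pd.DataFrame(records, columns=["Source", "Target", "Weight"])
print(dfTagTuple)
```

Because the index stores whole tokens, "a" never matches "aa", "ddaa" or "la". Building the result from a list of records also avoids appending to the DataFrame row by row.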