Python: appending key values in a co-occurrence


I have a text corpus: it comes from a file containing various sentences and paragraphs.

Here is my code:

import re
import nltk
from nltk.tokenize import RegexpTokenizer
import math
from collections import Counter

# Requires the NLTK stopword corpus: nltk.download('stopwords')
with open("descriptionsample.tsv", "r") as openfile:
    frequency = Counter()
    stopwords = nltk.corpus.stopwords.words('english')
    tokenizer = RegexpTokenizer(r"[\w’]+", flags=re.UNICODE)
    for line in openfile:
        words = line.lower().strip()
        # Drop digits and punctuation, then turn hyphens into spaces
        words = re.sub(r'[0-9~`@#$%^&*()_+={}\[\]\\<>,.?/;:]', '', words).replace('-', ' ')
        tokens = tokenizer.tokenize(words)
        tokens = [token for token in tokens if token not in stopwords]
        # Accumulates counts across all lines of the file
        frequency.update(tokens)
However, suppose the word frequencies for the first line of the document come out as:

{'code':10,'sql':3,'python':2........}
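What I want to do is build a co-occurrence matrix from tuples per document (rather than from bigrams/trigrams etc.) and then collect the totals at the end. Essentially, I want to attach each key's count to a new tuple of the form (Key1, Key2): value of Key2, where Key2 can even be the same as Key1.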

So, after counting word frequencies for each line of the tsv file, I would like the per-line result to look like this:

{('code','code'):10,('code','sql'):3,('code','python'):2,('sql','code'):10,('sql','sql'):3,('sql','python'):2,('python','code'):10,('python','sql'):3,('python','python'):2}

I can't figure it out. Any help? Maybe I'm overlooking another library that can do this on its own.
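
One way to read the expected output in the question: every ordered pair (key1, key2), including the case key1 == key2, maps to the count of key2. The following is a minimal sketch under that reading, using itertools.product. The function name pair_counts is hypothetical, and the sample counts are the ones from the question's example.

```python
from itertools import product


def pair_counts(freq):
    """Map every ordered pair (key1, key2) to the count of key2.

    Self-pairs such as ('code', 'code') are included.
    """
    return {(k1, k2): freq[k2] for k1, k2 in product(freq, repeat=2)}


line_freq = {'code': 10, 'sql': 3, 'python': 2}
pairs = pair_counts(line_freq)
print(pairs[('sql', 'code')])  # 10, the count of 'code'
```

With three keys this yields nine entries, matching the nine tuples listed in the question.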

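A colleague helped me figure it out. I initially tried layer upon layer of nested dictionaries, but traversing them would have been a nightmare. So here is a simpler, more efficient way to solve my problem: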

doc2= {
'a': 1,
'b': 2,
'c': 3,
'd': 4,
'e': 5
}

res = {}

for key1 in doc2.keys():
    for key2 in doc2.keys():
        if key1 != key2:
            res[(key1, key2)] = doc2[key2]


for key in res:
    print("[{}, {}] = {}".format(key[0], key[1], res[key]))
Result:

[b, c] = 3
[d, a] = 1
[b, a] = 1
[d, c] = 3
[e, d] = 4
[c, d] = 4
[d, e] = 5
[c, e] = 5
[e, c] = 3
[c, a] = 1
[a, d] = 4
[e, b] = 2
[a, e] = 5
[d, b] = 2
[c, b] = 2
[a, b] = 2
[e, a] = 1
[b, e] = 5
[a, c] = 3
[b, d] = 4
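
The question also asks for the per-line pairs to be summed at the end. The loop above handles a single dictionary, so the following is a hedged sketch of how it might be extended across lines with a Counter. The function name cooccurrence_totals and the sample lines are hypothetical, and plain whitespace splitting stands in for the tokenizer and stopword filtering from the question. Like the loop above, it skips pairs where key1 == key2.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_totals(lines):
    """Sum per-line pair counts over all lines.

    For each line, every ordered pair (key1, key2) with key1 != key2
    contributes the count of key2 within that line.
    """
    totals = Counter()
    for line in lines:
        # Simplified tokenization (assumption): lowercase and split on whitespace
        freq = Counter(line.lower().split())
        for k1, k2 in permutations(freq, 2):
            totals[(k1, k2)] += freq[k2]
    return totals


sample = ["code sql code", "sql python"]
totals = cooccurrence_totals(sample)
print(totals[('sql', 'code')])    # 2
print(totals[('python', 'sql')])  # 1
```

To include self-pairs as in the question's expected output, itertools.product(freq, repeat=2) could replace permutations(freq, 2).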