How to find the most frequently occurring set of word pairs in a file using Python?

python python-2.7 word-count

My dataset looks like this:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

and so on.

I want to find the most frequently occurring word pairs, for example:

(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)

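The two words can appear in any order and at any distance from each other.

Can anyone suggest a possible solution in Python? This is a very large dataset.

Any suggestions would be greatly appreciated.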

This is what I tried after @275365's suggestion, using input read from a file:

    def collect_pairs(file):
        pair_counter = Counter()
        for line in open(file):
            unique_tokens = sorted(set(line))  
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
            print pair_counter

    file = ('myfileComb.txt')
    p=collect_pairs(file)

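The text file has the same number of lines as the original file, but each line contains only unique tokens. I don't know what I am doing wrong: when I run this, it splits the words into letters instead of producing combinations of words.

Depending on the size of your corpus, you could start with something like this: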

>>> from itertools import combinations
>>> from collections import Counter

>>> def collect_pairs(lines):
...     pair_counter = Counter()
...     for line in lines:
...         # exclude duplicates within a line; sort so that a pair always has the same order
...         unique_tokens = sorted(set(line))
...         combos = combinations(unique_tokens, 2)
...         pair_counter += Counter(combos)
...     return pair_counter

The result:

>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]

Do you want the numbers included in these combinations? Since you did not specifically mention excluding them, I have included them here.
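
If the leading numeric IDs should not take part in the pairs, one possible variation is to drop purely numeric tokens before building the combinations. This is only a sketch: the function name collect_pairs_without_ids and the rule "a token made up only of digits is an ID" are assumptions for illustration, not something stated in the question.

```python
from itertools import combinations
from collections import Counter


def collect_pairs_without_ids(lines):
    """Count co-occurring token pairs per line, skipping all-digit tokens."""
    pair_counter = Counter()
    for line in lines:
        # Assumption: a token consisting only of digits is a record ID.
        tokens = set(token for token in line if not token.isdigit())
        # Sorting gives every unordered pair one canonical tuple.
        pair_counter.update(combinations(sorted(tokens), 2))
    return pair_counter


t2 = [
    ['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'],
    ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'],
    ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'],
    ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'],
    ['136', 'MachineLearning', 'AI', 'TopGear'],
]
pairs = collect_pairs_without_ids(t2)
```

With this sample data, pairs[('AI', 'MachineLearning')] evaluates to 3. Because the tokens are sorted, each pair is stored in alphabetical order.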

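Edit: working with a file object

The function you posted in your first attempt above is very close to working. The only thing you need to do is convert each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (quotes around each term, terms separated by commas), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you may need some form of regular expression.) See below for a modified version that uses ast.literal_eval: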

from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # these lines are each simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # eval will convert each line into a tuple before converting the tuple to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object

Now you can test it like this:

file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10)  # for example
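
If the lines cannot be trusted to be valid Python literals, one possible alternative to ast.literal_eval is the standard csv module, which parses quoted, comma-separated fields without evaluating anything. This is a swapped-in technique, not part of the answer above. The function name collect_pairs_csv is hypothetical, and skipinitialspace=True is used on the assumption that some fields are preceded by a space, as in the sample data.

```python
import csv
from itertools import combinations
from collections import Counter


def collect_pairs_csv(file_name):
    """Count token pairs per line, parsing each line as a CSV record."""
    pair_counter = Counter()
    with open(file_name) as handle:
        # skipinitialspace=True lets '"a", "b"' parse as two quoted fields.
        for row in csv.reader(handle, skipinitialspace=True):
            # Each row is a list of strings, so set(row) holds whole tokens,
            # not single characters as set(line) does for a raw string.
            unique_tokens = sorted(set(row))
            pair_counter.update(combinations(unique_tokens, 2))
    return pair_counter
```

A blank line produces an empty row here, which simply contributes no pairs, whereas ast.literal_eval would raise an error on it.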

There is not much you can do apart from counting all the pairs.

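The obvious optimizations are to remove duplicate words and synonyms as early as possible, to perform stemming (anything that reduces the number of distinct tokens is good!), and to count only pairs (a, b) where a < b.

Is "Two and half men" one token, or five tokens?

It is one token; anything between " " is a single token.

What have you tried?

Using brute force is an option: I would compute the number of possible permutations of unique tokens and go through the whole dataset. But this fails even if the dataset is only slightly large.

What do you mean by "can be ... at any distance from each other"? In your example dataset, is (Statistics, TopGear) a pair? Or can we assume that only words from the same line can be paired?

I will try this solution as soon as possible. How do you expect it to scale to a large dataset with 10K unique tokens and a 1 TB file that has multiple pairs of tokens on the same line?

@user3197086 Wow, 1 TB, eh? It may take a while to find all the combinations present in 1 TB of data with multiple combinations per line. You could also explore set intersections to avoid exploring each combination individually.

10K unique tokens is only 100M combinations, and you won't have that many because the combinations will be sparse. In the code above, you should use combinations(set(line), 2) instead and drop the sorted. On the other hand, 1 TB may require you to parallelize things on a cluster (depending on whether this is a one-off task or an ongoing one).

@Tom Morris made some very good suggestions. I changed my answer accordingly.

@user3197086 I edited my answer with a corrected version of your function. Let me know if you can't trust the data and need to use a regex.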
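
The scaling concerns raised in the comments can be illustrated with a small sketch: pair counts from separate chunks of the input are ordinary Counter objects, so they can be computed independently and added together afterwards, which is roughly the shape a cluster job would take. Everything here is an assumption for illustration (the names count_chunk and merge_counts, and the in-memory lists standing in for chunks of a file); it does not show how to distribute the work itself.

```python
from itertools import combinations
from collections import Counter


def count_chunk(lines):
    """Count sorted, de-duplicated token pairs for one chunk of lines."""
    counter = Counter()
    for tokens in lines:
        # update() counts straight from the iterator, avoiding a temporary
        # Counter per line.
        counter.update(combinations(sorted(set(tokens)), 2))
    return counter


def merge_counts(partial_counts):
    """Add up per-chunk counters into one total."""
    total = Counter()
    for partial in partial_counts:
        # Counter.update with a mapping adds counts instead of replacing them.
        total.update(partial)
    return total


# Hypothetical chunks; in practice each would come from a slice of the file.
chunk_a = [['Narnia', 'Statistics', 'Estimation'], ['MachineLearning', 'AI']]
chunk_b = [['Statistics', 'Estimation'], ['AI', 'MachineLearning', 'TopGear']]
total = merge_counts([count_chunk(chunk_a), count_chunk(chunk_b)])
```

Merging the per-chunk counters gives the same totals as counting all lines in one pass, so the chunks can be processed in any order or on different machines.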