Python: comparing n-grams across multiple text files

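First-time poster here. I am new to Python and have limited programming skills. Ultimately, I am trying to identify and compare n-grams across a large number of text documents in the same directory. My analysis is somewhat similar to plagiarism detection: I want to compute the percentage of text documents in which a particular n-gram can be found. For now, I am working on a simpler version of the larger problem and trying to compare the n-grams in two text documents. I have no trouble identifying the n-grams, but I am struggling to compare them between the two documents. Is there a way to store the n-grams in a list so that I can efficiently check which n-grams appear in both documents? Here is what I have so far (please forgive the naive coding). For reference, I have provided some basic sentences below rather than the text documents my code actually reads.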

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        if each_gram in trigrams2:
            print (each_gram)
    return False

Thanks, everyone, for your help.

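In the compare function, use a list, for example one named common. Append every n-gram that the two trigram lists share to this list, and return the list at the end, like this: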

>>> # lower() ignores sentence case; list() materializes the n-grams so they can be scanned repeatedly.
>>> trigrams1 = list(ngrams(text1.lower().split(), n))
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...     common = []
...     for grams1 in trigrams1:
...         if grams1 in trigrams2:
...             common.append(grams1)
...     return common
...
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
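
The loop above scans the second list once for every trigram in the first, which can get slow on longer documents. A minimal sketch of a set-based variant follows. The helper names word_ngrams and common_ngrams are made up for illustration, and the helper builds the n-grams with zip instead of nltk.util.ngrams so that the snippet runs without NLTK installed.

```python
def word_ngrams(text, n):
    """Hypothetical helper: lowercase the text, split on whitespace,
    and return the set of n-gram tuples it contains."""
    tokens = text.lower().split()
    # zip(tokens, tokens[1:], ..., tokens[n-1:]) slides a window of n words.
    return set(zip(*(tokens[i:] for i in range(n))))


def common_ngrams(text1, text2, n=3):
    """Return the n-grams that occur in both texts, using set intersection."""
    return word_ngrams(text1, n) & word_ngrams(text2, n)


shared = common_ngrams('Hello my name is Jason', 'My name is not Mike')
print(shared)
```

Sets discard duplicates and ordering, so this only fits when the question is whether an n-gram occurs in both documents, not how often or where.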

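I think it may be easier to join the elements of each n-gram into a single string, collect those strings in a list, and then compare the lists.

Let's walk through the process using the example you provided.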

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

After applying the ngrams function from nltk (and converting the result to a list), you get the following two lists, which I have named text1 and text2:

text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

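When you compare n-grams, you should lowercase every element so that 'my' and 'My' are not treated as separate tokens, which is clearly not what we want.

The function below does exactly that, and also joins each n-gram into one string.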

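def append_elements(n_gram):
    # Replace each n-gram tuple with one lowercase, space-separated string.
    # Note: this modifies the list in place, so n_gram must be a list
    # (not a generator) and should only be passed through this function once.
    for index in range(len(n_gram)):
        n_gram[index] = ' '.join(n_gram[index]).lower()
    return n_gram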

So if we feed it text1, we get ['hello my name', 'my name is', 'name is jason'], which is much easier to work with.

Next comes the compare function. You were right that a list can be used to store the commonalities. I have named it common here:

def compare(n_gram1, n_gram2):
    n_gram1 = append_elements(n_gram1)
    n_gram2 = append_elements(n_gram2)
    common = []
    for phrase in n_gram1:
        if phrase in n_gram2:
            common.append(phrase)
    if not common:
        return False
        # or you could print a message saying no commonality was found
    else:
        for i in common:
            print(i)

The check "if not common" means that if the common list is empty, the function returns False (you could print a message there instead).

Now, in your example, when we run compare(text1, text2), the only commonality is:

>>> 
my name is
>>>

This is the correct answer.

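I came across this old thread while working on a task very similar to yours. The approach seems to work well, except for one bug. I am adding this answer in case anyone else stumbles on this question. ngrams from nltk.util returns a generator object, not a list. It needs to be converted to a list before it can be used with the compare function you wrote. Use lower() for case-insensitive matching.

Full example: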

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.lower().split(), n)
trigrams2 = ngrams(text2.lower().split(), n)

def compare_ngrams(trigrams1, trigrams2):
    trigrams1 = list(trigrams1)
    trigrams2 = list(trigrams2)
    common=[]
    for gram in trigrams1:
        if gram in trigrams2:
            common.append(gram)
    return common

common = compare_ngrams(trigrams1, trigrams2)
print(common)

Output:

[('my', 'name', 'is')]
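
The original goal was the percentage of documents in a directory that contain a given n-gram. One way this could be sketched is below. The names word_ngrams, document_percentages and read_directory are hypothetical, the '*.txt' pattern and UTF-8 encoding are assumptions about the files, and the n-grams are built with zip rather than NLTK so the snippet is self-contained.

```python
from collections import Counter
from pathlib import Path


def word_ngrams(text, n):
    """Hypothetical helper: set of lowercase word n-grams in one text."""
    tokens = text.lower().split()
    return set(zip(*(tokens[i:] for i in range(n))))


def document_percentages(texts, n=3):
    """Map each n-gram to the percentage of texts that contain it.

    `texts` is a list of strings, one per document.
    """
    counts = Counter()
    for text in texts:
        # A set is used so each document counts at most once per n-gram.
        counts.update(word_ngrams(text, n))
    total = len(texts)
    return {gram: 100.0 * count / total for gram, count in counts.items()}


def read_directory(directory, pattern='*.txt'):
    """Assumed layout: one UTF-8 text file per document in `directory`."""
    paths = sorted(Path(directory).glob(pattern))
    return [path.read_text(encoding='utf-8') for path in paths]


sample = ['Hello my name is Jason', 'My name is not Mike']
percentages = document_percentages(sample)
print(percentages[('my', 'name', 'is')])
```

For real files, document_percentages(read_directory('some_folder')) would be the equivalent call, where 'some_folder' is a placeholder path.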

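Do you have an example of the input files, or a few lines from them?

The text documents I am reading are about 1-3 pages long. I have updated the question with a simple example of two short sentences for reference. Thanks.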

ngrams returns a generator object, not a list. The compare function will not work unless the result is first converted to a list.
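
To see why the conversion matters, here is a small sketch using a plain Python generator as a stand-in for the object returned by ngrams (word_ngrams is a hypothetical helper, not part of NLTK). A membership test with "in" consumes a generator as it searches, so a second test on the same generator no longer sees the items that were already read.

```python
def word_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-gram tuples."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


tokens = 'my name is not mike'.split()
target = ('name', 'is', 'not')

lazy = word_ngrams(tokens, 3)
first_check = target in lazy    # reads items until it finds the match
second_check = target in lazy   # the match has already been consumed

as_list = list(word_ngrams(tokens, 3))
repeat_checks = [target in as_list, target in as_list]  # a list can be scanned repeatedly

print(first_check, second_check, repeat_checks)
```

This is why a nested loop over two generators appears to miss matches: after the first pass, the inner generator is partly or fully used up.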