Extracting repeated phrases from a text file in Python


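I have a huge text file of conversations (a block of text), and I want to extract the repeated phrases (more than one word) into another text file, sorted by frequency.

Input:

Output:

 I don't know 7345
 I want you to 5312
 amazing experience 625

I'm looking for a Python script.

I tried this script, but I can only get single words, sorted from most to least frequent: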

from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Read input file, note the encoding is specified here 
# It may be different in your text file
file = open('test2.txt', encoding="utf8")
a= file.read()

# Stopwords
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr','mrs','one','two','said']))

# Instantiate a dictionary, and for every word in the file, 
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}

# To avoid counting variants of the same word separately, lowercase the text and strip punctuation.
for word in a.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)

# Close the file
file.close()

# Create a data frame of the most common words 
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')

I think you can use nltk.ngrams from the nltk package.

import nltk

text = 'I have been I have I like this I have never been.'

ngrams = tuple(nltk.ngrams(text.split(' '), n=2))

ngrams_count = {i : ngrams.count(i) for i in ngrams}

Output:

You can then save it using pandas, a text file, JSON, etc.
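As a rough sketch of that saving step, assuming a dictionary of n-gram counts shaped like ngrams_count above: the tuple keys are joined into strings first, because JSON object keys must be strings. The helper name save_counts and the file names are made up for illustration.

```python
import json

def save_counts(ngrams_count, txt_path, json_path):
    """Write n-gram counts to a text file and a JSON file, most frequent first."""
    # Join each n-gram tuple into a phrase and sort by count, descending.
    rows = sorted(
        ((" ".join(gram), count) for gram, count in ngrams_count.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    # Plain text: one "phrase count" pair per line.
    with open(txt_path, "w", encoding="utf8") as f:
        for phrase, count in rows:
            f.write(f"{phrase} {count}\n")
    # JSON: a single object mapping phrase to count.
    with open(json_path, "w", encoding="utf8") as f:
        json.dump(dict(rows), f, ensure_ascii=False, indent=2)
    return rows

# Hypothetical counts in the same shape as ngrams_count above.
example = {("I", "have"): 3, ("have", "been"): 1}
rows = save_counts(example, "ngrams.txt", "ngrams.json")
```

The text file follows the "phrase count" layout asked for in the question.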

If you change n in nltk.ngrams, your n-grams will have a different length.

It can be modified to:

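import nltk
import pandas as pd

text = 'I have been I have I like this I have never been.'
lengths = [2, 3, 4]  # n-gram sizes to count
ngrams_count = {}
for n in lengths:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    # Key each n-gram by its joined text so it reads as a phrase
    ngrams_count.update({' '.join(i) : ngrams.count(i) for i in ngrams})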

df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())), 
                  columns=['Ngramm', 'Count']).sort_values(['Count'], 
                                                           ascending=False)

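Output (the resulting table is reproduced at the end of this page):

Now we can pass in several values of n and build a sorted DataFrame. If you want, you can save it with df.to_csv('file_name.csv'), or look at only the first rows with df.head(10).

To use this solution, you need to install nltk and pandas.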
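The ngrams.count(i) call above rescans the whole tuple for every n-gram, which can get slow on a large file. A possible alternative, sketched here under the assumption that plain whitespace splitting is good enough, counts everything in a single pass with collections.Counter and needs no third-party packages. The function name count_phrases and its parameters are hypothetical, not part of nltk or pandas.

```python
from collections import Counter

def count_phrases(text, lengths=(2, 3, 4), min_count=2):
    """Count word n-grams of the given lengths; keep those seen at least min_count times."""
    words = text.split()
    counts = Counter()
    for n in lengths:
        # Zipping n shifted views of the word list yields every n-gram exactly once.
        counts.update(" ".join(gram) for gram in zip(*(words[i:] for i in range(n))))
    # most_common() returns (phrase, count) pairs sorted by count, descending.
    return [(phrase, count) for phrase, count in counts.most_common() if count >= min_count]

text = 'I have been I have I like this I have never been.'
for phrase, count in count_phrases(text):
    print(phrase, count)  # prints: I have 3
```

With min_count=2, only phrases that actually repeat are kept. To produce the file described in the question, each pair could be written out as one "phrase count" line.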

You can use str.count() to count the occurrences of a phrase in a string:

s = 'vash the vash the are you is he where did where did'

print('the how: {}'.format(s.count('where did')))
print('vash the: {}'.format(s.count('vash the')))
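One caveat: str.count() counts non-overlapping substring matches, so it is case-sensitive and also matches inside longer words. If that matters, a sketch using the standard re module with word boundaries might look like the following. The helper name count_phrase is hypothetical, and the sketch assumes the phrase starts and ends with a word character.

```python
import re

def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    # re.escape keeps characters such as "." in the phrase from acting as regex syntax.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

s = 'vash the vash the are you is he where did where did'
print(count_phrase(s, 'where did'))  # 2
print(s.count('he'))                 # 5: also matches inside "the" and "where"
print(count_phrase(s, 'he'))         # 1: only the standalone word
```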

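Can you give us an example of what you have tried? SO is a place to get help with code and problems, not to ask someone to do the work for you.

This kind of problem is usually solved with the functionality in the regex module. Provide an example of the input and the expected output to get a complete answer.

I suggest reading about Markov chains to understand how to approach this task.

Can you change the code to list only phrases that are repeated at least twice?

@SyedOmer That is easy to change. Just add df=df[df.count>1]. pandas is a great library. For example, you can also extract the ngrams with a count of 10 with df[df.count==10].

TypeError: '>' not supported between instances of 'method' and 'int'

@SyedOmer Sorry. Change it to df=df[df.Count>1], because we have a column named 'Count', whereas count is a method of the dataframe =)

Do you know how to change this to C? The script is perfect, but it takes too long.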
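The TypeError in this exchange comes from attribute access: df.count resolves to the DataFrame.count method rather than to a column, and column names are case-sensitive. A small sketch with made-up data, using bracket indexing to sidestep the name clash:

```python
import pandas as pd

# Made-up counts, using the same column names as the answer above.
df = pd.DataFrame({"Ngramm": ["I have", "have been", "I like"],
                   "Count": [3, 1, 1]})

# Bracket indexing always selects the column, even when a method shares a similar name.
repeated = df[df["Count"] > 1]
print(repeated)

# df.count is a bound method, which is why comparing it with an int raises TypeError.
print(callable(df.count))  # True
```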

Output of the DataFrame example:

                Ngramm  Count
0               I have      3
1            have been      1
26   this I have never      1
25    like this I have      1
24       I like this I      1
23    have I like this      1
22       I have I like      1
21       been I have I      1
20    have been I have      1
19       I have been I      1
18    have never been.      1
17        I have never      1
...

Output of the str.count() example:

the how: 2
vash the: 2