Extracting repeated phrases from a text file in Python


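I have a huge text file of conversations (blocks of text), and I want to extract the repeated phrases (more than one word) into another text file, sorted by frequency.

Input:

Output:

 I don't know 7345
 I want you to 5312
 amazing experience 625
I am looking for a Python script.


I tried the script below, but it only gives me single words, sorted from most to least frequent.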

from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Read input file, note the encoding is specified here 
# It may be different in your text file
file = open('test2.txt', encoding="utf8")
a= file.read()

# Stopwords
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr','mrs','one','two','said']))

# Instantiate a dictionary, and for every word in the file, 
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}

# To avoid counting variants of the same word separately, lowercase the text and strip punctuation.
for word in a.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)

# Close the file
file.close()

# Create a data frame of the most common words 
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')

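I think you can use nltk.ngrams from the nltk package:

import nltk

text = 'I have been I have I like this I have never been.'

# Build every 2-word n-gram (bigram) from the space-separated tokens
ngrams = tuple(nltk.ngrams(text.split(' '), n=2))

# Map each distinct bigram to the number of times it occurs
ngrams_count = {i : ngrams.count(i) for i in ngrams}
Output:

You can then save the result with pandas, or as a text or JSON file.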
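As a minimal sketch of the "save it as text" option, the function below writes one "phrase count" line per entry, most frequent first, which is the layout the question asks for. The name write_counts, the min_count parameter, and the sample dictionary are illustrative assumptions, not part of the answer. It also assumes the n-gram tuples have already been joined into strings with ' '.join.

```python
import io

def write_counts(counts, out, min_count=1):
    """Write 'phrase count' lines to the file-like object `out`."""
    # Sort by count (descending), then alphabetically so ties are stable
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for phrase, count in rows:
        if count >= min_count:
            out.write(f"{phrase} {count}\n")

# Hypothetical counts, shaped like ngrams_count with string keys
example_counts = {"have been": 1, "I have": 3, "been I": 1}

# StringIO stands in for a real file here; with a file on disk you
# would pass open('phrases.txt', 'w', encoding='utf8') instead.
buffer = io.StringIO()
write_counts(example_counts, buffer)
result = buffer.getvalue()
print(result)
```

Passing min_count=2 would keep only phrases that actually repeat.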

By changing n in nltk.ngrams, you get n-grams of a different length.

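It can be modified as follows:

import nltk
import pandas as pd

text = 'I have been I have I like this I have never been.'
lengths = [2, 3, 4]  # n-gram sizes to count
ngrams_count = {}
for n in lengths:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    # Join each n-gram into a phrase and record how often it occurs
    ngrams_count.update({' '.join(i) : ngrams.count(i) for i in ngrams})

# One row per phrase, most frequent first
df = pd.DataFrame(list(ngrams_count.items()),
                  columns=['Ngramm', 'Count']).sort_values(['Count'],
                                                           ascending=False)
Output (the table appears at the end of this page):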

Now we can choose the values of n and get a sorted DataFrame. If needed, you can save it with df.to_csv('file_name.csv'), optionally keeping only the top rows first with df.head(10).


To use this solution, you need to have nltk and pandas installed.
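The ngrams.count(i) call in the code above rescans the whole tuple once per n-gram, so the running time grows roughly quadratically with the size of the text, which may explain why the script is reported as slow on large files. The sketch below is a swapped-in alternative, not the answer's method: it counts every n-gram in a single pass with collections.Counter and builds the n-grams with zip instead of nltk.ngrams. The function name count_phrases and its parameters are hypothetical.

```python
from collections import Counter

def count_phrases(text, lengths=(2, 3, 4), min_count=2):
    """Return (phrase, count) pairs seen at least min_count times."""
    tokens = text.split()
    counts = Counter()
    for n in lengths:
        # Zipping n staggered slices of the token list yields each
        # consecutive n-token window exactly once.
        for gram in zip(*(tokens[i:] for i in range(n))):
            counts[" ".join(gram)] += 1
    # most_common() orders the pairs from highest to lowest count
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

text = 'I have been I have I like this I have never been.'
repeated = count_phrases(text)
print(repeated)  # [('I have', 3)]
```

Like the original, this splits on whitespace only, so 'been.' and 'been' are still treated as different tokens.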

You can use str.count() to count how many times a phrase occurs in a string:

s = 'vash the vash the are you is he where did where did'

print('the how: {}'.format(s.count('where did')))
print('vash the: {}'.format(s.count('vash the')))
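One caveat with str.count() is that it matches substrings, so counting 'the' also counts the start of 'theory' or 'then'. A hedged sketch of a whole-word variant using the standard re module follows; the helper name count_phrase is an assumption made for illustration.

```python
import re

def count_phrase(text, phrase):
    """Count occurrences of phrase that start and end on word boundaries."""
    # re.escape keeps characters such as '.' or '?' in the phrase literal
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))

sample = 'the theory then the'
print(sample.count('the'))         # 4, includes 'theory' and 'then'
print(count_phrase(sample, 'the'))  # 2, whole words only

s = 'vash the vash the are you is he where did where did'
print(count_phrase(s, 'where did'))  # 2
```

Either approach needs the phrase to be known in advance, so it complements the n-gram counting rather than replacing it.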

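Can you show us an example of what you have tried? SO is a place to get help with code and problems, not to ask someone to do the work for you.

This kind of problem is usually solved with the functionality in the regex module. Provide an example of the input and the expected output to get a complete answer.

I suggest reading about Markov chains to understand how to approach this task.

Could you change the code to list only the phrases that are repeated at least twice?

@SyedOmer That is easy to change. Just add df=df[df.count>1]. pandas is a great library. For example, you can also extract the n-grams with a count of 10 with df[df.count==10].

TypeError: '>' not supported between instances of 'method' and 'int'

@SyedOmer Sorry. Change it to df=df[df.Count>1], because we have a column named 'Count', whereas count is a DataFrame method =)

Do you know how to port this to C? The script is perfect, but it takes too long.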
Output of the n-gram DataFrame example:

                Ngramm  Count
0               I have      3
1            have been      1
26   this I have never      1
25    like this I have      1
24       I like this I      1
23    have I like this      1
22       I have I like      1
21       been I have I      1
20    have been I have      1
19       I have been I      1
18    have never been.      1
17        I have never      1
...

Output of the str.count() example:

the how: 2
vash the: 2
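The filter discussed in the comments can be sketched as follows. Bracket indexing with df['Count'] avoids the clash with the DataFrame.count method that caused the TypeError. The small DataFrame here is a hand-built stand-in for the answer's df, using the same column names.

```python
import pandas as pd

# Stand-in for the answer's DataFrame, same column names
df = pd.DataFrame(
    [("I have", 3), ("have been", 1), ("I like", 1)],
    columns=["Ngramm", "Count"],
)

# df.count is a method, so df.count > 1 raises TypeError;
# df["Count"] always refers to the column.
repeated = df[df["Count"] > 1]
print(repeated)
```

The attribute form df.Count also works here, but the bracket form keeps working even when a column name collides with a DataFrame method.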