Extracting repeated phrases from a text file in Python
I have a huge text file of conversations (one big block of text), and I want to extract the repeated phrases (longer than one word) into another text file, sorted by frequency.
Desired output:
I don't know 7345
I want you to 5312
amazing experience 625
I am looking for a Python script.
I tried the script below, but it only gives me single words, sorted from most to least frequent:
from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')
import collections
import pandas as pd
import matplotlib.pyplot as plt
# Read input file, note the encoding is specified here
# It may be different in your text file
file = open('test2.txt', encoding="utf8")
a= file.read()
# Stopwords
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr','mrs','one','two','said']))
# Instantiate a dictionary, and for every word in the file,
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To avoid counting the same word twice, lowercase the text and strip punctuation.
for word in a.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)
# Close the file
file.close()
# Create a data frame of the most common words
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')
I think you can use nltk.ngrams from the nltk package:
import nltk

text = 'I have been I have I like this I have never been.'
# Build all bigrams, then count how often each one occurs
ngrams = tuple(nltk.ngrams(text.split(' '), n=2))
ngrams_count = {i : ngrams.count(i) for i in ngrams}
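This gives a dictionary that maps each bigram to its count.
You can then save it with pandas, as a text file, as JSON, and so on.
If you change n in nltk.ngrams, your n-grams will have a different length.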
The code can be modified to:
import nltk
import pandas as pd

text = 'I have been I have I like this I have never been.'
lengths = [2, 3, 4]  # n-gram sizes to count
ngrams_count = {}
for n in lengths:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    ngrams_count.update({' '.join(i) : ngrams.count(i) for i in ngrams})
# Build a DataFrame sorted by descending count
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())),
                  columns=['Ngramm', 'Count']).sort_values(['Count'],
                                                           ascending=False)
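The resulting DataFrame is sorted by count (its printed output appears at the end of this page).
Now we can supply the values of n and get a sorted DataFrame. If needed, you can save it with df.to_csv('file_name.csv'), or show only the first rows with df.head(10).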
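One thing to keep in mind with the code above: ngrams.count(i) rescans the whole tuple for every n-gram, so the running time grows roughly quadratically with the number of tokens. The following is a minimal sketch of the same counting done with collections.Counter from the standard library. The helper name count_ngrams and its lengths parameter are made up for illustration, and the sliding window is built with zip rather than nltk.ngrams, on the assumption that plain whitespace tokens are enough.

```python
from collections import Counter

def count_ngrams(tokens, lengths=(2, 3, 4)):
    """Count every n-gram of the given lengths (hypothetical helper)."""
    counts = Counter()
    for n in lengths:
        # Zipping n shifted views of the token list yields each window
        # of n consecutive tokens exactly once.
        windows = zip(*(tokens[i:] for i in range(n)))
        counts.update(' '.join(window) for window in windows)
    return counts

text = 'I have been I have I like this I have never been.'
counts = count_ngrams(text.split())

# Print the three most frequent phrases
for phrase, count in counts.most_common(3):
    print(phrase, count)
```

Counter.most_common() already returns the phrases sorted by descending count, so the result can be passed to pandas or written out directly.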
To use the nltk/pandas solution above, you need to install nltk and pandas.

You can use str.count() to count the phrases in a string:
s = 'vash the vash the are you is he where did where did'
print('the how: {}'.format(s.count('where did')))
print('vash the: {}'.format(s.count('vash the')))
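Note that str.count() counts non-overlapping substring matches, so a short phrase is also counted when it appears inside longer words. A rough sketch of a whole-word alternative using the standard re module follows. The helper count_phrase and the sample sentence are invented for illustration, and the approach assumes the phrase starts and ends with word characters so that \b applies.

```python
import re

def count_phrase(text, phrase):
    """Count occurrences of phrase that begin and end on word boundaries."""
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return len(re.findall(pattern, text))

sample = 'there is the other theme'
print(sample.count('the'))          # substring matches, including inside words
print(count_phrase(sample, 'the'))  # whole-word matches only

s = 'vash the vash the are you is he where did where did'
print(count_phrase(s, 'where did'))
```

Both approaches still require knowing the phrase in advance, so they are better suited to checking specific phrases than to discovering repeated ones.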
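Comments:

Can you show us an example of what you have tried? SO is a place to get help with code and problems, not to ask someone to do the work for you.

This kind of problem is usually solved with the functionality in the regex module. Provide an example of the input and the expected output to get a complete answer.

I suggest reading about Markov chains to understand how to approach this task.

Could you change the code to list only the phrases that are repeated at least twice?

@SyedOmer That is easy to change. Just add
df=df[df.count>1]
pandas is a great library. For example, you can also extract the n-grams with a count of 10 as df[df.count==10]

TypeError: '>' not supported between instances of 'method' and 'int'

@SyedOmer Sorry. Change it to df=df[df.Count>1]
because we have a column named 'Count', whereas count is a DataFrame method =)

Do you know how to convert this to C? The script is perfect, but it takes too long.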
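To tie the filtering discussed in the comments back to the original request (phrases with their counts in a separate text file, sorted by frequency), here is a small sketch. The function write_repeated, the min_count parameter, the output file name and the 'only once' entry are all assumptions made for illustration; the other counts are taken from the desired output in the question.

```python
import os
import tempfile
from collections import Counter

def write_repeated(counts, path, min_count=2):
    """Write 'phrase count' lines for phrases seen at least min_count times."""
    # most_common() with no argument returns all items, highest count first
    repeated = [(phrase, count)
                for phrase, count in Counter(counts).most_common()
                if count >= min_count]
    with open(path, 'w', encoding='utf8') as out:
        for phrase, count in repeated:
            out.write('{} {}\n'.format(phrase, count))
    return repeated

# Illustrative counts; 'only once' is a made-up entry that should be dropped
example_counts = {
    'amazing experience': 625,
    "I don't know": 7345,
    'only once': 1,
    'I want you to': 5312,
}
# Hypothetical output location in the system temp directory
path = os.path.join(tempfile.gettempdir(), 'repeated_phrases.txt')
rows = write_repeated(example_counts, path)
print(rows)
```

This is the plain-Python counterpart of df[df.Count>1] followed by writing the rows to a file.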
Output of the DataFrame example (sorted by Count):
    Ngramm             Count
0   I have             3
1   have been          1
26  this I have never  1
25  like this I have   1
24  I like this I      1
23  have I like this   1
22  I have I like      1
21  been I have I      1
20  have been I have   1
19  I have been I      1
18  have never been.   1
17  I have never       1
...
Output of the str.count() example (the first label reads 'the how', but that call counts 'where did'):
the how: 2
vash the: 2