Python: creating an ARFF file from word frequencies
I have some code that gives me a list of words and how often each one appears in a text. I would like the code to automatically turn the top 10 words into an ARFF file with the following layout:

    @RELATION wordfrequencies
    @ATTRIBUTE word string
    @ATTRIBUTE frequency numeric
    @DATA
    (the top 10 words and their frequencies)

I am struggling to work out how to do this with my current code:
import re
import nltk
# Quran subset
filename = 'subsetQuran.txt'
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
    print word, freq
f = open("wordfreq.txt", "w")
f.write( str(freq_list3) )
f.close()
Any help is appreciated; working out how to do this is really racking my brain.

I hope you don't mind a slight rewrite:
import re
import nltk
from collections import defaultdict
# Quran subset
filename = 'subsetQuran.txt'
# create list of lower case words
word_list = open(filename).read().lower().split()
print 'Words in text:', len(word_list)
# remove stopwords
word_list = [w for w in word_list if w not in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = defaultdict(int)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # increment count for word
    freq_dic[word] += 1
print '-' * 30
print "sorted by highest frequency first:"
# create list of (frequency, word) tuple pairs
freq_list = [(freq, word) for word, freq in freq_dic.items()]
# sort by descending frequency
freq_list.sort(reverse=True)
# display result
for freq, word in freq_list:
    print word, freq
# write ARFF file for 10 most common words
f = open("wordfreq.txt", "w")
f.write("@RELATION wordfrequencies\n")
f.write("@ATTRIBUTE word string\n")
f.write("@ATTRIBUTE frequency numeric\n")
f.write("@DATA\n")
for freq, word in freq_list[:10]:
    f.write("'%s',%d\n" % (word, freq))
f.close()
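
For readers on Python 3, the same idea can be sketched with collections.Counter, which already provides a most_common() method for picking the top entries. The helper names top_words and to_arff are made up for this illustration and are not part of the code above. The sketch also strips punctuation before the stopword check and skips empty strings, and it backslash-escapes single quotes inside words, on the assumption that the ARFF reader (Weka, for example) accepts that escaping.

```python
import re
from collections import Counter

# Same punctuation/digit pattern as in the answer above.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def top_words(text, stopwords=frozenset(), n=10):
    """Return the n most frequent (word, count) pairs in text (hypothetical helper)."""
    counts = Counter()
    for raw in text.lower().split():
        word = PUNCTUATION.sub("", raw)
        # Skip tokens that were only punctuation, and skip stopwords.
        if word and word not in stopwords:
            counts[word] += 1
    return counts.most_common(n)


def to_arff(pairs, relation="wordfrequencies"):
    """Format (word, count) pairs as ARFF text (hypothetical helper)."""
    lines = [
        "@RELATION %s" % relation,
        "@ATTRIBUTE word string",
        "@ATTRIBUTE frequency numeric",
        "@DATA",
    ]
    for word, freq in pairs:
        # Assumption: the reader accepts backslash-escaped quotes in strings.
        escaped = word.replace("\\", "\\\\").replace("'", "\\'")
        lines.append("'%s',%d" % (escaped, freq))
    return "\n".join(lines) + "\n"


sample = "The cat sat. The cat ran, and the dog's cat sat!"
print(to_arff(top_words(sample, stopwords={"the", "and"}, n=2)))
```

To use it on the real text, read the file into a string and pass set(nltk.corpus.stopwords.words('english')) as stopwords; building the set once avoids reloading the stopword list for every word.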
Not sure whether this helps, but it may show you how to produce an ARFF for all the words, which you could then edit down to just the top 10?
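
If you do end up with an ARFF file that covers every word, trimming it to the top 10 could look roughly like the Python 3 sketch below. The function name trim_arff is hypothetical, and the sketch assumes the simple two-attribute layout from the question, where the frequency is the last comma-separated field on each data row.

```python
def trim_arff(arff_text, n=10):
    """Keep the header and only the n data rows with the highest frequency."""
    header, rows = [], []
    in_data = False
    for line in arff_text.splitlines():
        if in_data:
            # Ignore blank lines and ARFF comments (lines starting with %).
            if line.strip() and not line.lstrip().startswith("%"):
                rows.append(line)
        else:
            header.append(line)
            if line.strip().upper() == "@DATA":
                in_data = True
    # Assumption: the frequency is the last comma-separated field of a row.
    rows.sort(key=lambda row: float(row.rsplit(",", 1)[1]), reverse=True)
    return "\n".join(header + rows[:n]) + "\n"


full = (
    "@RELATION wordfrequencies\n"
    "@ATTRIBUTE word string\n"
    "@ATTRIBUTE frequency numeric\n"
    "@DATA\n"
    "'a',1\n"
    "'b',5\n"
    "'c',3\n"
)
print(trim_arff(full, n=2))
```

Counting and slicing before writing, as in the rewrite above, is simpler when you control the code that produces the file; trimming afterwards is mainly useful when the ARFF comes from another tool.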