Python: How do I run every word through NLTK WordNet synsets and store the misspelled words in a separate list?
I am trying to take a text file containing messages and run every word through the NLTK WordNet synsets function. I want to do this because I want to build a list of misspelled words. For example, if I do:
wn.synsets('dog')
I get the output:
[Synset('dog.n.01'),
Synset('frump.n.01'),
Synset('dog.n.03'),
Synset('cad.n.01'),
Synset('frank.n.02'),
Synset('pawl.n.01'),
Synset('andiron.n.01'),
Synset('chase.v.01')]
Now, if the word is misspelled, like this:
wn.synsets('doeg')
I get the output:
[]
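If an empty list is returned, I would like to save the misspelled word in another list, like this, while continuing to iterate through the rest of the file: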
mispelled_words = ['doeg']
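I don't know how to do this. Below is my code; the iteration needs to happen after the variable "chat_messages_tokenized" is built. The names path points to the words I want to drop: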
import nltk
import csv
from nltk.tag import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem.snowball import SnowballStemmer

def text_function():
    #nltk.download('punkt')
    #nltk.download('averaged_perceptron_tagger')

    # Read in chat messages and names files
    chat_path = 'filepath.csv'
    try:
        with open(chat_path) as infile:
            chat_messages = infile.read()
    except Exception as error:
        print(error)
        return

    names_path = 'filepath.txt'
    try:
        with open(names_path) as infile:
            names = infile.read()
    except Exception as error:
        print(error)
        return

    chat_messages = chat_messages.split('Chats:')[1].strip()
    names = names.split('Name:')[1].strip().lower()

    chat_messages_tokenized = nltk.word_tokenize(chat_messages)
    names_tokenized = nltk.word_tokenize(names)

    # adding part of speech(pos) tag and dropping proper nouns
    pos_drop = pos_tag(chat_messages_tokenized)
    chat_messages_tokenized = [SnowballStemmer('english').stem(word.lower()) for word, pos in pos_drop if pos != 'NNP' and word not in names_tokenized]

    # attempted check for misspelled words (this is the part that does not work)
    for chat_messages_tokenized
        if not wn.synset(chat_messages_tokenized):
            print('empty list')

    # for s in wn.synsets('dog'):
    #     lemmas = s.lemmas()
    #     for l in lemmas:
    #         if l.name() == stemmer:
    #             print (l.synset())

    csv_path ='OutputFilePath.csv'
    try:
        with open(csv_path, 'w') as outfile:
            writer = csv.writer(outfile)
            for word in chat_messages_tokenized:
                writer.writerow([word])
    except Exception as error:
        print(error)
        return

if __name__ == '__main__':
    text_function()
Thanks in advance.

Your explanation already contains the pseudocode. You can code it exactly as you described, like this:
misspelled_words = []                  # The list to store misspelled words
for word in chat_messages_tokenized:   # loop through each word
    if not wn.synsets(word):           # if there are no synsets for this word
        misspelled_words.append(word)  # add it to the misspelled word list
print(misspelled_words)
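The loop above can also be wrapped in a small helper, which makes it easier to try out without the rest of the script. The following is only a sketch under a few assumptions: `find_misspelled`, `lookup`, `fake_synsets` and `KNOWN_WORDS` are hypothetical names that do not appear in the question, and the stand-in lookup is there purely so the example runs without the WordNet corpus installed. It also skips punctuation tokens and repeated words, which the original loop does not do.

```python
from typing import Callable, Iterable, List, Sequence


def find_misspelled(words: Iterable[str],
                    lookup: Callable[[str], Sequence]) -> List[str]:
    """Collect the words for which lookup(word) returns an empty result.

    Tokens are lowercased; non-alphabetic tokens (punctuation, numbers)
    and repeated words are skipped. Order of first appearance is kept.
    """
    seen = set()
    misspelled = []
    for word in words:
        token = word.lower()
        if not token.isalpha() or token in seen:
            continue
        seen.add(token)
        if not lookup(token):  # empty list means no entry was found
            misspelled.append(token)
    return misspelled


# Hypothetical stand-in for wn.synsets so the sketch runs on its own.
KNOWN_WORDS = {"dog": ["dog.n.01"], "chase": ["chase.v.01"]}


def fake_synsets(word: str) -> list:
    return KNOWN_WORDS.get(word, [])


print(find_misspelled(["Dog", "doeg", ",", "chase", "doeg"], fake_synsets))
# prints ['doeg']
```

With NLTK and the WordNet corpus available, passing `wn.synsets` as `lookup` (for example `find_misspelled(chat_messages_tokenized, wn.synsets)`) should perform the real check. Keep in mind that an empty result only means WordNet has no entry for the token. Function words such as "the" have no synsets, and stemmed tokens may not match a dictionary form, so the resulting list is probably best treated as candidates rather than confirmed misspellings.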
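You just need to loop over the words and check whether the returned list is empty, and if it is, put the word in the list?
It looks like you already know the logic of the function but not how to write the code? Your explanation actually already describes the pseudocode for how to code it.
Possible duplicate of your other question. It is essentially similar to this one, but this one has more code.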