Python: count words in a list using a dictionary

python nlp


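I have a list of dictionaries, each holding a word and some misspellings of it. I am trying to go through a list of strings and count first the occurrences of the word, then the occurrences of each misspelling. I tried using if word in string, but because many of the misspellings overlap with the actual word itself, the counts end up wrong. Can Python's Counter be used here, or does a regex make more sense?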

For example, I have

words = [{'word': 'achieve', 'misspellings': ['acheive', 'acheiv', 'achiev']},
         {'word': 'apparently', 'misspellings': ['apparantly', 'apparintly']}]

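I want to go through a list of strings and end up with a total count for each word and for its misspellings. I am running into trouble with misspellings like achiev: with if word in string the count is thrown off, because achiev is a substring of achieve and therefore matches every correctly spelled occurrence as well.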
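To make the overcounting concrete, here is a small sketch of the substring problem. The sample sentence is made up for illustration and is not from the question.

```python
# Substring tests match inside longer words, so a misspelling that is a
# prefix of the correct word is "found" even when only the correct word appears.
text = "I will achieve it"          # made-up sample; contains only the correct spelling

found_word = "achieve" in text      # True, as expected
found_miss = "achiev" in text       # also True: "achiev" is a prefix of "achieve"

# Comparing whole tokens instead of substrings avoids the false hit.
tokens = text.split()
exact_miss = "achiev" in tokens     # False

print(found_word, found_miss, exact_miss)
```

The answers below both rely on some form of whole-word comparison for this reason.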

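A regular expression is probably a good tool here, since it lets you avoid sub-matches inside longer words.

For each word, build a regex with

wordre = re.compile(r"\b" + word + r"\b", re.I | re.U)

and then count the matches with

len(re.findall(wordre, string))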
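As a rough sketch of how the word-boundary approach might be applied to the data structure from the question: the helper name count_whole_word and the sample strings are assumptions made for illustration. re.escape is added as a precaution in case a term contains regex metacharacters, and re.U is left out because str patterns are already Unicode-aware in Python 3.

```python
import re

# Data structure from the question.
words = [
    {'word': 'achieve', 'misspellings': ['acheive', 'acheiv', 'achiev']},
    {'word': 'apparently', 'misspellings': ['apparantly', 'apparintly']},
]

# Made-up sample strings, for illustration only.
strings = ["I will achieve it", "Apparently he did not acheive much", "achiev more"]


def count_whole_word(term, strings):
    """Count case-insensitive whole-word occurrences of term across all strings."""
    # \b on both sides stops "achiev" from matching inside "achieve".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.I)
    return sum(len(pattern.findall(s)) for s in strings)


counts = {}
for entry in words:
    counts[entry['word']] = {
        'word': count_whole_word(entry['word'], strings),
        'misspellings': {m: count_whole_word(m, strings) for m in entry['misspellings']},
    }

print(counts)
```

This compiles one pattern per spelling and scans every string once per pattern, which is fine for a short word list but may be slow for a large one.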

You should map each misspelling to the original word:

from collections import Counter
from string import punctuation

# Map every spelling, including the correct one, to the correct word.
words = {'acheive': 'achieve', 'achiev': 'achieve', 'achieve': 'achieve'}

s = "achiev acheive achieve"

cn = Counter()
for word in s.split():
    # Strip leading/trailing punctuation so "achieve!" still matches "achieve".
    word = word.strip(punctuation)
    if word in words:
        cn[words[word]] += 1

print(cn)
# Counter({'achieve': 3})

You can combine this with a regex that finds all the words in the string, instead of splitting on whitespace.
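One possible way to do that, as a sketch: tokenize with re.findall and feed the tokens through the same lookup table. The \w+ pattern, the lowercasing, and the sample text are assumptions for illustration, not part of the original answer.

```python
import re
from collections import Counter

# Lookup table from the answer: every spelling maps to the correct word.
words = {'acheive': 'achieve', 'achiev': 'achieve', 'achieve': 'achieve'}

s = "Achiev, acheive; achieve! (unrelated)"   # made-up sample text

# \w+ extracts runs of word characters, so punctuation never sticks to a token.
tokens = re.findall(r"\w+", s.lower())

# Count each known spelling under its correct word; unknown tokens are skipped.
cn = Counter(words[t] for t in tokens if t in words)
print(cn)
```

With this tokenization there is no need to strip punctuation separately, though \w+ also splits contractions such as "don't" into two tokens.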

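To count the misspellings and the correct spellings separately, just check whether the value returned by the dict lookup equals the word itself. If it does, increment the original count for that word; otherwise increment the miss count: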

from collections import defaultdict
from string import punctuation

words = {'acheive': 'achieve', 'achiev': 'achieve', 'achieve': 'achieve',
         'apparently': 'apparently', 'apparantly': 'apparently', 'apparintly': 'apparently'}

s = "achiev acheive achieve! 'apparently' apparintly 'apparantly?'"

# Each correct word gets its own pair of counters.
cn = defaultdict(lambda: {"orig": 0, "miss": 0})
for word in s.split():
    word = word.strip(punctuation)
    if word in words:
        wrd = words[word]
        if wrd == word:
            cn[wrd]["orig"] += 1   # spelled correctly
        else:
            cn[wrd]["miss"] += 1   # a known misspelling

print(dict(cn))
# {'achieve': {'orig': 1, 'miss': 2}, 'apparently': {'orig': 1, 'miss': 2}}
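The lookup table above is written out by hand. As a sketch of how it might be derived from the list of dictionaries in the question and then applied to a list of strings rather than a single one: the names lookup and strings, the sample data, and the lowercasing step are assumptions for illustration.

```python
from collections import defaultdict
from string import punctuation

# Data structure from the question.
words = [
    {'word': 'achieve', 'misspellings': ['acheive', 'acheiv', 'achiev']},
    {'word': 'apparently', 'misspellings': ['apparantly', 'apparintly']},
]

# Invert it: every spelling (correct or not) points at the correct word.
lookup = {}
for entry in words:
    lookup[entry['word']] = entry['word']
    for miss in entry['misspellings']:
        lookup[miss] = entry['word']

# Made-up sample strings, for illustration only.
strings = ["achiev acheive achieve!", "'apparently' apparintly", "Achieve more"]

cn = defaultdict(lambda: {"orig": 0, "miss": 0})
for s in strings:
    # Lowercasing is an added assumption so "Achieve" counts as "achieve".
    for token in s.lower().split():
        token = token.strip(punctuation)
        if token in lookup:
            key = "orig" if lookup[token] == token else "miss"
            cn[lookup[token]][key] += 1

print(dict(cn))
```

Because each token is looked up as a whole, achiev never matches inside achieve, which is the problem the question started with.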


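You need to provide more context. Do you have the code you are trying and some sample input? The expected output would help too. Can you give an example of a misspelled word that causes double counting? How do you decide what counts as a misspelling? Why not

misspellings = {'achieve': ['acheive', 'acheiv', 'achiev']}

and so on?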