How do I stop Python from changing the characters in my dictionary code?
My teacher gave us an assignment to write a spell checker for any language other than English, so I chose Dutch because its alphabet is close to the English one.
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('dutch2.txt').read()))

alphabet = 'aäbßcdefghijklmnoöpqrstuüvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
dutch2.txt contains the following characters:
When I run it, the output is:
*** Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32. ***
>>>
>>> correct("de")
'e'
>>>
This is incorrect.
The alphabet containing the other characters also gets changed:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('dutch2.txt').read()))

alphabet = 'aäbßcdefghijklmnoöpqrstuüvwxyz'
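How can I fix the changing characters?
I have tried a lot but could not get it to work. Can you make it run?

I would also strongly recommend working with graphemes rather than Unicode code points (which is what you get when you index a string). The easiest way is to import the third-party regex package and use regex.findall(r'\X', 'êo'). So your re.findall('[a-z]+', text.lower()) should become regex.findall(r'\X+', text.lower()), and you should then split each result in a separate step.

This is not Dutch, it is German.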
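One thing that is certain from the code in the question: the pattern '[a-z]+' only matches the 26 ASCII letters, so any word containing ä, ö, ü or ß is cut apart before it ever reaches NWORDS. A second possible cause, which is only a guess because the file itself is not shown, is that open('dutch2.txt') falls back to the Windows default encoding instead of the encoding the file was saved in. The sketch below uses only the standard library and assumes the corpus is UTF-8; the names LETTERS and load_counts are made up for illustration and are not part of the original program.

```python
import re
import unicodedata
from collections import Counter

# Runs of letters in any script: \w with digits and the underscore removed.
LETTERS = re.compile(r"[^\W\d_]+")

def words(text):
    """Return the lowercase words in text, keeping letters such as ä, ö, ü and ß."""
    # NFC turns 'u' + combining diaeresis into the single code point 'ü',
    # so both spellings of the same word end up as the same string.
    text = unicodedata.normalize("NFC", text)
    return LETTERS.findall(text.lower())

def load_counts(path, encoding="utf-8"):
    """Count the words in a corpus file.

    The encoding is an assumption: pass whatever the file was actually saved in.
    """
    with open(path, encoding=encoding) as handle:
        return Counter(words(handle.read()))
```

With this, words('De Straße ist über 3 km') keeps 'straße' and 'über' whole, whereas the original pattern turns 'über' into 'ber'. Note that Counter returns 0 for unseen words, while the original defaultdict(lambda: 1) starts every word at 1, so swap it back in if you rely on that smoothing. The alphabet string used by edits1 should list the same letters the tokenizer keeps.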