Replacing a word in Python using boundaries (analogous to regex)

python regex replace nlp

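I'm looking for a more robust replace method in Python, because I'm building a spell checker for words in an OCR context.

Suppose we have the following text in Python: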

text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""

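It is easy to see that the last line should read "this text", not "his text". If I do text.replace('his', 'this'), every 'his' gets replaced, including the one inside "this", so I end up with errors like "tthis is a text". When I replace, I want to replace the whole word "his", not the "his" inside "this". Why not try this: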

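import re

word_to_replace = 'his'
corrected_word = 'this'
# Raw strings keep \b as a regex word boundary (in a normal string '\b' is a backspace)
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text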
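As a quick check of the difference between the two approaches, here is a minimal sketch on a short string. The variable `sample` is made up for illustration and is not part of the original post.

```python
import re

sample = "this is his text"

# Plain substring replacement also rewrites the "his" inside "this".
naive = sample.replace("his", "this")

# \b anchors the match at word boundaries, so only the standalone word changes.
bounded = re.sub(r"\bhis\b", "this", sample)

print(naive)    # tthis is this text
print(bounded)  # this is this text
```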

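Great, we did it, but here is the problem: what if the word to correct contains a special character like "|"? For example, a misread form of "light" that contains "|". Trust me, this happened to me, and re.sub is a disaster in those cases. So the question is: have you run into the same problem, and is there a way to solve it? Plain replace seems like the most robust option.

I tried text.replace(' ' + word_to_replace + ' ', ' ' + corrected_word + ' '), which solves a lot, but it still fails on phrases like "his is a text": replace does nothing there because "his" is at the beginning of the sentence and is not surrounded by spaces.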
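To make both failure modes concrete, here is a small sketch. The token "l|ght" and the string `line` are hypothetical examples chosen for illustration; they are not taken from the original post.

```python
import re

# Hypothetical OCR token containing "|" (an assumption for this example).
wrong, right = "l|ght", "light"
line = "l|ght on, the l|ght is on"

# Unescaped, "|" means alternation: the pattern becomes r"\bl" OR r"ght\b",
# so the two halves of the token are matched and replaced separately.
unescaped = re.sub(r"\b" + wrong + r"\b", right, line)

# Space-padded str.replace only finds the token when it has a space on both
# sides, so the occurrence at the start of the line is left untouched.
padded = line.replace(" " + wrong + " ", " " + right + " ")

print(unescaped)  # light|light on, the light|light is on
print(padded)     # l|ght on, the light is on
```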

Is there any replace method in Python that takes the whole word as input and replaces only whole words, the way the regex \bword_to_correct\b does?

After a few days I solved the problem I had. I hope this helps someone else. Let me know if you have any questions.


text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""


import re

# Assume you have already corrected your word via OCR
# and you just have to replace it in the text (I did it with my OCR spellchecker).
# So we get the following word2correct and corrected_word (word after the spellchecking system)
word2correct = 'his'
corrected_word = 'this'

# Now we replace the word together with its context
def context_replace(old_word, new_word, text):
    # Match the word between \b boundaries using regex. This captures "his" and its
    # context but not "this" and its context. re.escape keeps characters such as "|" literal.
    matches = re.findall('.{0,10}\\b' + re.escape(old_word) + '\\b.{0,10}', text)
    if not matches:
        return text  # nothing to correct
    phrase2correct = matches[0]  # only the first occurrence is handled
    # Once the context is matched, put the new word into it
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Now replace the old phrase (phrase2correct) with the new one (phrase_corrected)
    text = text.replace(phrase2correct, phrase_corrected)
    return text

Test that the function works:

print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))

Output:

this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.

This works for my purposes. I hope it helps others.
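For readers who want every occurrence replaced, including words that contain characters like "|", one possible alternative is sketched below. It is not the method used in this answer: it escapes the word with re.escape and uses lookarounds instead of \b. The function name `replace_whole_word` and the sample strings are assumptions made for illustration.

```python
import re

def replace_whole_word(text, old_word, new_word):
    """Replace every standalone occurrence of old_word in text with new_word."""
    # re.escape makes characters such as "|" match literally.
    # (?<!\w) and (?!\w) require that no word character touches the match,
    # which also works when old_word starts or ends with punctuation.
    pattern = r"(?<!\w)" + re.escape(old_word) + r"(?!\w)"
    # A function as the replacement keeps any backslashes in new_word literal.
    return re.sub(pattern, lambda match: new_word, text)

print(replace_whole_word("his text, not this text", "his", "this"))
# this text, not this text
print(replace_whole_word("l|ght on, the l|ght is on", "l|ght", "light"))
# light on, the light is on
```

The lookarounds are used here instead of \b because \b only matches next to a word character, so it behaves unexpectedly when the word to replace begins or ends with a symbol.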

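I don't think there is a simple function that solves the problem you describe. OCR normalization is a huge field! Still, there is a library I'd suggest you look into, which helps a lot with OCR problems.

Thank you very much for your answer, I will take a look at the library.

No need to thank me, but if it helped, please upvote my answer.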