Replace method for words with boundaries in Python (like with regex)


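I am looking for a more robust replace method in Python, because I am building a spellchecker for words in an OCR context.

Suppose we have the following text in Python: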

text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""
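It is easy to see that the correct phrase should be "this is a text" and not "his is a text". If I call text.replace('his', 'this'), every occurrence of 'his' is replaced, including the one inside 'this', so I get errors like "tthis is a text". When I do a replacement, I want to replace the whole word 'his' only, not the 'his' that is part of 'this'.

Why not try this: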

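import re

word_to_replace = 'his'
corrected_word = 'this'
# Raw strings keep \b as a regex word boundary (a plain '\b' is a backspace character)
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text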
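Great, we did it, but the problem is: what if the word to correct contains a special character like '|'? For example, an OCR misreading of "light" that has a '|' in it. Trust me, this happened to me, and re.sub is a disaster in that case, because '|' is a regex operator (alternation) rather than a literal character.

The question is: have you run into the same problem? Is there any way to solve it? str.replace seems to be the most robust option.

I tried text.replace(' ' + word_to_replace + ' ', ' ' + corrected_word + ' '), which solves a lot, but it still fails on phrases like "his is a text". The replacement does not apply there because 'his' is at the beginning of the sentence, so there is no ' his ' to turn into ' this '.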

Is there any replace method in Python that takes the whole word as input, the way regex does with \bword_to_correct\b?
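One way the whole-word replacement asked about here could be sketched is to escape the word with re.escape and to use lookarounds instead of \b. This is only a sketch, not a definitive solution: the helper name replace_whole_word and the token 'li|ght' are hypothetical and chosen for illustration.

```python
import re

def replace_whole_word(text, old_word, new_word):
    # re.escape makes characters such as "|" match literally
    # instead of acting as regex operators.
    # The lookarounds require that no word character touches the match
    # on either side, so "his" inside "this" is left alone, while "his"
    # at the very start of the text is still found.
    pattern = r'(?<!\w)' + re.escape(old_word) + r'(?!\w)'
    # A function as replacement inserts new_word verbatim,
    # without any backslash processing.
    return re.sub(pattern, lambda match: new_word, text)

sample = "his is a text. this li|ght is on, and his text is hard to read."
print(replace_whole_word(sample, 'his', 'this'))
print(replace_whole_word(sample, 'li|ght', 'light'))
```

The first call changes both standalone occurrences of 'his' and leaves 'this' untouched. The second call replaces the token containing '|' as a literal string.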

After a few days, I solved the problem I had. I hope this can help someone else. If you have any questions, let me know.


text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""


import re

# Assume the word has already been corrected by the OCR spellchecker
# and only has to be replaced in the text (I did it with my OCR spellchecker).
# So we have word2correct and corrected_word (the word after spellchecking).
word2correct = 'his'
corrected_word = 'this'

# Replace the word together with its context
def context_replace(old_word, new_word, text):
    # Match the word between \b boundaries plus up to 10 characters of context per side.
    # This captures "his" and its context, but not "this" and its context.
    # re.escape keeps special characters such as '|' literal.
    match = re.search('(.{1,10})\\b' + re.escape(old_word) + '\\b(.{1,10})', text)
    if match is None:
        return text  # the word does not occur with context on both sides
    # Only the first occurrence is corrected
    phrase2correct = match.group(0)
    # Once the context is matched, put the new word between the two context parts
    phrase_corrected = match.group(1) + new_word + match.group(2)
    # Replace the old phrase (phrase2correct) with the new one (phrase_corrected)
    return text.replace(phrase2correct, phrase_corrected)

Test whether the function works:

print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))
Output:

this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with. 

This worked for my purpose. I hope it helps others.
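Since context_replace handles one word and only its first match, a spellchecker that produces many corrections might prefer a single pass over the text. The following is a minimal sketch under that assumption. The name apply_corrections and the corrections dictionary are hypothetical; the pairs 'his' to 'this' and 'ls' to 'is' are taken from the sample text.

```python
import re

def apply_corrections(text, corrections):
    # Nothing to do for an empty mapping.
    if not corrections:
        return text
    # Try longer misreadings first, so a short key cannot
    # shadow a longer key that starts with the same characters.
    keys = sorted(corrections, key=len, reverse=True)
    alternatives = '|'.join(re.escape(key) for key in keys)
    # Lookarounds enforce whole-word matches, also for keys with symbols.
    pattern = re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')
    # Look up the replacement for whichever key matched.
    return pattern.sub(lambda match: corrections[match.group(0)], text)

sample = "this ls a text.\nUnfortunately, his text is hard to read."
print(apply_corrections(sample, {'his': 'this', 'ls': 'is'}))
```

Every whole-word occurrence of each key is replaced, including matches at the start of a line, which the fixed-width context pattern cannot reach.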

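I don't think there is a simple function that solves the problem you describe. OCR normalization is a huge field! Still, I suggest you look into this library, which helps a lot with OCR problems.

Thank you very much for your answer, I will take a look at the library.

No need to thank me, but if it helped, please upvote my answer.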