How can I separate incorrectly combined words in OCRed text?


Tags: awk, sed, ocr, text-processing


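I have the text of a long document that was OCRed by someone else. It contains a lot of instances where the spacing wasn't recognized properly and two words are run together (e.g. divisionbetween, hasalready, everyoneelse). Is there a relatively quick way, using awk, sed, or the like, to find strings that are not words and check whether they can be separated into legitimate words?

Or is there some other quick way to fix them? For instance, I notice that Chrome is able to flag the combined words as misspellings, and when you right-click, the suggested correction is pretty much always the one I want, but I don't know a quick way to just auto-fix them all (and there are thousands).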

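Thanks,
Matt

Trying to do this with command-line tools is likely to introduce new errors while fixing others, but if you have a dictionary of words then you could use GNU awk (for patsplit() and for a multi-character RS, the latter in case any of your files have DOS line endings) to do something like this: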

$ cat words
bar
disco
discontent
exchange
experts
foo
is
now
of
tent
winter

$ cat file
now is the freezing winter
of ExPeRtSeXcHaNgE discontent

FWIW, this takes about half a second to run under Cygwin on my [underpowered] laptop.

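You need better tools that split the non-dictionary strings at different positions and compare the likelihood of each split. You will probably need the context of the surrounding text to make better decisions. The infamous example: expertsexchange.

As far as I can tell, the vast majority of these errors in the document are just two words run together. Also, Chrome's spell checker offers only one suggestion for expertsexchange: "experts exchange". And it doesn't miscorrect "therapist" either. Ultimately, I'm not looking for a perfect solution, just one that cleans up the thousands of obvious OCR errors. Thanks.

@EdMorton, you should skip words that are in the dictionary and assume they are not combined.

@MattV, "expert sex change" is the other interpretation. Chrome has Google's AI behind it; your script won't.

@karakfa, even vim's built-in spell checker offers "experts exchange" as the first suggestion for expertsexchange, and it evidently doesn't flag "therapist". I was hoping I could use a vim macro to accept the first spelling suggestion, but the macro I tried to record seems to pause at the point where I picked the first suggestion during recording. Oh well, back to the drawing board.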
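The idea raised in the comments, trying every split position and comparing how likely each result is, could be sketched roughly as below. This is only an illustration in Python, not something from the thread: `FREQ`, `log_prob`, and `best_two_way_split` are hypothetical names, and the counts in `FREQ` are made up. A real run would need word counts from a large corpus, and this sketch still ignores the surrounding context.

```python
import math

# Hypothetical unigram counts, invented for illustration only; real counts
# would come from a large corpus or a published word-frequency list.
FREQ = {
    "experts": 50, "exchange": 80, "expert": 120, "sex": 60, "change": 200,
    "them": 900, "all": 1500, "thema": 1, "ll": 1,
}
TOTAL = sum(FREQ.values())


def log_prob(word):
    """Log of the word's relative frequency, or -inf for an unknown word."""
    count = FREQ.get(word, 0)
    return math.log(count / TOTAL) if count else float("-inf")


def best_two_way_split(word, min_len=2):
    """Return the most likely (head, tail) pair, or None if no split works.

    Every split position is scored by the sum of the log-probabilities of
    its two halves; both halves must be at least min_len characters long.
    """
    lower = word.lower()
    best, best_score = None, float("-inf")
    for j in range(min_len, len(lower) - min_len + 1):
        score = log_prob(lower[:j]) + log_prob(lower[j:])
        if score > best_score:
            # Slice the original word so its capitalization is preserved.
            best, best_score = (word[:j], word[j:]), score
    return best
```

With these invented counts, `best_two_way_split("themall")` picks `("them", "all")` over `("thema", "ll")` because the first pair is far more frequent. Being a two-way split, it cannot propose "expert sex change" for expertsexchange at all.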

$ cat tst.awk
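BEGIN {
    RS = "\r?\n"                    # multi-char RS (gawk): accept Unix and DOS line endings
    minSubLgth = 2                  # shortest piece a split may produce
    minWordLgth = minSubLgth * 2    # shortest string worth trying to split
}
# First file: load the dictionary, lower-cased, as array indices.
NR==FNR {
    realWords[tolower($0)]
    next
}
# Second file: examine every run of at least minWordLgth letters.
{
    # words[] gets the alphabetic runs, seps[] the text between them.
    n = patsplit($0,words,"[[:alpha:]]{"minWordLgth",}+",seps)
    printf "%s", seps[0]
    for (i=1; i<=n; i++) {
        word = words[i]
        lcword = tolower(word)
        if ( !(lcword in realWords) ) {
            found = 0
            # Try split points from the longest head down to the shortest and
            # stop at the first one where both halves are dictionary words.
            for (j=length(lcword)-minSubLgth; j>=minSubLgth; j--) {
                head = substr(lcword,1,j)
                tail = substr(lcword,j+1)
                if ( (head in realWords) && (tail in realWords) ) {
                    found = 1
                    break
                }
            }
            # [[[...]]] marks a proposed split, <<<...>>> an unknown string
            # for which no two-word split was found.
            word = (found ? "[[[" substr(word,1,j) " " substr(word,j+1) "]]]" : "<<<" word ">>>")
        }
        printf "%s%s", word, seps[i]
    }
    print ""
}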

$ awk -f tst.awk words file
now is the <<<freezing>>> winter
of [[[ExPeRtS eXcHaNgE]]] discontent

$ cat file
I have the textof a long document that was OCRed by someoneelse that contains
a lot ofinstances where the spacingwasn't recognized properly and two words
are run together (ex: divisionbetween, hasalready, everyoneelse). Is there a
relatively quickway using awk, sed, or the like tofind strings that are not
words andcheck if they can separatedintolegitimate words?

Or is there someother quick way to fix them? Forinstance, Inotice that
Chrome is able toflag the combined words asmisspellings and when you right
click, thesuggested correction is pretty much always the oneIwant, but I
don't know a quickway to just auto-fix themall(and there are thousands).

$ awk -f tst.awk words_alpha.txt file
I have the [[[text of]]] a long document that was [[[OC Red]]] by [[[someone else]]] that contains
a lot [[[of instances]]] where the [[[spacing wasn]]]'t recognized properly and two words
are run together (ex: [[[division between]]], [[[has already]]], [[[everyone else]]]). Is there a
relatively [[[quick way]]] using awk, sed, or the like [[[to find]]] strings that are not
words [[[and check]]] if they can <<<separatedintolegitimate>>> words?

Or is there [[[some other]]] quick way to fix them? [[[For instance]]], [[[Ino tice]]] that
Chrome is able [[[to flag]]] the combined words [[[as misspellings]]] and when you right
click, [[[the suggested]]] correction is pretty much always the <<<oneIwant>>>, but I
don't know a [[[quick way]]] to just auto-fix [[[thema ll]]](and there are thousands).
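The strings left as `<<<separatedintolegitimate>>>` and `<<<oneIwant>>>` are three words run together, which a two-way split cannot handle. One possible extension, sketched here in Python rather than awk, is a small dynamic-programming pass that looks for the split into the fewest dictionary words. The function name `segment`, its parameters, and the tiny word set in the usage note are assumptions made for illustration, not part of the answer above.

```python
def segment(word, dictionary, min_len=2):
    """Split word into the fewest dictionary words, or return None.

    dictionary is a set of lower-case words; every piece must be at least
    min_len characters long. The original capitalization is preserved.
    """
    lower = word.lower()
    n = len(lower)
    # best[i] is the shortest list of cut positions covering lower[:i],
    # or None if lower[:i] cannot be built from dictionary words.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(0, end - min_len + 1):
            if best[start] is None or lower[start:end] not in dictionary:
                continue
            candidate = best[start] + [end]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    if best[n] is None:
        return None
    # Turn the cut positions back into slices of the original word.
    pieces, prev = [], 0
    for cut in best[n]:
        pieces.append(word[prev:cut])
        prev = cut
    return pieces
```

For example, with the illustrative set `{"separated", "into", "legitimate", "in", "to", "one", "i", "want"}`, `segment("separatedintolegitimate", words)` returns `["separated", "into", "legitimate"]`. `oneIwant` only splits when called with `min_len=1`, since "I" is a single letter, and allowing one-letter pieces is likely to cause more false splits against a large word list.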