NLP: Literature on spell checking?

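I am wondering whether there is a good literature list on how to implement a spell checker. The one example I could find is Peter Norvig's "How to Write a Spelling Corrector", which seems very unrealistic to me.

A few things I am interested in:

  • Building a spell checker without resorting to a dictionary (or by using an existing small corpus or n-gram dumps, e.g. the Google NGram dump)
  • Context-sensitive spell checking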
Quoting from the link below:

      How does it Work?

      The Basic Model

      The basic technology works as follows: The documents that the search engine is providing access to are added both to the search index and a language model. The language model stores seen phrases and maintains statistics about them. When a query is submitted, the src/QuerySpellCheck.java class looks for possible character edits, namely substitutions, insertions, replacements, transpositions, and deletions, that make the query a better fit for the language model. So if you type 'Gretski' as a query, and the underlying data is data from rec.sport.hockey, the language model will be much more familiar with the mildly edited 'Gretzky' and suggests it as an alternative.

      Domain Sensitivity

      The big advantage of this approach over dictionary-based spell checking is that the corrections are motivated by data in the search index. So "trt" will be corrected to "tort" in a legal domain, "tart" in a cooking domain, and "TRt" in a bio-informatics domain. On Google, there is no suggested correction, presumably because of web domains "trt.com", Thessaly Radio Television as well as Turkiye Radyo Televizyon, both aka TRT, etc.

      Context-Sensitive Correction

      Both Yahoo and Google perform context-sensitive correction. For instance, the query frod (an Old English term from the German meaning wise or experienced) has a suggested correction of ford (the automotive company, among others), whereas the query frod baggins has the corrected query frodo baggins (a 20th century English fictional character). That's the Yahoo behavior. Google doesn't correct frod baggins, even though there are about 785 hits for it versus 820,000 for Frodo Baggins. On the other hand, Google does correct frdo and frdo baggins. Amazon behaves similarly, but MSN corrects frd baggins to ford baggins rather than frodo baggins.

      LingPipe's model supports exactly this kind of context-sensitive correction.
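The idea in the quoted passage (learn statistics from the indexed documents, then prefer character edits that fit those statistics better) can be illustrated with a small Python sketch. This is not LingPipe's implementation: the class `QueryCorrector`, the helper `edits1`, the `edit_penalty` parameter, the word-level bigram model with add-one smoothing, and the toy documents are all assumptions made for illustration. Candidates are limited to a single character edit and must have been seen in the indexed text, so no external dictionary is involved.

```python
import itertools
import math
import string
from collections import Counter


def edits1(word):
    """Strings one deletion, transposition, substitution, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)


class QueryCorrector:
    """Hypothetical corrector driven only by the documents it has indexed."""

    def __init__(self, documents, edit_penalty=1.0):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for doc in documents:
            tokens = doc.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())
        self.edit_penalty = edit_penalty  # assumed cost of changing a token

    def candidates(self, token):
        # Only words seen in the indexed text count; fall back to the token.
        seen = {w for w in edits1(token) if w in self.unigrams}
        if token in self.unigrams:
            seen.add(token)
        return seen or {token}

    def _log_prob(self, word, prev=None):
        # Add-one smoothed unigram (query start) or bigram probability.
        vocab = len(self.unigrams)
        if prev is None:
            return math.log((self.unigrams[word] + 1) / (self.total + vocab))
        return math.log(
            (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + vocab)
        )

    def correct(self, query):
        tokens = query.lower().split()
        best, best_score = tokens, -math.inf
        options = [sorted(self.candidates(t)) for t in tokens]
        # Queries are short, so scoring every combination is affordable here.
        for combo in itertools.product(*options):
            score, prev = 0.0, None
            for original, word in zip(tokens, combo):
                score += self._log_prob(word, prev)
                if word != original:
                    score -= self.edit_penalty
                prev = word
            if score > best_score:
                best, best_score = list(combo), score
        return " ".join(best)


docs = [
    "ford trucks",
    "ford dealers",
    "ford motor company",
    "frodo baggins",
    "frodo baggins and the ring",
]
corrector = QueryCorrector(docs)
print(corrector.correct("frod"))          # ford
print(corrector.correct("frod baggins"))  # frodo baggins
```

With these toy documents, "frod" alone resolves to the more frequent "ford", while the following "baggins" tips the choice to "frodo". Training the same class on legal or cooking text would send "trt" to "tort" or "tart" respectively, which is the domain sensitivity the passage describes. A realistic system would need smarter candidate generation (edit distance greater than one) and a better-smoothed model.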
Here is a classic paper: . There is less work on context-aware error correction, but two papers that may be worth a look are and .

What makes you think Norvig's example is unrealistic? If you add an error model to it and compile it into a Levenshtein transducer, it should be a very good baseline spell checker.

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

Sure, I have copy-pasted the most important text.
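On the comment about adding an error model to Norvig's corrector: one way to picture it is a noisy-channel score that combines word frequency with a weighted edit distance, so that likelier typos cost less. The sketch below is only an illustration under stated assumptions. The `CHEAP_SUBSTITUTIONS` table, the 0.5 cost, the `error_weight` parameter, and the toy counts are all made up, and the distance is computed by plain dynamic programming rather than by a compiled Levenshtein transducer.

```python
import math
from collections import Counter

# Hypothetical confusion table: these substitutions are assumed to be
# likelier typos and therefore cheaper than an arbitrary substitution.
CHEAP_SUBSTITUTIONS = {("s", "z"), ("z", "s"), ("i", "y"), ("y", "i")}


def weighted_edit_distance(typed, intended):
    """Levenshtein distance with reduced cost for the assumed confusions."""
    rows, cols = len(typed) + 1, len(intended) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every typed character
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every intended character
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = typed[i - 1], intended[j - 1]
            if a == b:
                sub = 0.0
            elif (a, b) in CHEAP_SUBSTITUTIONS:
                sub = 0.5
            else:
                sub = 1.0
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,      # deletion
                dist[i][j - 1] + 1.0,      # insertion
                dist[i - 1][j - 1] + sub,  # match or substitution
            )
    return dist[-1][-1]


def correct(typed, word_counts, error_weight=3.0):
    """Pick the word maximizing log P(word) minus a weighted error cost."""
    total = sum(word_counts.values())

    def score(word):
        prior = math.log(word_counts[word] / total)
        return prior - error_weight * weighted_edit_distance(typed, word)

    return max(word_counts, key=score)


counts = Counter({"gretzky": 5, "great": 50, "grits": 2})
print(weighted_edit_distance("gretski", "gretzky"))  # 1.0
print(correct("gretski", counts))                    # gretzky
```

In practice the confusion costs would be estimated from observed misspellings rather than written by hand, and scanning the whole vocabulary would be replaced by something faster, which is where a transducer comes in.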