How to ignore certain characters when performing a diff in google-diff-match-patch?
I'm using google-diff-match-patch to compare plain text in natural languages. How can I make google-diff-match-patch ignore certain characters? (There are some tiny differences that I don't care about.) For example, given text1:
give me a cup of bean-milk. Thanks.
and text2:
please give mom a cup of bean milk! Thank you.
(Note that there are two space characters before "Thank you".)
google-diff-match-patch outputs something like this:
[please] give m(e)[om] a cup of bean(-)[ ]milk(.)[!] Thank(s)[ you].
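It seems that google-diff-match-patch only ignores differences in the number of whitespace characters.
How can I tell google-diff-match-patch to also ignore characters like [-.!]?
The expected result would be: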
[please] give m(e)[om] a cup of bean-milk. Thank(s)[ you].
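Thanks.

google-diff-match-patch can output a list of tuples. The first element of each tuple specifies whether it is an insertion (1), a deletion (-1), or an equality (0). The second element specifies the affected text. For example: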
diff_main("Good dog", "Bad dog") => [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
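To make the tuple format concrete, here is a minimal sketch that rebuilds both input texts from a diff list. The helper name rebuild is hypothetical and not part of the library; the diff list is the one from the example above.

```python
def rebuild(diffs):
    """Reconstruct (text1, text2) from a list of (op, data) tuples."""
    # text1 consists of everything that was deleted or left equal.
    text1 = "".join(data for op, data in diffs if op != 1)
    # text2 consists of everything that was inserted or left equal.
    text2 = "".join(data for op, data in diffs if op != -1)
    return text1, text2

diffs = [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
print(rebuild(diffs))  # ('Good dog', 'Bad dog')
```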
So we just need to filter this list.
Sample code in Python:
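import re

# Runs made up only of these characters are treated as insignificant.
Ignored_marks = re.compile(r'[ ,\.;:!\'"?-]+$')

def unmark_minor_diffs(diffs):
    # diffs is the list of (op, data) tuples produced by google-diff-match-patch
    cooked_diffs = []
    for (op, data) in diffs:
        if not Ignored_marks.match(data):
            # Significant text: keep the diff unchanged.
            cooked_diffs.append((op, data))
        elif op in (0, -1):
            # Ignorable text that exists in text1: report it as unchanged.
            cooked_diffs.append((0, data))
        # Ignorable insertions (op == 1) are dropped.
    return cooked_diffs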
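
As a rough end-to-end check, the sketch below applies the same filtering idea to a hand-written diff list that mirrors the output shown in the question. The tuple boundaries are an assumption; the real library may split the text differently. The names IGNORABLE, filter_minor, merge_adjacent, render and sample_diffs are hypothetical. The filter is restated compactly so the block runs on its own, and merge_adjacent is an extra step that joins neighbouring tuples with the same op, since filtering can leave several equalities in a row.

```python
import re

# Same character class as the answer's pattern (assumed set of ignorable marks).
IGNORABLE = re.compile(r'[ ,.;:!\'"?-]+$')

def filter_minor(diffs):
    """Turn ignorable deletions into equalities and drop ignorable insertions."""
    out = []
    for op, data in diffs:
        if not IGNORABLE.match(data):
            out.append((op, data))
        elif op in (0, -1):
            out.append((0, data))
    return out

def merge_adjacent(diffs):
    """Join neighbouring tuples that share the same op."""
    merged = []
    for op, data in diffs:
        if merged and merged[-1][0] == op:
            merged[-1] = (op, merged[-1][1] + data)
        else:
            merged.append((op, data))
    return merged

def render(diffs):
    """Show deletions as (x) and insertions as [x], as in the question."""
    wrap = {0: "{}", -1: "({})", 1: "[{}]"}
    return "".join(wrap[op].format(data) for op, data in diffs)

# Hand-written diff list modelled on the question's output (not library output).
sample_diffs = [
    (1, "please "), (0, "give m"), (-1, "e"), (1, "om"),
    (0, " a cup of bean"), (-1, "-"), (1, " "), (0, "milk"),
    (-1, "."), (1, "!"), (0, " Thank"), (-1, "s"), (1, " you"), (0, "."),
]

result = merge_adjacent(filter_minor(sample_diffs))
print(render(result))
# [please ]give m(e)[om] a cup of bean-milk. Thank(s)[ you].
```

Note that the filter keeps the punctuation from text1 and discards the punctuation from text2, which matches the expected result in the question.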