Regex: using a regular expression to find repeated-word typos in text


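Is it possible to find all repeated-word typos in a text (LaTeX source, in my case), such as:

... The Lagrangian that that includes this potential ...
... This is confimided by the the theorem of ...

using a regular expression, with your favorite tool (sed, grep) or language (Python, Perl, etc.)?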

Use egrep -w with a backreference, using the regexp (\w+)\s+\1:
$ echo "The Lagrangian that that includes this potential" | egrep -ow "(\w+)\s+\1"
that that

$ echo "This is confimided by the the theorem of" | egrep -ow "(\w+)\s+\1"
the the

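Note: the -o option prints only the matching part of each line, which is useful for demonstrating what actually matched; in practice you will probably want to drop it and use --color instead. The -w option is important because it restricts matches to whole words; without it, the is in This is con.. would also match.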

(\w+) # Matches & captures one or more word characters ([A-Za-z0-9_])
\s+   # Match one or more whitespace characters 
\1    # The last captured word  

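The benefit of using egrep -w --color "(\w+)\s+\1" file is that it clearly highlights the potentially erroneous repeated words. Replacing them automatically is probably not wise, because many correct phrases, such as reggae reggae sauce or beautiful beautiful day, would be altered.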

This JavaScript example works:
var s = '... The Lagrangian that that includes this potential ... This is confimided by the the theorem of ...'
var result = s.match(/\b(\w+)\s\1\b/gi)

Result:
["that that", "the the"];

The regular expression:
/\b(\w+)\s\1\b/gi

# /     --> Regex start,
# \b    --> A word boundary,
# (\w+) --> Followed by a word, grouped,
# \s    --> Followed by a space,
# \1    --> Followed by the word in group 1,
# \b    --> Followed by a word boundary,
# /gi   --> End regex, (g)lobal flag, case (i)nsensitive flag.

The word boundaries are there to keep the regex from matching strings such as "hot hotel" or "nice ice".
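
To make the effect of the word boundaries concrete, here is a minimal sketch in Python (chosen only for illustration; the names loose and strict are not from the answer). It runs the pattern with and without \b on the two problem strings and on a genuine repeat.

```python
import re

# Same pattern twice: once without word boundaries, once with them.
loose = re.compile(r"(\w+)\s\1", re.IGNORECASE)
strict = re.compile(r"\b(\w+)\s\1\b", re.IGNORECASE)

for phrase in ["hot hotel", "nice ice", "the the"]:
    # Print whether each variant finds a "repeated word" in the phrase.
    print(phrase, bool(loose.search(phrase)), bool(strict.search(phrase)))

# hot hotel True False
# nice ice True False
# the the True True
```

Without the boundaries, "hot hotel" matches as "hot hot" and "nice ice" matches as "ice ice"; with them, only the real repeat is reported.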

Try the following:
grep -E '\b(\w+)\s+\1\b'  myfile.txt
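
One limitation worth keeping in mind: grep processes its input line by line, so a repeat that is split across a line break (common in hard-wrapped LaTeX sources) is not reported. A rough sketch of a workaround in Python, assuming the whole file fits in memory; the function name repeats_with_lines is made up for this example:

```python
import re

# \s+ also matches newlines, so searching the whole text at once
# catches repeats that straddle a line break.
REPEAT = re.compile(r"\b(\w+)\s+\1\b")

def repeats_with_lines(text):
    """Return (line_number, word) pairs; the line number is where the match starts."""
    results = []
    for match in REPEAT.finditer(text):
        line_number = text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(1)))
    return results

source = "This is confimided by the\nthe theorem of ...\n"
print(repeats_with_lines(source))  # [(1, 'the')]
```

For a real file, the text would come from something like open(path).read() instead of the inline sample string.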

A Python example showing how to remove the repeated words:
In [1]: import re

In [2]: s1 = '... The Lagrangian that that includes this potential ...'

In [3]: s2 = '... This is confimided by the the theorem of ...'

In [4]: regex = r'\b(\w+)\s+\1\b'

In [5]: re.sub(regex, r'\g<1>', s1)
Out[5]: '... The Lagrangian that includes this potential ...'

In [6]: re.sub(regex, r'\g<1>', s2)
Out[6]: '... This is confimided by the theorem of ...'

Programmers have used regular expressions successfully.

Your regex will match "Tha sand and sea".
@Bohemian: I noticed, fixing it. Edit: fixed.

Your regex will match "Tha sand and sea".
@Bohemian Did you read my answer? grep's -w option is important here. My solution does not match your example; it is the same case as the This is .. in the OP's example.
OK, but does it also match "iPhone and android"?
@Bohemian No, as clearly stated in my answer, in the comments on my answer, and in the comments on your answer.

Just curious: what is \b?
@Cerbrus \b stands for a word boundary. The same effect can be achieved with grep's -w option, which forces the pattern to match only whole words.
Oh, nice, that is exactly what I needed! I did not know about these :P
You are using ERE features here, so your answer does not work as is!

You need to respect word boundaries: notice that the word is was removed in s2.
@sudo_O Yes, I just realized that and fixed it. Thanks.
Nice, but blindly replacing is probably not the best option, since you cannot rule out false positives; there are plenty of cases where a repeated word is acceptable.
@sudo_O I know, from a linguistic point of view it is not a good idea. But I trust the OP is aware of that.
Thanks. By the way, I think automatically removing repeated words is too dangerous, because sometimes they are correct: "decays into gamma gamma coupling".
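
One way to soften the concern about blind replacement raised in the comments is to skip words that are known to repeat legitimately. The following is only a sketch under that assumption: the ALLOWED set and the dedupe function are hypothetical names, and the allow-list would have to be maintained by hand for each document.

```python
import re

REPEAT = re.compile(r"\b(\w+)\s+\1\b")

# Hypothetical allow-list of words whose repetition is correct in this text.
ALLOWED = {"gamma"}

def dedupe(text, allowed=ALLOWED):
    """Collapse repeated words, leaving allow-listed repeats untouched."""
    def replace(match):
        word = match.group(1)
        if word.lower() in allowed:
            return match.group(0)  # keep the legitimate repeat as written
        return word                # otherwise keep a single copy
    return REPEAT.sub(replace, text)

print(dedupe("This is confimided by the the theorem of ..."))
print(dedupe("decays into gamma gamma coupling"))
```

Even with an allow-list, reviewing the highlighted matches by hand remains the safer route; this only reduces the number of obvious false positives.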