Python regex: remove duplicate phrases from a multiline string

python regex


The problem:

I have a multiline text, for example:

1: This is test string for my app. d
2: This is test string for my app.
3: This is test string for my app. abcd
4: This is test string for my app.
5: This is test string for my app.
6: This is test string for my app.
7: This is test string for my app. d
8: This is test string for my app.
9: This is test string for my app.
10: This is another string.

The line numbers are here only for better visualization; they are not part of the text itself.

What I tried:

I tried two different regular expressions, always with the same two flags. They produce different outputs; both are good, but neither is perfect.

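What I want to achieve:

Remove all duplicate phrases in the text, but keep one. For example, keep the first "This is test string for my app." from line 1, match the same phrase on lines 2-9, and leave line 10 untouched.

It would also work for me to keep the last matching phrase instead of the first. In that case, match lines 1-8 and keep lines 9 and 10.

Is there a way to achieve this with a regex?

FYI: later I will use the regex in Python to sub out the duplicates: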

re.sub(r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", "", my_text, flags=re.MULTILINE)
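For reference, here is a minimal sketch of what this substitution does on the sample above. The variable name `whole_line_result` is only illustrative. Because the backreference is anchored with `^\1$`, only lines that are identical as a whole are removed, and the last copy of each is the one that survives:

```python
import re

# Sample text from the question, without the visual line numbers.
my_text = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
])

# Drop a line (together with its line break) whenever an identical
# line appears somewhere later in the text.
whole_line_result = re.sub(
    r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", "", my_text, flags=re.MULTILINE
)
print(whole_line_result)
# This is test string for my app. abcd
# This is test string for my app. d
# This is test string for my app.
# This is another string.
```

The lines ending in "d" and "abcd" stay because no later line is identical to them, which is why this pattern alone does not remove the repeated phrase.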

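Edit: by "phrase" I mean, let's say, 3 or more words. So match any duplicate that is longer than 2 words. The expected output after the first sub would therefore be: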

This is test string for my app. d  //from line 1
This is test string for my app.    //from line 2
abcd                               //from line 3
This is another string.            //from line 10

Thanks in advance.

You can use

re.sub(r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*', r'\1', my_text, flags=re.M)

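Details:

- `^` - start of a line (because of the `re.M` option, `^` now matches at the start of every line)
- `(([^\n\r.]*).*)` - Group 1: zero or more characters other than a dot, CR and LF, captured into Group 2, followed by the rest of the line
- `(?:(?:\r?\n|\r)\2.*)*` - zero or more repetitions of:
  - `(?:\r?\n|\r)` - a CRLF, CR or LF line ending
  - `\2` - the same text as captured in Group 2
  - `.*` - the rest of the line

The replacement is the Group 1 value.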
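The breakdown above can be tried out with a short script. This is only a sketch on the sample text from the question, and the name `deduped` is illustrative:

```python
import re

# Sample text from the question, without the visual line numbers.
my_text = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
])

# Group 2 is the text before the first dot on a line; every directly
# following line that starts with the same text is consumed, and the
# whole run is replaced with the first line (Group 1).
deduped = re.sub(
    r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*', r'\1', my_text, flags=re.M
)
print(deduped)
# This is test string for my app. d
# This is another string.
```

Two caveats follow from how the pattern is built. It only collapses runs of consecutive lines, so a repeat separated by an unrelated line is left alone. And if a line is empty or starts with a dot, Group 2 is empty and `\2` matches at the start of any line, so all following lines are swallowed as well.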

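Is "This is test string for my app. abcd" also a duplicate? Do you mean you want to identify duplicate lines by the text before the first period on the line? `^([^\n\r.]*)\..*(?:\r?\n|\r)(?=[\s\S]*^\1\..*$)`? Or, if the dot with the rest of the line is optional, `^([^\n\r.]*)(?:\..*)?(?:\r?\n|\r)(?=[\s\S]*^\1(?:\..*)?$)`.

@anubhava Only the repeated phrase: "This is test string for my app." The abcd can stay. Only the phrase repeated within this string should go, regardless of whether there is a line break or a period at the end.

Try `re.sub(r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*', r'\1', my_text, flags=re.M)`.

@G43beli: Can you show your expected output?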
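The lookahead pattern suggested in the comments behaves differently from the answer's pattern: it keeps the last line of a group rather than the first. A minimal sketch on the question's sample, with `keep_last` as an illustrative name:

```python
import re

# Sample text from the question, without the visual line numbers.
my_text = "\n".join([
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
])

# Remove a line when some later line starts with the same text
# before its first dot; the last such line is therefore kept.
keep_last = re.sub(
    r'^([^\n\r.]*)\..*(?:\r?\n|\r)(?=[\s\S]*^\1\..*$)', '', my_text, flags=re.M
)
print(keep_last)
# This is test string for my app.
# This is another string.
```

This corresponds to the "match lines 1-8 and keep 9 and 10" variant described in the question. Unlike the consecutive-run pattern, the lookahead also finds repeats that are separated by unrelated lines.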
