How to detect and remove links inside a string in Python 3 with regex


regex python-3.x


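I have strings that may or may not contain a link. If a link is present, it is surrounded by [link] and [/link] tags. I want to replace such a section with a special token, e.g. URL, and also return the corresponding link.

Example

Suppose a function detect_link does the following: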

>input= 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
>replacement_token = "URL"
>link,new_sentence = detect_link(input,replacement_token)
>link
'http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/'
>new_sentence
'The statement URL The Washington Times'

I searched around and found that regular expressions can be used for this. However, I have no experience with them. Can someone help me?

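Edit

The links do not follow any fixed pattern. A link may or may not start with http, and it may or may not end with .com etc.

For this you need a regular-expression pattern. I have played around with regular expressions a lot.

You can use the following pattern to extract the tagged content and to replace it: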

import re

text = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'

# print the content captured between the tags
for mat in re.findall(r"\[link\](.*?)\[/link\]", text):
    print(mat)

# replace each tagged section with something else
print(re.sub(r"\[link\](.*?)\[/link\]", "[URL]", text))

Output:

http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ 

The statement [URL] The Washington Times
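The two calls above can be combined into the detect_link function from the question. This is only a minimal sketch under a few assumptions that are not in the original: the function returns None for the link when no tag pair is found, only the first tagged section is handled, and the whitespace around the link is consumed by \s* so the returned link is trimmed.

```python
import re

# Lazy pattern from the answer; the \s* parts are an added assumption so
# that the captured link comes back without surrounding spaces.
LINK_PATTERN = re.compile(r"\[link\]\s*(.*?)\s*\[/link\]")


def detect_link(sentence, replacement_token):
    """Return (link, new_sentence); link is None if no tag pair is found."""
    match = LINK_PATTERN.search(sentence)
    if match is None:
        return None, sentence
    # A function is used as the replacement so that backslashes in the
    # token are not interpreted as escapes; count=1 replaces only the
    # first tagged section, the one whose link is returned.
    new_sentence = LINK_PATTERN.sub(lambda m: replacement_token, sentence, count=1)
    return match.group(1), new_sentence


text = ("The statement [link] http://www.washingtontimes.com/news/2017/sep/9/"
        "rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] "
        "The Washington Times")
link, new_sentence = detect_link(text, "URL")
print(link)
print(new_sentence)  # The statement URL The Washington Times
```

Because the pattern only looks at the tags, it does not matter whether the link starts with http or ends with .com.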


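The pattern I use is non-greedy, so if several [link] ... [/link] sections occur in one sentence, it will not match across all of them at once but only the shortest possible section each time:

\[link\](.*?)\[/link\]   - matches a literal [link] tag, then as few characters
                           as possible up to the closing tag [/link]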

Without the non-greedy match you would get only one replacement covering everything between the first [link] and the last [/link] in

The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] and this also [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times

instead of two.
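To see the difference in replacement counts, here is a small sketch using re.subn, which returns the new string together with the number of substitutions made. The sentence and the short "link 1" / "link 2" placeholders are made up for illustration.

```python
import re

# Hypothetical one-line sentence with two tagged sections.
text = ("The statement [link] link 1 [/link] and this also "
        "[link] link 2 [/link] The Washington Times")

# Greedy: .* runs to the LAST [/link], so both sections collapse into one match.
greedy_result, greedy_count = re.subn(r"\[link\](.*)\[/link\]", "[URL]", text)

# Lazy: .*? stops at the FIRST [/link] it can reach, so each section matches separately.
lazy_result, lazy_count = re.subn(r"\[link\](.*?)\[/link\]", "[URL]", text)

print(greedy_count, greedy_result)  # 1 The statement [URL] The Washington Times
print(lazy_count, lazy_result)      # 2 The statement [URL] and this also [URL] The Washington Times
```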

Finding all links:

import re

text = """
The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times
The statement [link] link 3 [/link] and [link] link 4 [/link] The Washington Times
"""

# collect everything captured between the tags
links = re.findall(r"\[link\](.*)\[/link\]", text)        # greedy pattern
links_lazy = re.findall(r"\[link\](.*?)\[/link\]", text)  # lazy pattern

print(links)
print(links_lazy)

Output:


# greedy
[' link 1 [/link] and [link] link 2 ', 
 ' link 3 [/link] and [link] link 4 ']
# lazy
[' link 1 ', ' link 2 ', ' link 3 ', ' link 4 ']

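Because (.*) does not match newlines, the greedy pattern here still stops at the end of each line. The difference becomes obvious when the text contains no newlines between the links. So if a sentence contains several links, you need the (.*?) match to get each of them as a separate match instead of having the whole span matched at once.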
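The newline behaviour also matters in the other direction: if a tagged section itself spans a line break, neither pattern matches it unless the re.DOTALL flag is passed. The following sketch shows this with a made-up input string; the example.com URL is only a placeholder.

```python
import re

# Hypothetical input: the tagged section is split over two lines.
text = "See [link] http://example.com/\npage [/link] for details"

pattern = r"\[link\](.*?)\[/link\]"

# By default "." does not match "\n", so the closing tag is never reached.
without_flag = re.findall(pattern, text)

# With re.DOTALL, "." also matches newlines and the section is found.
with_flag = re.findall(pattern, text, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # [' http://example.com/\npage ']
```

With re.DOTALL the lazy quantifier becomes even more important, since a greedy (.*) could then run across line boundaries as well.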

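Would this do? Apart from that, it seems to work. Do you know how to integrate it into Python code?

Use re, as explained in the answer.

Thanks for the reply. If a sentence contains multiple links and I want to capture all of them, what should I do?

@zwlayer See the edit: use the lazy pattern (.*?) instead of (.*) and it will work.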
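For the multiple-link case raised in the comments, one possible sketch is a variant that returns every link as a list. The name detect_links and the choice to strip surrounding whitespace from each link are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Lazy pattern from the answer, so each tagged section is matched on its own.
LINK_PATTERN = re.compile(r"\[link\](.*?)\[/link\]")


def detect_links(sentence, replacement_token):
    """Return (list_of_links, new_sentence) for all tagged sections."""
    # findall returns the content of the capture group for every match.
    links = [raw.strip() for raw in LINK_PATTERN.findall(sentence)]
    # Replace every tagged section; the lambda keeps the token literal.
    new_sentence = LINK_PATTERN.sub(lambda m: replacement_token, sentence)
    return links, new_sentence


text = "The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times"
links, new_sentence = detect_links(text, "URL")
print(links)         # ['link 1', 'link 2']
print(new_sentence)  # The statement URL and URL The Washington Times
```

A string without any tags simply yields an empty list and the unchanged sentence.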