Python regex to remove URLs and domain names from a string

python regex

I'm looking for a regular expression that removes every URL or domain name from a string, so that:

string='this is my content domain.com more content http://domain2.org/content and more content domain.net/page'

becomes

'this is my content more content and more content'

Removing the most common TLDs is enough for me, so I tried:

string = re.sub(r'\w+(.net|.com|.org|.info|.edu|.gov|.uk|.de|.ca|.jp|.fr|.au|.us|.ru|.ch|.it|.nel|.se|.no|.es|.mil)\s?','',string)

But this removes far more than just the URLs. What is the correct syntax?

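You should escape all of those dots or, better, move the dot outside the group and escape it once. You can also match from the first non-whitespace character through the last non-whitespace character of the token, like this: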

re.sub(r'[\S]+\.(net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil)[\S]*\s?','',string)

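With this, the following:

'this is my content domain.com more content http://domain2.org/content and more content domain.net/page thingynet stuffocom'

becomes:

'this is my content more content and more content thingynet stuffocom'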
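To make the effect of escaping the dot concrete, here is a minimal sketch that runs the same kind of pattern with and without the escape. The helper name `strip_urls`, the shortened TLD list, and the sample sentence are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Shortened TLD list, for illustration only.
TLDS = "net|com|org|info|edu|gov"

# Dot escaped once, outside the group, as the answer suggests.
ESCAPED = re.compile(r"\S+\.(?:" + TLDS + r")\S*\s?")

# Same shape with the dot left unescaped, so it matches any character.
UNESCAPED = re.compile(r"\S+.(?:" + TLDS + r")\S*\s?")


def strip_urls(text, pattern=ESCAPED):
    """Remove every substring that the given pattern matches."""
    return pattern.sub("", text)


sample = "see domain.com for internet info"
cleaned_ok = strip_urls(sample)              # only domain.com is removed
cleaned_bad = strip_urls(sample, UNESCAPED)  # ordinary words are removed too
print(repr(cleaned_ok))
print(repr(cleaned_bad))
```

With the escaped dot only `domain.com` disappears. With the unescaped dot, `internet info` is also removed, because `.` matches the space in front of `info`.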

Here is another solution:

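import re

# Read the text to clean; the context manager closes the file afterwards.
with open('test.txt', 'r') as f:
    content = f.read()

# Match any whitespace-delimited token that contains .com, .org or .net.
pattern = r"[^\s]*\.(com|org|net)\S*"
result = re.sub(pattern, '', content)
print(result)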

Input:

this is my content domain.com more content http://domain2.org/content and more content domain.net/page' and https://www.foo.com/page.php

Output:

this is my content  more content  and more content  and
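The output above keeps the spaces that surrounded each removed token, so it contains doubled spaces. One possible follow-up, sketched here under the assumption that collapsing all whitespace runs is acceptable for your text, is to split and rejoin the result. The variable names are illustrative only.

```python
import re

text = ("this is my content domain.com more content "
        "http://domain2.org/content and more content domain.net/page")

# Same idea as the pattern above, written with a non-capturing group.
url_like = re.compile(r"\S*\.(?:com|org|net)\S*")

without_urls = url_like.sub("", text)

# Collapse the whitespace runs left behind by the removed tokens
# and drop any leading or trailing whitespace.
normalized = " ".join(without_urls.split())
print(normalized)
```

This yields the single-spaced string asked for in the question. Note that it also flattens newlines and tabs, which may not be what you want for multi-line input.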

Of course, an unescaped . matches any character.