Python: find URLs in text and replace them with their domain name


python regex url


I am working on an NLP project, and I want to replace every URL in the text with its domain name to simplify my corpus.

For example:

Input: Ask questions here https://stackoverflow.com/questions/ask
Output: Ask questions here stackoverflow.com

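At the moment, I find the URLs with the following regex:

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)

Then I iterate over them to get the domain names:

doms = [re.findall(r'^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)', url) for url in urls]

Then I simply replace each URL with its domain.

This is not an optimal approach, and I would like to know whether anyone has a better solution to this problem.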

You can use re.sub:

import re
s = 'Ask questions here https://stackoverflow.com/questions/ask, new stuff here https://stackoverflow.com/questions/, Final ask https://stackoverflow.com/questions/50565514/find-urls-in-text-and-replace-them-with-their-domain-name mail server here mail.inbox.com/whatever'
# Match URLs with or without a scheme (".com" hosts only) and keep just the host part
new_s = re.sub(r'https?://[\w.]+\.com[\w/\-]+|https?://[\w.]+\.com|[\w.]+\.com/[\w/\-]+', lambda x: re.findall(r'(?<=://)[\w.]+\.com|[\w.]+\.com', x.group())[0], s)
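The same idea can probably be written with a single capture group, so that no second regex call is needed inside the replacement. This is only a sketch, not the answerer's code: the names URL_RE and urls_to_domains are made up for illustration, and like the pattern above it assumes hosts ending in ".com".

```python
import re

# Hypothetical pattern: optional scheme, a captured ".com" host, optional path.
# Only ".com" hosts are handled, mirroring the limitation of the answer's regex.
URL_RE = re.compile(r'(?:https?://)?([\w.]+\.com)(?:/[\w/\-]*)?')

def urls_to_domains(text):
    """Replace each matched URL with the host captured in group 1."""
    return URL_RE.sub(r'\1', text)

s = ('Ask questions here https://stackoverflow.com/questions/ask, '
     'mail server here mail.inbox.com/whatever')
print(urls_to_domains(s))
```

A backreference such as \1 keeps the host extraction and the replacement in one pass, which tends to be easier to maintain than a nested re.findall call.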


You can also match the pattern http\S+, which starts with http and is followed by non-whitespace characters, to match the URL. Then parse the URL and return the hostname part:

import re
from urllib.parse import urlparse

subject = "Ask questions here https://stackoverflow.com/questions/ask and here https://stackoverflow.com/questions/"
print(re.sub(r"http\S+", lambda match: urlparse(match.group()).hostname, subject))
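Because \S+ also swallows punctuation that directly follows a URL, and urlparse may not find a hostname in every match, a slightly more defensive variant might look like the sketch below. The helper names and the set of trailing punctuation characters are assumptions chosen for illustration, not part of the answer.

```python
import re
from urllib.parse import urlparse

TRAILING = '.,;:!?)'  # assumed set of punctuation to keep outside the URL

def replace_with_hostname(match):
    raw = match.group()
    url = raw.rstrip(TRAILING)      # drop sentence punctuation stuck to the URL
    trailing = raw[len(url):]       # remember it so it can be put back
    try:
        host = urlparse(url).hostname
    except ValueError:              # e.g. a malformed bracketed host
        host = None
    # Leave the text untouched when no hostname could be parsed
    return host + trailing if host else raw

def urls_to_hostnames(text):
    return re.sub(r'https?://\S+', replace_with_hostname, text)

print(urls_to_hostnames('Ask here https://stackoverflow.com/questions/ask, '
                        'or https://mail.inbox.com/whatever?x=1&y=2.'))
```

Query parameters are handled by urlparse itself, so only the punctuation handling and the fallback are added here.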

Edit: if the string can start with http or www, the pattern can be extended to match both.
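One way to sketch that extension is shown here. It is not the answerer's original code: the names URL_RE, to_hostname and replace_urls are hypothetical, and the placeholder "http://" prefix is an assumption made so that urlparse can recognise the host of a bare "www." address.

```python
import re
from urllib.parse import urlparse

# Match either an explicit http/https scheme or a bare "www." prefix.
URL_RE = re.compile(r'\b(?:https?://|www\.)\S+')

def to_hostname(match):
    url = match.group()
    # urlparse only fills in the host when a scheme (or "//") is present,
    # so prepend a placeholder scheme to bare "www." addresses.
    if not url.startswith('http'):
        url = 'http://' + url
    try:
        host = urlparse(url).hostname
    except ValueError:
        host = None
    return host or match.group()    # keep the original text if parsing fails

def replace_urls(text):
    return URL_RE.sub(to_hostname, text)

print(replace_urls('see www.example.org/page and https://mail.inbox.com/whatever'))
```

With this sketch, the input "https://mail.inbox.com/whatever" yields "mail.inbox.com". The "www." prefix is kept in the result; stripping it would be a separate choice.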


This fails if the URL has parameters.

@m33n It should work now. Please see my latest edit.

I'm afraid it does not: it may fail because of the double point.

@m33n If a string contains the URL you just posted, what output do you expect from it?

The input is "https://mail.inbox.com/whatever" and I expect "mail.inbox.com" (I only added spaces to avoid URL formatting).

The URL can start with http or www.