Python: Can't capture href tag content with regex on first use


python python-3.x regex web-scraping


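I want to search a website for external links and paths by applying a regex to the href HTML attribute, but I don't know whether there is a simpler way than my code: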

import requests
import re

target_url = "http://testphp.vulnweb.com/"

response = requests.get(target_url)
# First pass: find every href="..." attribute, including the href= prefix
res = re.findall(r'href="[\w.:/]+"', response.content.decode("utf-8"))

for i in res:
    # Second pass: isolate the quoted value
    patt = re.compile(r'"[.:/\w]+"')
    not_raw = re.findall(patt, i)
    # Third pass: strip the surrounding quotes
    raw = re.findall(r"[.:/\w]+", not_raw[0])
    print(raw)

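Instead of using a regex three times, is there a way to select the paths and links inside the href attribute without capturing the href= part itself? I mean, the items in the res variable look like this: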

href="https://www.acunetix.com/vulnerability-scanner/"

Can I use a regex to extract just the URL from the items in res, like this:

https://www.acunetix.com/vulnerability-scanner/

Yes, you can use "capturing" and "non-capturing" groups. For example:

re.findall(r'(?:href=")([^\"]+)(?:")', response.content.decode("utf-8"))

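The ?: in (?:href=") marks a non-capturing group: that part must still match, but it is not returned, so re.findall yields only the contents of the capturing group ([^\"]+).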

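From the Python re documentation:

(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.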
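To make the difference concrete, here is a minimal sketch that runs the same search three ways on a small hard-coded snippet. The `html` string is a made-up stand-in for `response.content.decode("utf-8")`, not real output from the site. It also suggests that the non-capturing groups are optional here: a single capturing group around the value appears to be enough, because `re.findall` returns only captured groups when the pattern contains any.

```python
import re

# Hypothetical HTML snippet standing in for response.content.decode("utf-8")
html = (
    '<a href="https://www.acunetix.com/vulnerability-scanner/">scanner</a>'
    '<a href="artists.php">artists</a>'
)

# No groups: findall returns each whole match, href= prefix and quotes included
whole = re.findall(r'href="[^"]+"', html)

# One capturing group: findall returns only what the group matched
captured = re.findall(r'href="([^"]+)"', html)

# Non-capturing groups around the literal parts give the same result
non_capturing = re.findall(r'(?:href=")([^"]+)(?:")', html)

print(whole)
print(captured)
print(non_capturing)
```

Using `[^"]+` rather than the question's `[\w.:/]+` also picks up URLs that contain characters such as `-`, `?` or `=`.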

Parsing HTML with regular expressions is a poor choice.

To get all the href attributes, use an HTML parsing library such as BeautifulSoup, and try this:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://testphp.vulnweb.com/").content
soup = BeautifulSoup(response, "html.parser").find_all("a", href=True)
href_ = [a["href"] for a in soup if "http" in a["href"]]
print("\n".join(href_))

Output:

https://www.acunetix.com/
https://www.acunetix.com/vulnerability-scanner/
http://www.acunetix.com
https://www.acunetix.com/vulnerability-scanner/php-security-scanner/
https://www.acunetix.com/blog/articles/prevent-sql-injection-vulnerabilities-in-php-applications/
http://www.eclectasy.com/Fractal-Explorer/index.html
http://www.acunetix.com
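Note that the `"http" in a["href"]` filter keeps only absolute URLs, so relative paths are dropped. Since the question asks for both external links and paths, the following is a rough sketch of one way to collect both. It swaps BeautifulSoup for the standard library's `html.parser` purely to stay dependency-free; `HrefCollector`, `split_links` and the `sample` markup are hypothetical names and data introduced for illustration, not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, base_url):
    """Return (external, internal) lists of absolute URLs found in html."""
    parser = HrefCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    external, internal = [], []
    for href in parser.hrefs:
        # Resolve relative paths against the page URL
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return external, internal


# Made-up sample markup; a real page would come from requests.get(...).text
sample = (
    '<a href="https://www.acunetix.com/">Acunetix</a>'
    '<a href="artists.php">artists</a>'
    '<a name="top">no href here</a>'
)
external, internal = split_links(sample, "http://testphp.vulnweb.com/")
print(external)
print(internal)
```

Comparing hosts rather than checking for the substring "http" is a design choice in this sketch: it treats an absolute link back to the same site as internal, which the substring test would not.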

Probably because scraping HTML with regular expressions is generally discouraged. Use BeautifulSoup instead.