How to extract a URL from an HTML anchor element using Python 3?

python regex python-3.x


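I want to extract a URL from the HTML source of a web page.
For example, how do I extract the URL from an anchor element like the one assigned to url in the session below?

I don't know regular expressions. Also, I don't know how to install Beautiful Soup 4 or lxml on Windows; I get errors when I try to install these libraries.

I tried: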

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>

You can use the built-in xml.etree.ElementTree instead, or a third-party parser such as BeautifulSoup or lxml.html.
Personally, I prefer BeautifulSoup; it makes HTML parsing simple, transparent, and fun.

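To follow the link and download the file, you need to build a full URL including the scheme and domain (urljoin() would help) and then use urlretrieve(). Example: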
>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url).a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
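
To make the urljoin() step above a bit more concrete, here is a minimal sketch that builds the absolute URL and derives a local file name from it. The helper names (absolute_url, filename_from_url, download) are made up for illustration, and BASE_URL is only a placeholder, so treat this as one possible arrangement rather than the required one. The download() helper is defined but not called, since it needs network access.

```python
import posixpath
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

BASE_URL = "http://example.com"  # placeholder domain, replace with the real site


def absolute_url(base_url, href):
    """Resolve a possibly relative href against the page's base URL."""
    return urljoin(base_url, href)


def filename_from_url(url):
    """Take the last path segment of the URL as the local file name."""
    return posixpath.basename(urlsplit(url).path)


def download(base_url, href):
    """Download the linked file into the current directory (needs network access)."""
    url = absolute_url(base_url, href)
    local_name = filename_from_url(url)
    urlretrieve(url, local_name)
    return local_name


full = absolute_url(BASE_URL, "/example/hello/get/9f676bac2bb3.zip")
print(full)                     # http://example.com/example/hello/get/9f676bac2bb3.zip
print(filename_from_url(full))  # 9f676bac2bb3.zip
```

Passing an explicit file name to urlretrieve() keeps the download next to your script; without it, the file goes to a temporary location.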


UPD (for the different HTML posted in the comments): BeautifulSoup(data).find('a', text='XYZ').get('href') finds the anchor by its text and returns '/example/hello/get/9f676bac2bb3.zip'.

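OK, it runs this time. Thanks. How can I find this code in a web page?
Example: UI=User-Input, CG=Automatic-Changeable-for-file, ST=STATIC (get): /UI/UI/ST/CG.zip. Sorry, my English is not good.
Please explain what errors you ran into while installing BS4 or lxml.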

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url).a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'
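
If installing BeautifulSoup or lxml keeps failing, the standard library's html.parser might be enough. Unlike ET.fromstring(), it does not require the markup to be well-formed XML, so it can be fed a whole page. The sketch below is only one way to do it; the class and function names (AnchorHrefCollector, extract_hrefs) and the sample page string are invented for this example.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all anchors in the given HTML string."""
    parser = AnchorHrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Sample markup (made up); note the unclosed <br>, which an XML parser would reject.
page = '<p>Get it: <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a><br></p>'
print(extract_hrefs(page))  # ['/example/hello/get/9f676bac2bb3.zip']
```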

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data).find('a', text='XYZ').get('href')
>>> href
'/example/hello/get/9f676bac2bb3.zip'
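
For the follow-up in the comments (finding the link inside a whole page when parts of the path change), here is a rough standard-library sketch that records each anchor's href and text, then filters either by the link text or by a path pattern. ZIP_PATTERN is my guess at the described layout (two variable segments, the static segment "get", then a changing .zip name), and the extra /about link was added to the sample data only to show the filtering. The names AnchorCollector and find_links are likewise invented.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Record (href, text) for every <a>...</a> pair in the document."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while inside an open <a> element
        self._href = None        # href of the anchor currently open, if any
        self._text = []          # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            if self._href is not None:
                self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False
            self._href = None


def find_links(html):
    """Return a list of (href, text) tuples for the anchors in html."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Assumed layout: /<variable>/<variable>/get/<changing name>.zip
ZIP_PATTERN = re.compile(r"^/[^/]+/[^/]+/get/[^/]+\.zip$")

data = ('<html> <head> <body><example><example2> '
        '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> '
        '<a href="/about">About</a> '
        '</example2></example></body></head></html>')

links = find_links(data)
by_text = [href for href, text in links if text == "XYZ"]
by_pattern = [href for href, _ in links if ZIP_PATTERN.match(href)]
print(by_text)     # ['/example/hello/get/9f676bac2bb3.zip']
print(by_pattern)  # ['/example/hello/get/9f676bac2bb3.zip']
```

Filtering by text is closest to find('a', text='XYZ'); filtering by pattern is probably more robust when the link text is not known in advance.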