How do I extract a URL from an HTML anchor element using Python 3?

Tags: python, regex, python-3.x, python-3.2

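I want to extract a URL from a web page's HTML source.
For example:

<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>

How can I extract this URL?

I don't know regular expressions. I also don't know how to install Beautiful Soup 4 or lxml on Windows; I get errors when I try to install these libraries.

I tried: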

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>
You can use the built-in xml.etree.ElementTree instead.

Personally, I prefer BeautifulSoup: it makes HTML parsing easy, transparent, and fun.
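
Since the question mentions trouble installing Beautiful Soup 4 and lxml, here is a minimal sketch that relies only on the standard library's html.parser module. The names LinkCollector and extract_hrefs are made up for illustration and are not part of any library; treat this as one possible approach rather than the recommended one.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser.

    LinkCollector is a hypothetical helper name, not a library class.
    """

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all anchors in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
print(extract_hrefs(url))  # ['/example/hello/get/9f676bac2bb3.zip']
```

Unlike the lookbehind regex, this does not depend on the attribute order or on double quotes around the href value.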


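To follow the link and download the file, you need to build a full URL that includes the scheme and domain (urljoin will help with this), then use urlretrieve. Example: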

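>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> from bs4 import BeautifulSoup
>>> # url is the anchor HTML string from the question
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))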
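
To make the urljoin step more concrete, here is a small sketch of how it resolves different kinds of href values. absolute_url and download are hypothetical helper names, and http://example.com is only a placeholder domain; download is defined but not called, since it needs network access.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

BASE_URL = "http://example.com"  # placeholder domain, an assumption


def absolute_url(base, href):
    """Resolve an href taken from a page against that page's base URL."""
    return urljoin(base, href)


def download(base, href, filename):
    """Save the linked file to `filename` (hypothetical helper; needs network)."""
    local_path, _headers = urlretrieve(absolute_url(base, href), filename)
    return local_path


# A root-relative href replaces the whole path of the base URL.
print(absolute_url(BASE_URL, "/example/hello/get/9f676bac2bb3.zip"))
# http://example.com/example/hello/get/9f676bac2bb3.zip

# A relative href is resolved against the directory of the base URL.
print(absolute_url("http://example.com/files/index.html", "get/a.zip"))
# http://example.com/files/get/a.zip

# An href that is already absolute is returned unchanged.
print(absolute_url(BASE_URL, "https://xyz.com/b.zip"))
# https://xyz.com/b.zip
```

This is why urljoin is usually safer than concatenating the domain and the href by hand: plain string concatenation only works for the root-relative case.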

UPD (for the different HTML posted in the comments):

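>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data, 'html.parser').find('a', text='XYZ').get('href')
>>> href
'/example/hello/get/9f676bac2bb3.zip'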
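
If BeautifulSoup cannot be installed, looking up an anchor by its link text can be approximated with the standard library alone. The sketch below is one possible way to do it; AnchorTextFinder and href_for_text are hypothetical names chosen for illustration, and the matching rule (exact text after stripping whitespace) is an assumption, not the behaviour of BeautifulSoup's find.

```python
from html.parser import HTMLParser


class AnchorTextFinder(HTMLParser):
    """Record (href, text) for every <a> element, however deeply nested.

    AnchorTextFinder is a hypothetical helper name, not a library class.
    """

    def __init__(self):
        super().__init__()
        self.links = []           # completed (href, text) pairs
        self._href = None         # href of the <a> currently open, if any
        self._in_anchor = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def href_for_text(html, text):
    """Return the href of the first anchor whose text equals `text`, else None."""
    finder = AnchorTextFinder()
    finder.feed(html)
    finder.close()
    for href, label in finder.links:
        if label == text:
            return href
    return None


data = ('<html> <head> <body><example><example2> '
        '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
        ' </example2></example></body></head></html>')
print(href_for_text(data, "XYZ"))  # /example/hello/get/9f676bac2bb3.zip
```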

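OK, it runs this time. Thanks. How do I find this code in a web page?
Example:
UI=User-Input-CG=Automatic-Changeable-for-file-ST=STATIC(get)/UI/UI/ST/CG.zip
Sorry, my English is not good.

Please explain which errors you ran into while installing BS4 or lxml.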

The parsing snippets from the answer, using xml.etree.ElementTree, BeautifulSoup, and lxml.html in turn:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url).a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'