Python: finding HTML link address strings in a list


I have a list named "aList":

[
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n", 
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n", 
"<a href='a.html?dataset=1'><tt>outputs</tt></a></td>\n", 
"<img src='folder.gif' alt='folder'> &nbsp;<a href='catalog.html'><tt>test all files in a directory/</tt></a></td>\n", 
"<img src='/thredds/folder.gif' alt='folder'> &nbsp;<a href='enhancedcatalog.html'><tt>test enhanced catalog/</tt></a></td>\n",
"<hr size='1' noshade='noshade'><h3><a href='/abc/catalog.html'>abc</a> at <a href='http://www.abcd.com/'>csiro</a> see <a href='/abcd/serverinfo.html'> info </a><br>\n", 
"data server [version 4.6.10 - 2017-04-19t16:32:55-0600] <a href='http://www.unidata.ucar.edu/software/thredds/current/tds/reference/index.html'> documentation</a></h3>\n"
]
I tried the following, but it did not give the result I expected. Any suggestions would be appreciated:

matching = [s for s in aList if ".html" in s]
print(matching)

You can use a regular expression or BeautifulSoup to extract the href values from the HTML. Here is the regular-expression version; I hope it helps.

import re

urls = set()
for link in aList:
    # Capture the value of every href attribute in this fragment.
    urls.update(re.findall(r'href=[\'"]?([^\'" >]+)', link))

for url in urls:
    print(url)
Output:

/abcd/serverinfo.html
enhancedcatalog.html
a.html?dataset=1
catalog.html
/abc/catalog.html
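
If you prefer a parser over a regular expression, here is a minimal sketch of the BeautifulSoup alternative mentioned above, assuming the bs4 package is installed. It joins the list items into one HTML string and collects every href attribute:

from bs4 import BeautifulSoup

# Parse the concatenated fragments and gather the href value of every <a> tag.
soup = BeautifulSoup("".join(aList), "html.parser")
urls = {a["href"] for a in soup.find_all("a", href=True)}

for url in urls:
    print(url)

A parser is generally more robust than a regex when the markup contains unquoted or oddly spaced attributes, at the cost of an extra dependency.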

