Python: How to get the quoted strings from an HTML header?
Given this HTML snippet, how can I use the Python packages requests or lxml to find the quoted strings after href=?
<dl>
<dt><a href="oq-phys.htm">
<b>Physics and Astronomy</b></a>
<dt><a href="oq-math.htm">
<b>Mathematics</b></a>
<dt><a href="oq-life.htm">
<b>Life Sciences</b></a>
<dt><a href="oq-tech.htm">
<b>Technology</b></a>
<dt><a href="oq-geo.htm">
<b>Earth and Environmental Science</b></a>
</dl>
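For the example above, suppose html_string contains the snippet above:
import requests
import lxml.html as LH  # text_content() is provided by lxml.html elements, not lxml.etree

# Despite its name, html_string holds the parsed element tree, not a raw string
html_string = LH.fromstring(requests.get('http://openquestions.com').text)
for quoted_link in html_string.xpath('//a'):
    print(quoted_link.attrib['href'], quoted_link.text_content())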
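The loop above needs network access and lxml. As a rough offline sketch of the same idea, the snippet from the question can be fed to the standard library's html.parser instead (a swapped-in technique, not what the answer uses). The names SNIPPET, LinkCollector and collect_links are made up for this illustration.

```python
from html.parser import HTMLParser

# The HTML fragment from the question, inlined so no network access is needed.
SNIPPET = """<dl>
<dt><a href="oq-phys.htm">
<b>Physics and Astronomy</b></a>
<dt><a href="oq-math.htm">
<b>Mathematics</b></a>
<dt><a href="oq-life.htm">
<b>Life Sciences</b></a>
<dt><a href="oq-tech.htm">
<b>Technology</b></a>
<dt><a href="oq-geo.htm">
<b>Earth and Environmental Science</b></a>
</dl>"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


for href, text in collect_links(SNIPPET):
    print(href, text)
```

This is more verbose than the XPath version, but it has no third-party dependencies.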
A short requests + BeautifulSoup solution:
import requests, bs4
soup = bs4.BeautifulSoup(requests.get('http://www.openquestions.com').content, 'html.parser')
hrefs = [a['href'] for a in soup.select('dl dt a')]
print(hrefs)
Output:
['oq-phys.htm', 'oq-math.htm', 'oq-life.htm', 'oq-tech.htm', 'oq-geo.htm', 'oq-map.htm', 'oq-about.htm', 'oq-howto.htm', 'oqc/oqc-home.htm', 'oq-indx.htm', 'oq-news.htm', 'oq-best.htm', 'oq-gloss.htm', 'oq-quote.htm', 'oq-new.htm']
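The values in this list are relative links. If the goal is to request those pages next, one possible follow-up (not part of the original answer) is to resolve them against the site URL with urllib.parse.urljoin. The names BASE_URL and absolute_links are assumptions made for this sketch, and only a few of the hrefs are repeated here.

```python
from urllib.parse import urljoin

# Assumed base: the site URL used in the answers, with a trailing slash.
BASE_URL = "http://www.openquestions.com/"

# A few of the relative hrefs from the output above.
hrefs = ["oq-phys.htm", "oq-math.htm", "oqc/oqc-home.htm"]

# urljoin resolves each relative reference against the base URL.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs]

for link in absolute_links:
    print(link)
```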
There are many ways to skin this cat. Here is a requests/lxml solution that does not use an (explicit) for loop:
import requests
from lxml.html import fromstring
req = requests.get('http://www.openquestions.com')
resp = fromstring(req.content)
hrefs = resp.xpath('//dt/a/@href')
print(hrefs)
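To try the XPath without hitting the live site, the same expression can be run against the snippet from the question. This is a minimal sketch that assumes lxml is installed; SNIPPET and extract_links are names invented for the illustration.

```python
from lxml.html import fromstring

# The HTML fragment from the question, inlined so no network access is needed.
SNIPPET = """<dl>
<dt><a href="oq-phys.htm">
<b>Physics and Astronomy</b></a>
<dt><a href="oq-math.htm">
<b>Mathematics</b></a>
<dt><a href="oq-life.htm">
<b>Life Sciences</b></a>
<dt><a href="oq-tech.htm">
<b>Technology</b></a>
<dt><a href="oq-geo.htm">
<b>Earth and Environmental Science</b></a>
</dl>"""


def extract_links(html):
    """Hypothetical helper: return (href, link text) pairs for every <dt><a> link."""
    root = fromstring(html)
    return [(a.get("href"), a.text_content().strip()) for a in root.xpath("//dt/a")]


links = extract_links(SNIPPET)
hrefs = [href for href, _ in links]
print(hrefs)
```

Selecting the a elements rather than only @href keeps the link text available alongside each quoted string.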
Edit
Why I wrote it this way:
- I prefer XPath to CSS selectors
- It is fast
Benchmark:
import requests,bs4
from lxml.html import fromstring
import timeit
req = requests.get('http://www.openquestions.com').content
def myfunc():
    resp = fromstring(req)
    hrefs = resp.xpath('//dl/dt/a/@href')
print("Time for lxml: ", timeit.timeit(myfunc, number=100))
##############################################################
resp2 = requests.get('http://www.openquestions.com').content
def func2():
    soup = bs4.BeautifulSoup(resp2, 'html.parser')
    hrefs = [a['href'] for a in soup.select('dl dt a')]
print("Time for beautiful soup:", timeit.timeit(func2, number=100))
Output:
('Time for lxml: ', 0.09621267095780464)
('Time for beautiful soup:', 0.8594218329542824)
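The snippet comes from the following Python code: page=requests.get(')print(page.text)
You ask about requests and xml as if they were similar things, which confuses me. Do you want to fetch the content of the HTML page, or do you want to parse the content and find a specific part?
I may have left some important information out of the question, but the result is an error. Thanks for your effort!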