Extracting a deeply nested href in Python with Beautiful Soup
I'm trying to extract a very deeply nested href. The structure looks like this:
<div id="main">
<ol>
<li class>
<div class>
<div class>
<a class>
<h1 class="title entry-title">
<a href="http://www.link_i_want_to_extract.com">
<span class>
</h1>
</div>
</div>
</li>
I tried the following:

soup.select('li div div h1')

as well as a couple of variants, but none of them seem to work; I get [] or an empty .txt file.
Also, more disturbingly, after defining soup I did print(soup) and I don't see the nested elements, only the top-level ones. I also did print(soup.li) and it retrieved nothing. I don't think BeautifulSoup is recognizing the li and the other nested tags.
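When select() keeps returning [], it helps to confirm what tree the parser actually built before blaming the selector. A small diagnostic sketch, assuming the snippet above is what actually reached the parser (if the real page is rendered by JavaScript, the nested tags never make it into soup at all):

```python
from bs4 import BeautifulSoup

html = '''
<div id="main">
<ol>
<li class>
<div class>
<div class>
<a class>
<h1 class="title entry-title">
<a href="http://www.link_i_want_to_extract.com">
<span class>
</h1>
</div>
</div>
</li>
</ol>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# prettify() shows the tree the parser actually built,
# which is what the selectors run against
print(soup.prettify())
# If select() returns [] but the tags are visible above, the
# selector is wrong; if the tags are missing from the output,
# the markup probably never reached the parser (e.g. JS-rendered).
print(soup.select('li div div h1'))
```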
Use find and take the first descendant:
soup.find('div', id="main").h1.a['href']
Or use the h1 as the anchor point:
soup.find("h1", { "class" : "title entry-title" }).a['href']
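One caveat with the chained form: find() returns None when a step matches nothing, and the next attribute access then raises AttributeError (as a commenter below ran into). A hedged sketch of a defensive variant, over a made-up minimal document:

```python
from bs4 import BeautifulSoup

# made-up minimal document for illustration
html = ('<div id="main"><h1 class="title entry-title">'
        '<a href="http://www.link_i_want_to_extract.com">x</a></h1></div>')
soup = BeautifulSoup(html, "html.parser")

# Each find() may return None, so guard every step
# instead of chaining .h1.a['href'] blindly.
main = soup.find('div', id="main")
h1 = main.h1 if main is not None else None
href = h1.a['href'] if h1 is not None and h1.a is not None else None
print(href)
```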
This works for me:
from bs4 import BeautifulSoup
html = '''
<div id="main">
<ol>
<li class>
<div class>
<div class>
<a class>
<h1 class="title entry-title">
<a href="http://www.link_i_want_to_extract.com">
<span class>
</h1>
</div>
</div>
</li>
<li class>
<div class>
<div class>
<a class>
<h1 class="title entry-title">
<a href="https://other_link_i_want_to_extract.net">
<span class>
</h1>
</div>
</div>
</li>
</ol>
</div>
'''
soup = BeautifulSoup(html, "lxml")
for h1 in soup.find_all('h1', class_="title entry-title"):
    print(h1.find("a")['href'])
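The same anchor-on-h1 idea also works as a CSS selector, which matches the two classes independently of their order; a small sketch over a trimmed version of the markup above:

```python
from bs4 import BeautifulSoup

html = '''
<h1 class="title entry-title">
<a href="http://www.link_i_want_to_extract.com"><span></span></a>
</h1>
<h1 class="title entry-title">
<a href="https://other_link_i_want_to_extract.net"><span></span></a>
</h1>
'''
soup = BeautifulSoup(html, "html.parser")
# h1.title.entry-title requires both classes on the h1;
# "> a" then takes the anchor directly inside it
links = [a['href'] for a in soup.select('h1.title.entry-title > a')]
print(links)
```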
A simple approach:
soup.select('a[href]')
or:

soup.findAll('a', href=True)

You have a typo: href=TRUE, it should be href=True.
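For context: TRUE is not defined in Python, so findAll('a', href=TRUE) raises a NameError, while with the boolean True it filters to tags that actually carry the attribute. A quick sketch over a made-up two-link snippet:

```python
from bs4 import BeautifulSoup

# made-up snippet: one <a> with an href, one without
html = ('<p><a href="http://www.link_i_want_to_extract.com">x</a>'
        '<a class="other">y</a></p>')
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only tags where the href attribute is present
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
```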
s = """
<div id="main">
<ol>
<li class>
<div class>
<div class>
<a class>
<h1 class="title entry-title">
<a href="http://www.link_i_want_to_extract.com">
<span class>
</h1>
</div>
</div>
</li>
<li class>
<div class>
<div class>
<a class>
<h1 class="title entry-title">
<a href="https://other_link_i_want_to_extract.net">
<span class>
</h1>
</div>
</div>
</li>
</ol>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, 'html.parser')
for item in soup.find_all("h1", attrs={"class": "title entry-title"}):
    for link in item.find_all('a', href=True):
        print('bs link:', link['href'])
And with PyQuery:

from pyquery import PyQuery as pq
from lxml import etree
d = pq(s)
for link in d('h1.title.entry-title > a'):
    print('pq link:', pq(link).attr('href'))

Returns:
bs link: http://www.link_i_want_to_extract.com
bs link: https://other_link_i_want_to_extract.net
pq link: http://www.link_i_want_to_extract.com
pq link: https://other_link_i_want_to_extract.net
I got AttributeError: 'NoneType' object has no attribute 'a'. @ppasler, test it before you comment. Nice, didn't know about PyQuery yet!