Extracting a deeply nested href in Python with Beautiful Soup


I'm trying to extract a very deeply nested href. The structure looks like this:

<div id="main">
 <ol>
   <li class>
     <div class>
       <div class>
         <a class>
         <h1 class="title entry-title">
           <a href="http://www.link_i_want_to_extract.com">
           <span class>
         </h1>
        </div>
       </div>
     </li>
I tried the following:

soup.select('li div div h1')

None of these seem to work; I either get [] or an empty .txt file.


Also, more troubling: after defining soup, I did print(soup) and I don't see the nested tags, only the top-level ones. I also did print(soup.li) and it didn't retrieve the li element. I don't think BeautifulSoup is recognizing the li and the other nested tags.
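One way to see what BeautifulSoup actually parsed is prettify(), and note that tag attribute access uses the full tag name (soup.li, not soup.l). A minimal sketch, using html.parser and a simplified, hypothetical stand-in for the page's markup:

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the page's markup (hypothetical).
html = ('<div id="main"><ol><li>'
        '<h1 class="title entry-title">'
        '<a href="http://www.link_i_want_to_extract.com">post</a>'
        '</h1></li></ol></div>')

soup = BeautifulSoup(html, "html.parser")

# prettify() prints the tree the parser actually built -- the quickest
# way to check whether the nested tags survived parsing.
print(soup.prettify())

# Tag access needs the full tag name: soup.li, not soup.l.
print(soup.li.h1.a['href'])
```

If the nested tags are missing from prettify()'s output, the parser choked on the markup, and trying a different parser (lxml or html5lib) is worth a shot.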

Use find to get the first descendant:

soup.find('div', id="main").h1.a['href']
or use the h1 as the anchor point:

soup.find("h1", { "class" : "title entry-title" }).a['href']
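A quick check of both one-liners, against a simplified, well-formed (hypothetical) copy of the question's markup:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed version of the question's markup.
html = '''
<div id="main">
  <ol>
    <li>
      <h1 class="title entry-title">
        <a href="http://www.link_i_want_to_extract.com">post title</a>
      </h1>
    </li>
  </ol>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# First descendant <a> of the first <h1> under #main:
first = soup.find('div', id="main").h1.a['href']

# Same link, anchoring on the <h1>'s classes instead:
anchored = soup.find("h1", {"class": "title entry-title"}).a['href']

print(first)
print(anchored)
```

Both approaches land on the same nested href; the second is more robust if #main ever gains other h1 elements before the title.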
This works for me:

from bs4 import BeautifulSoup

html = '''
<div id="main">
   <ol>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="http://www.link_i_want_to_extract.com">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="https://other_link_i_want_to_extract.net">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
   </ol>
</div>
'''

soup = BeautifulSoup(html, "lxml")
for h1 in soup.find_all('h1', class_="title entry-title"):
    print(h1.find("a")['href'])
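The find_all loop above can also be collapsed into a single CSS selector with select(); a sketch of an equivalent, using html.parser and a simplified well-formed copy of the markup:

```python
from bs4 import BeautifulSoup

html = '''
<div id="main"><ol>
  <li><h1 class="title entry-title">
    <a href="http://www.link_i_want_to_extract.com">a</a>
  </h1></li>
  <li><h1 class="title entry-title">
    <a href="https://other_link_i_want_to_extract.net">b</a>
  </h1></li>
</ol></div>
'''
soup = BeautifulSoup(html, "html.parser")

# h1.title.entry-title requires both classes on the <h1>;
# a[href] skips anchors that have no href attribute.
links = [a['href'] for a in soup.select('h1.title.entry-title a[href]')]
print(links)
```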
A simple way:

soup.select('a[href]')
Or:

soup.findAll('a', href=True)

(You have a typo: href=TRUE should be href=True.)
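Note that both forms grab every href on the page, not just the ones inside the titles; a sketch of the difference, on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up page: one href-less anchor, one title link, one unrelated link.
html = '''
<div id="main">
  <a class="nav">no href here</a>
  <h1 class="title entry-title"><a href="http://www.link_i_want_to_extract.com">t</a></h1>
  <a href="/somewhere_else">unrelated</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Both skip the href-less anchor but keep every other link on the page:
all_css = [a['href'] for a in soup.select('a[href]')]
all_api = [a['href'] for a in soup.find_all('a', href=True)]

# Scope the selector if only the title links are wanted:
titles = [a['href'] for a in soup.select('h1.title.entry-title a[href]')]

print(all_css, all_api, titles)
```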

s = """
<div id="main">
   <ol>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="http://www.link_i_want_to_extract.com">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
      <li class>
         <div class>
            <div class>
               <a class>
               <h1 class="title entry-title">
                  <a href="https://other_link_i_want_to_extract.net">
                  <span class>
               </h1>
            </div>
         </div>
      </li>
   </ol>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(s, 'html.parser')

for item in soup.find_all("h1", attrs={"class" : "title entry-title"}):
    for link in item.find_all('a',href=True):
        print('bs link:', link['href'])
An equivalent with PyQuery:

from pyquery import PyQuery as pq

d = pq(s)
for link in d('h1.title.entry-title > a'):
    print('pq link:', pq(link).attr('href'))

Returns:

bs link: http://www.link_i_want_to_extract.com
bs link: https://other_link_i_want_to_extract.net
pq link: http://www.link_i_want_to_extract.com
pq link: https://other_link_i_want_to_extract.net

I got AttributeError: 'NoneType' object has no attribute 'a'. @ppasler test it before you comment. Nice, I didn't know about PyQuery yet!