Python: trying to do nested scraping with BeautifulSoup


My code is as follows:

<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Our Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>

If you are having trouble with nextSibling, it is because your HTML actually looks like this:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

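The newline after the closing h1 tag is itself a sibling (a whitespace text node), so reaching the div takes two hops: h1.nextSibling.nextSibling. Here is how the children are numbered: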

<div>\n #<---newline plus spaces at start of next line = child 0
  <div>hello</div>\n #<--newline plus spaces at start of next line = child 2
  <div>world</div>\n #<--newline plus spaces at start of next line = child 4
  <div>goodbye</div>\n #<--newline = child 6
</div>

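To skip all the whitespace between tags, though, it is easier to use findNextSibling(), which lets you specify the tag name of the next sibling you are looking for: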

findNextSibling('div')
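To make the difference concrete, here is a small sketch comparing the two approaches. It assumes bs4 (BeautifulSoup 4) on Python 3, where these are spelled next_sibling and find_next_sibling; the camelCase names in this answer are the BeautifulSoup 3 spellings.

```python
from bs4 import BeautifulSoup

# An h1 followed by a newline and then the div, as in the question's HTML.
html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, "html.parser")
h1 = soup.h1

first = h1.next_sibling                 # the whitespace text node
second = h1.next_sibling.next_sibling   # the div, reached in two hops
found = h1.find_next_sibling("div")     # the div, whitespace skipped for us

print(repr(first))
print(second is found)
```

The two-hop version only works when exactly one text node sits between the tags, which is why searching by tag name tends to be the more robust choice.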

Here is an example:

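# Note: this uses the legacy BeautifulSoup 3 import and Python 2 print syntax.
from BeautifulSoup import BeautifulSoup

with open('data2.txt') as f:
    html = f.read()

soup = BeautifulSoup(html)

for h1 in soup.findAll('h1'):
    # Skip the whitespace text node and go straight to the next <div> sibling.
    colmask_div = h1.findNextSibling('div')

    # Each "box" div holds one <h4> heading and several <ul> items.
    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        for ul in box_div.findAll('ul'):
            print '{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text)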



--output:--
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4
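For readers on Python 3, the same nested loop might be sketched with bs4 as follows. This is an adaptation under assumptions, not the code the answer was tested with: it uses the bs4 method names (find_all, find_next_sibling), the "html.parser" backend, an inline and shortened copy of the question's HTML instead of data2.txt, and a hypothetical helper named scrape that returns the rows instead of printing them directly.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question, inlined so the sketch is self-contained.
html = """\
<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
</div>
</div>
"""


def scrape(markup):
    """Return (h1 text, h4 text, number) tuples, one per <ul>."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for h1 in soup.find_all("h1"):
        # Jump past the whitespace text node to the div that follows the h1.
        colmask_div = h1.find_next_sibling("div")
        for box_div in colmask_div.find_all("div"):
            h4 = box_div.find("h4")
            for ul in box_div.find_all("ul"):
                rows.append((h1.get_text(), h4.get_text(), ul.li.a.get_text()))
    return rows


rows = scrape(html)
for row in rows:
    print(" : ".join(row))
```

Returning tuples rather than printing inside the loop keeps the traversal separate from the output format, which makes the result easier to reuse or check.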

Thank you. You really helped me understand how BeautifulSoup works.
