Parsing unstructured HTML with Python BeautifulSoup

I'm trying to parse this HTML with BeautifulSoup:

<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>


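Monday Some info here...
and then some

Tuesday Some info here...

Wednesday Some info here...

...

I only want to get Tuesday's data:

Tuesday Some info here...

But since there is no wrapper div, I'm having a hard time getting only this data. Any suggestions?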

How about this:

from bs4 import BeautifulSoup

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""
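# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# Find <strong>Tuesday</strong>, then take the first text node that follows it
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
# In Python 3 the result is already a str subclass, so no decode() is needed
print(result)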

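This prints:

 Some info here...

Update based on the comments, to get everything up to the next day instead:

 Some info here...
 and then some

Basically, you can keep taking the next sibling text after Tuesday until the element that follows the text is another <strong> element or None:
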
from bs4 import BeautifulSoup

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br /> and then some <br />
  <strong>Wednesday</strong> Some info here...<br />
  ...
</div>"""
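soup = BeautifulSoup(html, 'html.parser')
# First text node after <strong>Tuesday</strong>
result = soup.find('strong', string='Tuesday').find_next_sibling(string=True)
# The element that follows that text node (a <br />, a <strong>, or None)
next_sibling = result.find_next_sibling()
# Keep printing until the next element is another <strong> or there is none
while next_sibling is not None and next_sibling.name != 'strong':
    print(result)
    result = next_sibling.find_next_sibling(string=True)
    if result is None:  # no text left after the last element
        break
    next_sibling = result.find_next_sibling()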
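
The same "walk the siblings until the next <strong>" idea can also be sketched with the `next_siblings` generator, which avoids tracking two variables by hand. This is only an illustrative variant, not part of the original answer: the helper name `text_after_heading` is made up here, and it assumes each day label sits in its own `<strong>` tag directly inside the container, as in the sample HTML.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<div class="container">
  <strong>Monday</strong> Some info here...<br /> and then some <br />
  <strong>Tuesday</strong> Some info here...<br /> and then some <br />
  <strong>Wednesday</strong> Some info here...<br />
</div>"""


def text_after_heading(soup, label):
    """Collect the text nodes between <strong>label</strong> and the next <strong>.

    Hypothetical helper for illustration; returns a list of stripped strings.
    """
    heading = soup.find("strong", string=label)
    if heading is None:
        return []
    chunks = []
    for node in heading.next_siblings:
        # Stop as soon as the next day's heading is reached
        if getattr(node, "name", None) == "strong":
            break
        # Keep non-empty text nodes; skip <br /> tags and bare whitespace
        if isinstance(node, NavigableString) and node.strip():
            chunks.append(node.strip())
    return chunks


soup = BeautifulSoup(html, "html.parser")
print(text_after_heading(soup, "Tuesday"))
```

Because the loop stops only at the next `<strong>`, it does not matter how many `<br />` tags appear within a day's block.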

Yes, but it only includes the HTML up to the first <br> tag; I need everything from one <strong> to the next <strong>.

user1121487, your original question asked for exactly what the first answer gives you: "Get only Tuesday's data: Tuesday Some info here...". If you want everything "from <strong> to the next <strong>", you should have made that clear from the start. @har07's original answer met your original request.

I think the structure in the example makes it quite clear that I need everything from strong to strong, which is everything for Tuesday, since you can't know how many br tags etc. there will be. @serk