Python can't scrape contents past a comment group

python html web-scraping

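I am trying to scrape the tables from the following page:

When I reach the HTML for the batting tables, I run into a very long comment that contains the HTML for the batting table itself.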

<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
     <div class="section_heading">
     <div class="section_heading_text">
     <div class="placeholder"></div>
     <!-- 
        <div class="table_outer_container">
        .....
        -->
     <div class="table_outer_container mobile_table">
     <div class="footer no_hide_long">

Counting the div elements under it gives me

Out[1]: 3

even though the div id="all_WashingtonSenatorsbatting" element clearly has 5 div children.

Even if I use

from bs4 import Comment

# Remove every comment node from the parsed tree
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

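the resulting soup still does not contain the last two div elements, which are the ones I want to scrape. I have been trying to work on the markup with regular expressions, but no luck so far. Any suggestions?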

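I found a workable solution. With the code below, I pull out the comment (which carries the last two div elements I want to scrape), parse it again with BeautifulSoup, and scrape the table from the result.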

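import requests
from bs4 import BeautifulSoup, Comment

s = requests.get(url).content  # url: the page being scraped
soup = BeautifulSoup(s, "html.parser")
# First wrapper div; its table markup is hidden inside an HTML comment
wrapper = soup.find_all('div', {'class': 'table_wrapper'})[0]
comment = wrapper.find_all(string=lambda x: isinstance(x, Comment))[0]
# Re-parse the comment text as HTML to reach the table
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')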

It took me a while to get here. I would be curious to see whether anyone has another solution, or can explain how this problem comes about.
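
On the "how does this come about" part, here is a rough sketch rather than a definitive explanation. BeautifulSoup's "html.parser" backend is built on Python's standard html.parser module, and an HTML parser hands everything between <!-- and --> over as a single comment string, never as tags. The snippet below uses only the standard library to show that effect. The SAMPLE markup, the TagAndCommentCollector class, and the collect helper are made-up names for illustration. They are not taken from the real page or from BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the page markup (not the real page).
SAMPLE = """
<div id="wrapper" class="table_wrapper">
  <div class="section_heading"></div>
  <div class="placeholder"></div>
  <!--
  <div class="table_outer_container">
    <table id="batting"><tr><td>1</td></tr></table>
  </div>
  -->
</div>
"""


class TagAndCommentCollector(HTMLParser):
    """Record the start tags and the comment bodies the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data)


def collect(markup):
    """Parse markup once and return the collector holding what was seen."""
    parser = TagAndCommentCollector()
    parser.feed(markup)
    parser.close()
    return parser


# First pass: the commented-out markup arrives as one opaque comment string,
# so the table inside it never shows up as a tag.
outer = collect(SAMPLE)

# Second pass: feeding the comment text back in exposes the hidden tags.
inner = collect(outer.comments[0])

print(outer.tags)
print(inner.tags)
```

On the first pass only the three uncommented div tags are reported, and the table is nowhere to be seen. On the second pass, over the comment text alone, the hidden div and table appear. That is the same idea as re-parsing the comment with BeautifulSoup. It would also explain why calling extract() on the comments does not help: it throws the commented-out markup away instead of parsing it.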