Parsing a table with img tags in Python using BeautifulSoup

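I am using BeautifulSoup to parse an HTML page. I need to process the first table on the page. That table has several rows, and each row contains some 'td' tags, one of which contains an 'img' tag. I want to get all of the information in that table. However, when I print the table, I do not get any data related to the 'img' tag.

I use soup.findAll("table") to get all the tables and then pick the first one to process. The HTML looks like this: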

<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border = "0"> <tr> <td class="gridcell">
               <img id="img_id" title="img_title" src="img_src" alt="img_alt" /> </td> </tr>
      </table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>

You have a nested table, so you need to check where you are in the tree before parsing the tr/td/img tags.
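from bs4 import BeautifulSoup

# Read the page from disk; the file is closed automatically
with open('test.html', 'rb') as f:
    html = f.read()

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')

for table in tables:
    # Only process tables that are nested inside another table
    if table.find_parent("table") is not None:
        for tr in table.find_all('tr'):
            # Search the cells of this row, not of the whole table
            for td in tr.find_all('td'):
                for img in td.find_all('img'):
                    print(img['id'])
                    print(img['src'])
                    print(img['title'])
                    print(img['alt'])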

Based on your example, it returns the following:
img_id
img_src
img_title
img_alt
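
To try the same idea without a test.html file, here is a minimal sketch that embeds the sample markup as a string and collects the attributes into a list instead of printing them. The helper name nested_table_images and the constant SAMPLE_HTML are hypothetical, and the markup assumes the two missing `>` characters in the question were typos.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the unclosed tags completed (assumption)
SAMPLE_HTML = """
<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border="0"> <tr> <td class="gridcell">
        <img id="img_id" title="img_title" src="img_src" alt="img_alt" /> </td> </tr>
      </table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>
"""


def nested_table_images(html):
    """Return the attributes of every img found inside a nested table."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for table in soup.find_all("table"):
        # Skip top-level tables; only tables with a table ancestor are nested
        if table.find_parent("table") is None:
            continue
        for img in table.find_all("img"):
            # .get() returns None instead of raising if an attribute is absent
            results.append({key: img.get(key) for key in ("id", "src", "title", "alt")})
    return results


print(nested_table_images(SAMPLE_HTML))
```

Using img.get(...) rather than img[...] is a design choice for pages where some images lack a title or alt attribute; indexing would raise a KeyError in that case.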

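soup.find('table') will give you the first table; if you only need the first element, there is no need to find them all. You can also call .find() and .find_all() on any BeautifulSoup element, so table.find('img') will give you the image too. What information do you want to extract?

Thanks for the tips. Can I use td.find('img') or something similar? I want to find out what the src attribute is and which 'td' it belongs to. I actually need to read the title of the 'img' tag and then decide, based on that title, whether the corresponding 'td' is of any value to me.

soup.findall('img') gives me all the images, but table = soup.find('table') followed by table.findall('img') gives me "None". Any ideas?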
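
One possible way to handle the follow-up above (read each img's title, then get the 'td' it belongs to) is to start from the images and walk upward with find_parent. This is only a sketch: the function name cells_with_image_title and the constant SAMPLE_HTML are hypothetical, and the markup assumes the unclosed tags in the question were typos.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the unclosed tags completed (assumption)
SAMPLE_HTML = """
<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border="0"> <tr> <td class="gridcell">
        <img id="img_id" title="img_title" src="img_src" alt="img_alt" /> </td> </tr>
      </table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>
"""


def cells_with_image_title(html, wanted_title):
    """Return (src, td) pairs for images in the first table whose title matches."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # first table on the page
    matches = []
    # find_all searches all descendants, so images in nested tables are included
    for img in table.find_all("img"):
        if img.get("title") != wanted_title:
            continue
        td = img.find_parent("td")  # nearest enclosing td
        matches.append((img.get("src"), td))
    return matches


for src, td in cells_with_image_title(SAMPLE_HTML, "img_title"):
    print(src, td.get("class"))
```

Note that find_parent("td") returns the nearest enclosing cell, which here is the inner "gridcell" td; calling td.find_parent("td") again would reach the outer row's cell if that is the one of interest.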