
Python: How do I ignore th tags when parsing an HTML table?


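Hello, I'm very new to parsing HTML tables with Python and BeautifulSoup4. Everything was going well until I ran into this odd table that uses a "th" tag in the middle of the table, which makes my parsing quit and throw an "index out of range" error. I've tried searching SO and Google, to no avail. The question is: how can I ignore or strip out this rogue "th" tag while parsing the table?

Here is the code I have so far: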

from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
url = 'https://www.moscone.com/site/do/event/list'
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find('table', { 'id' : 'list' })

for row in table.findAll('tr')[3:]:
    col = row.findAll('td')
    date = col[0].string
    name = col[1].string
    location = col[2].string
    record = (name, date, location)
    final = ','.join(record)
    print(final)
Here is a small snippet of the HTML that causes my error:

  <td>
   Convention
  </td>
 </tr>
 <tr>
  <th class="title" colspan="4">
   Mon Dec 01 00:00:00 PST 2014
  </th>
 </tr>
 <tr>
  <td>
   12/06/14 - 12/09/14
  </td>


I do need the data above and below this rogue "th" in the table, which marks the start of a new month.

You can simply check whether a th is in the row; if it isn't, parse the contents, like this:

for row in table.findAll('tr')[3:]:
    # so make sure th is not in row
    if not row.find_all('th'):
        col = row.findAll('td')
        date = col[0].string
        name = col[1].string
        location = col[2].string
        record = (name, date, location)
        final = ','.join(record)
        print(final)
This is the result I get from the URL you provided, with no IndexError:

Out & Equal Workplace,11/03/14 - 11/06/14,Moscone West 
Samsung Developer Conference,11/11/14 - 11/13/14,Moscone West  
North American Spine Society (NASS) Annual Meeting,11/12/14 - 11/15/14,Moscone South and Esplanade Ballroom 
San Francisco International Auto Show,11/22/14 - 11/29/14,Moscone North & South 
67th Annual Meeting of the APS Division of Fluid Dynamics,11/23/14 - 11/25/14,Moscone North, South and West 
American Society of Hematology,12/06/14 - 12/09/14,Moscone North, South and West 
California School Boards Association,12/12/14 - 12/16/14,Moscone North & Esplanade Ballroom 
American Geophysical Union,12/15/14 - 12/19/14,Moscone North & South
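
For anyone who wants to try the idea without hitting the live site, here is a minimal offline sketch. It is a variation on the check above rather than the same code: it skips any row with fewer than three td cells, which covers the th-only month rows as well as other short rows. SAMPLE_HTML and extract_records are made-up names for illustration, the markup is only modeled on the snippet in the question, and the column order (date, name, location) is assumed from the question's code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the snippet in the question;
# the real page may be structured differently.
SAMPLE_HTML = """
<table id="list">
  <tr>
    <td>11/03/14 - 11/06/14</td>
    <td>Out &amp; Equal Workplace</td>
    <td>Moscone West</td>
    <td>Convention</td>
  </tr>
  <tr>
    <th class="title" colspan="4">Mon Dec 01 00:00:00 PST 2014</th>
  </tr>
  <tr>
    <td>12/06/14 - 12/09/14</td>
    <td>American Society of Hematology</td>
    <td>Moscone North, South and West</td>
    <td>Convention</td>
  </tr>
</table>
"""


def extract_records(html):
    """Return (name, date, location) tuples, skipping rows without enough td cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "list"})
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # A month-header row holds only a th, so it has no td cells at all.
        if len(cells) < 3:
            continue
        # get_text(strip=True) also trims the whitespace around each cell's text.
        date, name, location = (cell.get_text(strip=True) for cell in cells[:3])
        records.append((name, date, location))
    return records


records = extract_records(SAMPLE_HTML)
for record in records:
    print(",".join(record))
```

Note that some locations contain commas themselves (for example "Moscone North, South and West"), so if the output is meant to be read back as CSV, writing the rows with the standard csv module is probably safer than a plain ','.join.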

Thank you very much. This adds to my limited knowledge and will definitely come in handy in the future :)