Python: Getting values from multiple td and th elements without tags using BeautifulSoup

python html web-scraping

I have a page that looks like this:

<tr>
    <th class="fst" scope="col">time(*)</th>
    <th scope="col">field</th>
    <th scope="col">1 session</th>
    <th scope="col">2 session</th>
    <th scope="col">3 session</th>
    <th scope="col">4 session</th>
    <th scope="col">5 session</th>
    <th scope="col">6 session</th>
</tr>
<tr>
   <th class="num_area" rowspan="11" scope="row">77</th>
   <td class="txt_category">bus</td>
   <td>58456</td>                                                                   
   <td>62891</td>                                                                    
   <td>63076</td>                                                             
   <td>53282</td>                                                                 
   <td>54805</td>                                                             
   <td>55097</td>
</tr>
<tr>
   <td class="txt_category">taxi</td>
   <td>-</td>
   <td>-</td>
   <td>-</td>
   <td>62891</td>
   <td>-</td>
   <td>-</td>
</tr>
<tr>                         
    <th class="fst" scope="col">time(*)</th>
    <th scope="col">field</th>
    <th scope="col">7 session</th>
    <th scope="col">8 session</th>
    <th scope="col">9 session</th>
    <th scope="col">10 session</th>
    <th scope="col">11 session</th>
    <th scope="col">12 session</th>
</tr>
<tr>
   <th class="num_area" rowspan="11" scope="row">100</th>
   <td class="txt_category">bus</td>
   <td>1342</td>                                                                   
   <td>138470</td>                                                                    
   <td>878840</td>                                                             
   <td>7653</td>                                                                 
   <td>4422</td>                                                             
   <td>87630</td>
</tr>

from bs4 import BeautifulSoup
from selenium import webdriver


def scraping():
    # `url` and `template` (a dict) are assumed to be defined elsewhere.
    driver = webdriver.PhantomJS()
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    result = []
    for row in soup.findAll('tr'):
        header = row.findAll('th')
        if len(header) < 1:
            continue
        if len(header) == 7:
            for num in range(1, 7):
                date = header[num].find(text=True)
        if len(header) == 8:
            for num in range(1, 8):
                date = header[num].find(text=True)
        body = row.findAll('td')
        if len(body) < 1:
            continue
        field_name = body[0].find(text=True)
        template['field_name'] = field_name
        for num in range(1, 7):
            cost = body[num].find(text=True)
            template['cost'] = cost
        result.append(template)

The scraping() function above is what I have tried so far.

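Sometimes the number of th elements is 7 and sometimes it is 8, so I decided to use range. However, after running it, the result list seems to contain only one dictionary, which is not what I want. Is there a good way to scrape these values?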
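
One possible direction, offered as a minimal sketch rather than a definitive fix: track the most recent header row and the most recent row label (the th with scope="row"), and build a new dictionary for every value instead of appending the same `template` object repeatedly. The names `parse_rows` and `SAMPLE_HTML`, the shortened sample table, and the output keys ("time", "session") are assumptions made for illustration. The sketch also uses the built-in "html.parser" on a static string instead of a Selenium page source.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample modeled on the table in the question.
SAMPLE_HTML = """
<table>
<tr>
  <th class="fst" scope="col">time(*)</th><th scope="col">field</th>
  <th scope="col">1 session</th><th scope="col">2 session</th>
</tr>
<tr>
  <th class="num_area" rowspan="2" scope="row">77</th>
  <td class="txt_category">bus</td><td>58456</td><td>62891</td>
</tr>
<tr>
  <td class="txt_category">taxi</td><td>-</td><td>62891</td>
</tr>
</table>
"""


def parse_rows(html):
    soup = BeautifulSoup(html, "html.parser")
    result = []
    sessions = []      # column labels from the most recent header row
    time_value = None  # row label that a rowspan carries down to later rows
    for row in soup.find_all("tr"):
        col_headers = row.find_all("th", scope="col")
        if col_headers:
            # Header row: skip the "time(*)" and "field" columns.
            sessions = [th.get_text(strip=True) for th in col_headers[2:]]
            continue
        row_header = row.find("th", scope="row")
        if row_header is not None:
            time_value = row_header.get_text(strip=True)
        cells = row.find_all("td")
        if not cells:
            continue
        field_name = cells[0].get_text(strip=True)
        # zip() pairs labels with values, so no hard-coded range is needed.
        for session, cell in zip(sessions, cells[1:]):
            # A new dict per value; reusing one dict would overwrite earlier data.
            result.append({
                "time": time_value,
                "field_name": field_name,
                "session": session,
                "cost": cell.get_text(strip=True),
            })
    return result


records = parse_rows(SAMPLE_HTML)
```

Because the header and data cells are paired with zip(), the row length (7 or 8 th elements) no longer has to be checked explicitly, and rows such as "taxi" that have no th of their own still inherit the last seen time value.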