Python Beautifulsoup HTML table parsing -- only able to get the last row?


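I have a simple HTML table to parse, but somehow Beautifulsoup only gets me results from the last row. I wonder if anyone could take a look and see what is wrong. I have already created the rows object from this HTML table: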

 <table class='participants-table'>
    <thead>
      <tr>
          <th data-field="name" class="sort-direction-toggle name">Name</th>
          <th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
          <th data-field="sector" class="sort-direction-toggle sector">Sector</th>
          <th data-field="country" class="sort-direction-toggle country">Country</th>
          <th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
      </tr>
    </thead>
    <tbody>
        <tr>
          <th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
          <td class='type'>Company</td>
          <td class='sector'>General Industrials</td>
          <td class='country'>Netherlands</td>
          <td class='joined-on'>2000-09-20</td>
        </tr>
        <tr>
          <th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
          <td class='type'>Company</td>
          <td class='sector'>Pharmaceuticals &amp; Biotechnology</td>
          <td class='country'>Portugal</td>
          <td class='joined-on'>2004-02-19</td>
        </tr>
    </tbody>
  </table>
This gives:

rows=[<tr>
 <th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
 <td class="type">Company</td>
 <td class="sector">General Industrials</td>
 <td class="country">Netherlands</td>
 <td class="joined-on">2000-09-20</td>
 </tr>, <tr>
 <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
 <td class="type">Company</td>
 <td class="sector">Pharmaceuticals &amp; Biotechnology</td>
 <td class="country">Portugal</td>
 <td class="joined-on">2004-02-19</td>
 </tr>]
But I am only able to get the last entry:

cells=[<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]

What is going on? This is my first time using Beautifulsoup, and what I want to do is export this table to CSV. Any help is greatly appreciated! Thanks

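If you want all the th tags in a single list, you need to extend a list rather than reassign it. Your loop keeps reassigning cells = row.find_all('th'), so when you print cells outside the loop you only see what it was last assigned, i.e. the th cells of the last tr: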

cells = []
for row in rows:
    cells.extend(row.find_all('th'))
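
To see the difference on the question's data, here is a small self-contained sketch. The html string is an abbreviated copy of the question's table body, and the names last_only and all_cells are introduced here purely for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's table body (an assumption for this demo).
html = """
<table class='participants-table'>
  <tbody>
    <tr><th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th></tr>
    <tr><th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th></tr>
  </tbody>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find_all("tr")

# Reassigning: each iteration overwrites the result of the previous one.
for row in rows:
    last_only = row.find_all("th")

# Extending: the th cells of every row accumulate in one list.
all_cells = []
for row in rows:
    all_cells.extend(row.find_all("th"))

print([th.get_text() for th in last_only])   # ['Groupe Bial']
print([th.get_text() for th in all_cells])   # ['Grontmij', 'Groupe Bial']
```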
Also, since there is only one table, you can use find:
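
A minimal sketch of that lookup, assuming the page markup is held in a string called html (shortened here to one header row and one data row). The variable names are illustrative and not taken from the original answer.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for this demo).
html = """
<table class='participants-table'>
  <thead><tr><th>Name</th><th>Type</th></tr></thead>
  <tbody>
    <tr><th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
        <td class='type'>Company</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None), not a list.
table = soup.find("table", class_="participants-table")

# All tr tags under that table, header row included.
rows = table.find_all("tr")
print(len(rows))  # 2
```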

If you want to skip the thead row, you can use a css selector:
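
One selector that does this, sketched under the assumption that the data rows sit inside a tbody as they do in the question's markup, is table.participants-table tbody tr. The html string below is a shortened stand-in, and the selector is one possible choice rather than necessarily the one originally posted.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for this demo).
html = """
<table class='participants-table'>
  <thead><tr><th>Name</th><th>Type</th></tr></thead>
  <tbody>
    <tr><th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
        <td class='type'>Company</td></tr>
    <tr><th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
        <td class='type'>Company</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only tr tags that are descendants of tbody, so the thead row is left out.
rows = soup.select("table.participants-table tbody tr")

cells = []
for row in rows:
    cells.extend(row.find_all("th"))

print([th.get_text() for th in cells])  # ['Grontmij', 'Groupe Bial']
```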

cells will give you:

[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]

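How are rows defined?
Thanks! I have added more details about the table and the code.
It is doing exactly what you ask it to. Are you trying to get all the td?
Thanks! That gets me closer to my goal! What I actually want is to export this table in a typical CSV format, with the "Name" and the HTML link as separate columns. Is there a way to do that with the "extend" approach you just suggested? Thanks
@AD233, so you basically want to recreate the table in CSV?
That is correct, but I want to extract the href link as a separate column. Thanks
Great, thanks for your help! This is really interesting.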
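import csv

from bs4 import BeautifulSoup

# html is assumed to hold the table markup from the question.
soup = BeautifulSoup(html, "html.parser")

# Every tr in the table: rows[0] is the header row, the rest are data rows.
rows = soup.select("table.participants-table tr")

# newline="" lets the csv module control line endings (avoids blank lines on Windows).
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    # Header cells plus an extra column for the link.
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])

    for row in rows[1:]:
        # Ask for th/td only: a bare find_all() also returns the nested <a>,
        # which would write the name twice.
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])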
For your sample, you will get:

Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial