Python web scraping: trying to find rows in a table


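I'm trying to build a table, and I want to extract most of the td data from it. I can get some of it from the rows, but I can't get the individual tds correctly. What do I need to do to extract the td data? I need the data in the tds whose class name is like standing-table__cell, or I could get the data from all the tds and sort through it afterwards.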

Sample output:

[<tr class="standing-table__row">
<th class="standing-table__cell standing-table__header-cell" data-index="0" data-label="pos" title="Position">#</th>
<th class="standing-table__cell standing-table__header-cell standing-table__cell--name" data-index="1" title="Team">Team</th>
<th class="standing-table__cell standing-table__header-cell" data-index="2" data-label="pld" title="Played">Pl</th>
<th class="standing-table__cell standing-table__header-cell" data-index="9" data-label="pts" data-sort-value="use-attribute">Pts</th>
<th class="standing-table__cell standing-table__header-cell is-hidden--bp15 is-hidden--bp35 " data-index="10" data-sort-value="use-attribute">Last 6</th>
</tr>, <tr class="standing-table__row" data-item-id="345">
<td class="standing-table__cell">1</td>
<td class="standing-table__cell standing-table__cell--name" data-long-name="Manchester City" data-short-name="Manchester City">
<a class="standing-table__cell--name-link" href="/manchester-city">Manchester City</a>
</td>
<td class="standing-table__cell">9</td>
<td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 " data-sort-value="16313333">
<div class="standing-table__form">
<span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 2-1 Newcastle United"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 3-0 Fulham"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Cardiff City 0-5 Manchester City"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 2-0 Brighton and Hove Albion"> </span><span class="standing-table__form-cell standing-table__form-cell--draw" title="Liverpool 0-0 Manchester City"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 5-0 Burnley"> </span> </div>
</td>
</tr>, <tr class="standing-table__row" data-item-id="155">
<td class="standing-table__cell">2</td>
<td class="standing-table__cell standing-table__cell--name" data-long-name="Liverpool" data-short-name="Liverpool">
  File "C:\Users\scrape.py", line 18, in <module>
    for td in premier_soup_tr.find_all('td', {'class': 'standing-table__cell'}):
  File "C:\Python\Python36\lib\site-packages\bs4\element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
>>>

The HTML source looks like this:

    <tr class="standing-table__row" data-item-id="345">
  <td class="standing-table__cell">1</td>
  <td class="standing-table__cell standing-table__cell--name" data-short-name="Manchester City" data-long-name="Manchester City">

            <a href="/manchester-city" class="standing-table__cell--name-link">Manchester City</a>

  </td>
  <td class="standing-table__cell">9</td>
  <td class="standing-table__cell">23</td>
  <td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 " data-sort-value="16313333">
          <div class="standing-table__form">
      <span title="Manchester City 2-1 Newcastle United" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Manchester City 3-0 Fulham" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Cardiff City 0-5 Manchester City" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Manchester City 2-0 Brighton and Hove Albion" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Liverpool 0-0 Manchester City" class="standing-table__form-cell standing-table__form-cell--draw"> </span><span title="Manchester City 5-0 Burnley" class="standing-table__form-cell standing-table__form-cell--win"> </span>        </div>
        </td>

</tr>
    <tr class="standing-table__row" data-item-id="155">
  <td class="standing-table__cell">2</td>
  <td class="standing-table__cell standing-table__cell--name" data-short-name="Liverpool" data-long-name="Liverpool">

            <a href="/liverpool" class="standing-table__cell--name-link">Liverpool</a>

  </td>

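You have the right idea, but you need to handle what you actually get back. find_all returns a set of results (a ResultSet), so you cannot call premier_soup_tr.find_all on it directly. The correct way is premier_soup_tr[position].find_all, which calls find_all on one row at a time.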
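To make the difference concrete, here is a minimal sketch that reproduces the error and then indexes a single row. The two-row HTML string is a hypothetical stand-in modelled on the table in the question, not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row snippet modelled on the table in the question.
HTML = """
<table>
<tr class="standing-table__row"><td class="standing-table__cell">1</td><td class="standing-table__cell standing-table__cell--name">Manchester City</td></tr>
<tr class="standing-table__row"><td class="standing-table__cell">2</td><td class="standing-table__cell standing-table__cell--name">Liverpool</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find_all("tr", {"class": "standing-table__row"})

# rows is a ResultSet (a list of Tag objects), so calling find_all
# on it raises the AttributeError shown in the traceback.
try:
    rows.find_all("td")
    raised = False
except AttributeError:
    raised = True

# Indexing selects one row, which is a Tag and does support find_all.
first_row_cells = [
    td.get_text(strip=True)
    for td in rows[0].find_all("td", {"class": "standing-table__cell"})
]
print(raised, first_row_cells)
```

Matching on the class standing-table__cell also picks up cells that carry extra classes, such as the team-name cell, because Beautiful Soup compares against each class in the attribute individually.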

Here is what I did:

import requests
from bs4 import BeautifulSoup
url = 'https://www.skysports.com/premier-league-table'
premier_r = requests.get(url)
print(premier_r.status_code)
premier_soup = BeautifulSoup(premier_r.text, 'html.parser')
premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
result = [[r.text.strip() for r in td.find_all('td', {'class': 'standing-table__cell'})][:-1] for td in premier_soup_tr[1:]]
print(result)

Output:

[['1', 'Manchester City', '9', '7', '2', '0', '26', '3', '23', '23'], ['2', 'Liverpool', '9', '7', '2', '0', '16', '3', '13', '23'], ['3', 'Chelsea', '9', '6', '3', '0', '20', '7', '13', '21'], ['4', 'Arsenal', '9', '7', '0', '2', '22', '11', '11', '21'], ['5', 'Tottenham Hotspur', '9', '7', '0', '2', '16', '7', '9', '21'], ['6', 'Bournemouth', '9', '5', '2', '2', '16', '12', '4', '17'],
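If lists of bare strings are hard to work with, one option is to pair each value with its column header. The following is only a sketch under assumptions: the HTML is a shortened, hypothetical version of the table with four columns, and the names headers and records are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the standings table.
HTML = """
<table>
<tr class="standing-table__row">
<th class="standing-table__cell" title="Position">#</th>
<th class="standing-table__cell" title="Team">Team</th>
<th class="standing-table__cell" title="Played">Pl</th>
<th class="standing-table__cell">Last 6</th>
</tr>
<tr class="standing-table__row">
<td class="standing-table__cell">1</td>
<td class="standing-table__cell standing-table__cell--name">
<a href="/manchester-city">Manchester City</a>
</td>
<td class="standing-table__cell">9</td>
<td class="standing-table__cell"><div class="standing-table__form"><span> </span></div></td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find_all("tr", {"class": "standing-table__row"})

# The first row holds the th header cells; the remaining rows hold td cells.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    values = [
        td.get_text(strip=True)
        for td in row.find_all("td", {"class": "standing-table__cell"})
    ]
    # zip pairs each header with the value in the same column position.
    records.append(dict(zip(headers, values)))

print(records)
```

Here the form cell is kept and simply maps to an empty string, since it contains no text.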

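Thanks. The line starting with result looks complicated! Would you mind explaining what the result line does? Is it stripping whitespace from the second item through to the second-to-last item?

@anfield premier_soup_tr contains multiple <tr> elements. For each element in premier_soup_tr, [r.text.strip() for r in td.find_all…] extracts the text of each cell and strips surrounding whitespace such as \n. The slices [1:] and [:-1] keep empty results out of the output: [1:] skips the header row, which contains only th cells, and [:-1] drops the last cell of each row, the form cell, which has no text.

Thanks. But I'm a bit confused. For example, why doesn't this work once you have the rows? Why does the second line, the premier_soup_tr.find_all part, fail? In the first line we have the rows; in the second line we try to get the tds in each row.

premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
for td in premier_soup_tr.find_all('td', {'class': 'standing-table__cell'}):
    print(td)

@anfield As in my answer: find_all returns a list, and you cannot call find_all on a list. But you can call it on each element of that list.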
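The loop from the comments can be repaired by iterating over the rows first and calling find_all on each row. The sketch below does that with explicit loops and checks that it matches the one-line comprehension from the answer; the HTML string is a hypothetical stand-in for the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one header row, two data rows.
HTML = """
<table>
<tr class="standing-table__row"><th class="standing-table__cell">#</th><th class="standing-table__cell">Team</th><th class="standing-table__cell">Pl</th><th class="standing-table__cell">Last 6</th></tr>
<tr class="standing-table__row"><td class="standing-table__cell">1</td><td class="standing-table__cell standing-table__cell--name">
<a href="/manchester-city">Manchester City</a>
</td><td class="standing-table__cell">9</td><td class="standing-table__cell"><div><span> </span></div></td></tr>
<tr class="standing-table__row"><td class="standing-table__cell">2</td><td class="standing-table__cell standing-table__cell--name">
<a href="/liverpool">Liverpool</a>
</td><td class="standing-table__cell">9</td><td class="standing-table__cell"><div><span> </span></div></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
premier_soup_tr = soup.find_all("tr", {"class": "standing-table__row"})

result_loop = []
for row in premier_soup_tr[1:]:  # [1:] skips the header row (th cells only)
    cells = []
    # find_all is called on each row (a Tag), not on the list of rows.
    for td in row.find_all("td", {"class": "standing-table__cell"}):
        cells.append(td.text.strip())  # strip removes newlines and spaces
    result_loop.append(cells[:-1])  # [:-1] drops the text-less form cell

# The same thing written as the one-line comprehension from the answer.
result_comprehension = [
    [r.text.strip() for r in td.find_all("td", {"class": "standing-table__cell"})][:-1]
    for td in premier_soup_tr[1:]
]
print(result_loop)
```

Note that in the answer's comprehension the outer variable named td actually holds a row and the inner variable r holds a cell, which is part of why the line is hard to read.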