Python: scraping a table from a football recruiting website

python python-3.x selenium google-colaboratory

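I want to create a table exactly like the one shown on this web page (the URL is in the code below):

I am currently doing this with Selenium and Beautiful Soup in a Google Colab notebook, because I got a Forbidden error when I ran the "read_html" command. I have just started getting some output, but I only want to grab the text, not the markup surrounding it.

Here is my code so far: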

from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)

soup  = BeautifulSoup(wd.page_source)

school=soup.find_all('span', class_='meta')    
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')

status

...and this is my output:

[<p class="commit-date withDate"> Commit 7/25/2020  </p>,
 <p class="commit-date withDate"> Commit 9/4/2020  </p>,
 <p class="commit-date withDate"> Commit 1/1/2021  </p>,
 <p class="commit-date withDate"> Commit 3/8/2021  </p>,
 <p class="commit-date withDate"> Commit 10/29/2020  </p>,
 <p class="commit-date withDate"> Commit 7/28/2020  </p>,
 <p class="commit-date withDate"> Commit 9/8/2020  </p>,
 <p class="commit-date withDate"> Commit 8/3/2020  </p>,
 <p class="commit-date withDate"> Commit 5/1/2021  </p>]

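What I would like to get instead:

[Commit 7/25/2020, Commit 9/4/2020, Commit 1/1/2021, Commit 3/8/2021, Commit 10/29/2020, Commit 7/28/2020, Commit 9/8/2020, Commit 8/3/2020, Commit 5/1/2021]

Any help with this would be greatly appreciated.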

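You don't need Selenium here. You need to specify HTTP headers to get a response from the website; otherwise the site assumes you are a bot and blocks you.

To create a DataFrame, see this example: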

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}


response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []

for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:  # `[1:]` Since the first result is a table header
    school = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    status = tag.find_next("p", class_="commit-date withDate").text

    data.append(
        {
            "school": school,
            "name": name,
            "position": position,
            "height_weight": height_weight,
            "rating": rating,
            "nat_rank": nat_rank,
            "state_rank": state_rank,
            "pos_rank": pos_rank,
            "status": status,
        }
    )

df = pd.DataFrame(data)

print(df.to_string())

Output:

                                                    school            name position height_weight  rating nat_rank state_rank pos_rank                status
0                  Westerville South (Westerville, OH)      Kaden Saunders      WR    5-10 / 172   0.9509      116          5       16    Commit 7/25/2020  
1                          IMG Academy (Bradenton, FL)        Drew Shelton      OT     6-5 / 290   0.9468      130         17       14     Commit 9/4/2020  
2                Central Dauphin East (Harrisburg, PA)       Mehki Flowers      WR     6-1 / 190   0.9461      131          4       18     Commit 1/1/2021  
3                                  Medina (Medina, OH)          Drew Allar     PRO     6-5 / 220   0.9435      138          6        8     Commit 3/8/2021  
4                     Manheim Township (Lancaster, PA)        Anthony Ivey      WR     6-0 / 190   0.9249      190          6       26   Commit 10/29/2020  
5                                 King (Milwaukee, WI)         Jerry Cross      TE     6-6 / 218   0.9153      218          4        8    Commit 7/28/2020  
6                         Northeast (Philadelphia, PA)          Ken Talley     WDE     6-3 / 230   0.9069      253          9       13     Commit 9/8/2020  
7                              Central York (York, PA)        Beau Pribula    DUAL     6-2 / 215   0.8891      370         12        9     Commit 8/3/2020  
8   The Williston Northampton School (Easthampton, MA)       Maleek McNeil      OT     6-8 / 340   0.8593      705          8       64     Commit 5/1/2021
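The status column above still carries the whitespace that surrounds the text in the page markup, and `find_next` keeps searching past the current list item, so a field missing from one recruit would silently be filled from the next one. A minimal sketch of one way to tighten this, assuming a simplified, invented HTML sample (`SAMPLE_HTML`) that only imitates the class names used above; the helper names `text_or_none` and `parse_rows` are hypothetical, and the real page markup may differ:

```python
from bs4 import BeautifulSoup

# Invented, simplified sample that only mimics the class names used above.
SAMPLE_HTML = """
<ul>
  <li class="ri-page__list-item">header row</li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link"> Kaden Saunders </a>
    <span class="meta"> Westerville South (Westerville, OH) </span>
    <div class="position"> WR </div>
    <p class="commit-date withDate"> Commit 7/25/2020  </p>
  </li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link">Drew Shelton</a>
    <div class="position">OT</div>
  </li>
</ul>
"""


def text_or_none(parent, name, class_):
    """Return the stripped text of the first matching descendant, or None."""
    tag = parent.find(name, class_=class_)  # searches inside `parent` only
    return tag.get_text(strip=True) if tag is not None else None


def parse_rows(html):
    """Build one dict per list item, skipping the first (header) item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="ri-page__list-item")[1:]:
        rows.append(
            {
                "name": text_or_none(item, "a", "ri-page__name-link"),
                "school": text_or_none(item, "span", "meta"),
                "position": text_or_none(item, "div", "position"),
                "status": text_or_none(item, "p", "commit-date withDate"),
            }
        )
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` in the same way as `data` in the answer; missing fields then show up as empty cells rather than as values borrowed from another row.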

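What is the main difference between this approach and Selenium? As far as I know, Selenium is the best choice for dynamically loaded elements, and bs is faster.

@Vitalis Exactly. However, in this case the page is not loaded dynamically, and the OP seems to have used Selenium only because they were being blocked. That is why we add the user-agent, to avoid being blocked.

Selenium can also set the user-agent. But I agree that this approach is the best one in this case.

@Vitalis Yes, what I meant is that the OP thought the page was loaded dynamically because they got no response when sending the request.

This worked for me! Thank you very much for your help.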
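One way to tell a blocked request apart from a page that really is rendered dynamically is to look at the HTTP status code before parsing anything. The sketch below is only an illustration under that assumption: `make_session` and `fetch_html` are hypothetical helper names, the user-agent string is the one from the answer, and no request is sent until `fetch_html` is called.

```python
import requests

# Same user-agent string as in the answer above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Create a session that sends the given user-agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # e.g. a 403 surfaces here as an exception
    return response.text


session = make_session()
# Example call (performs a real network request, so it is left commented out):
# html = fetch_html(session, "https://247sports.com/college/penn-state/Season/2022-Football/Commits/")
```

If `fetch_html` raises for a 403, the request was rejected and headers are the first thing to check; if it returns HTML that simply lacks the expected elements, dynamic loading becomes the more likely explanation.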