
Python HTML Parser (Unnamed Level)


I am building a screen scraper that pulls football statistics. I'm currently scraping the main player stats page and then going into each player's individual page to get their stats by year.

I was able to implement this process successfully for my first group of players, quarterbacks, using the passing table. However, when I tried to recreate the process to pull the rushing data, I got an additional column level in my dataframe with values like Unnamed: x_level_0. This is my first time working with HTML data, so I'm not sure what I missed; I just assumed the code would be the same as for the quarterbacks.
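A likely cause (an assumption, since the scraping code itself isn't shown): the rushing table has two header rows, a category row such as "Games"/"Rushing" above the stat names, so pandas returns MultiIndex columns and fills blank top-level cells with `Unnamed: <n>_level_0` placeholders. A minimal sketch, building the same two-level column structure by hand rather than via `read_html`:

```python
import pandas as pd

# Reconstruct the column structure from the incorrect RB output:
# blank cells in the top header row become "Unnamed: <n>_level_0".
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Year"),
    ("Unnamed: 1_level_0", "Age"),
    ("Games", "G"),
    ("Rushing", "Rush"),
])
df = pd.DataFrame([["2020", 26, 1, 31]], columns=columns)

# The stray values are the level-0 labels of the MultiIndex.
print(df.columns.get_level_values(0).tolist())
```

The QB passing table has only a single header row, which is why that dataframe comes back with ordinary flat columns.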

Here is the QB code sample and the correct dataframe:

The output looks like this:

Philip Rivers
Year      2020
Age         39
Tm         IND
Pos         qb
No.         17
G            1
GS           1
Here is the RB code sample and the incorrect dataframe:

The output looks like this:

Unnamed: 0_level_0   Year       2020
Unnamed: 1_level_0   Age          26
Unnamed: 2_level_0   Tm          TEN
Unnamed: 3_level_0   Pos          rb
Unnamed: 4_level_0   No.          22
Games                G             1
                     GS            1
Rushing              Rush         31
                     Yds         116
                     TD            0
An example URL this data is pulled from is:

It is pulling both rushing and receiving stats. Is there something else I need to account for when parsing the HTML?

I tried adding index_col=1 to my tdf = pd.read_html(url + stub)[1]. However, that just combined two values into one column.

Any input on this would be greatly appreciated. Let me know if I can provide any further information.
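For what it's worth, one hedged alternative to index_col is to drop the extra header level after reading the table. A sketch, rebuilding a small frame with the same two-level columns (the names are illustrative) and flattening it with droplevel:

```python
import pandas as pd

# A stand-in for the frame returned by pd.read_html for the
# rushing table: two column levels, the top one partly unnamed.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Year"),
    ("Games", "G"),
    ("Rushing", "Rush"),
])
tdf = pd.DataFrame([["2020", 1, 31]], columns=columns)

# Drop the category level so only the stat names remain,
# matching the flat columns of the QB passing frame.
tdf.columns = tdf.columns.droplevel(0)
print(list(tdf.columns))  # ['Year', 'G', 'Rush']
```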


Thanks!

You can try this code to parse the passing table for each player. Right now I'm getting the players from the 2020 passing page, but you can pass any player URL to it:

import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape_player(player_name, player_url, year="2020"):
    out = []

    soup = BeautifulSoup(requests.get(player_url).content, 'html.parser')

    row = soup.select_one('table#passing tr:has(th:contains("{}"))'.format(year))
    if row:
        tds = [player_name] + [t.text for t in row.select('th, td')]
        headers = ['Name'] + [th.text for th in row.find_previous('thead').select('th')]
        out.append(dict(zip(headers, tds)))

    return out

url = 'https://www.pro-football-reference.com/years/2020/passing.htm'
all_data = []
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for player in soup.select('table#passing [data-stat="player"] a'):
    print(player.text)
    for data in scrape_player(player.text, 'https://www.pro-football-reference.com' + player['href']):
        all_data.append(data)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Creates this CSV:

EDIT: To parse Rushing & Receiving, you can use this script:

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment


def scrape_player(player_name, player_url, year="2020"):
    out = []

    soup = BeautifulSoup(requests.get(player_url).content, 'html.parser')
    soup = BeautifulSoup(soup.select_one('#rushing_and_receiving_link').find_next(text=lambda t: isinstance(t, Comment)), 'html.parser')

    row = soup.select_one('table#rushing_and_receiving tr:has(th:contains("{}"))'.format(year))
    if row:
        tds = [player_name] + [t.text for t in row.select('th, td')]
        headers = ['Name'] + [th.text for th in row.find_previous('thead').select('tr')[-1].select('th')]
        out.append(dict(zip(headers, tds)))

    return out

url = 'https://www.pro-football-reference.com/years/2020/passing.htm'
all_data = []
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for player in soup.select('table#passing [data-stat="player"] a'):
    print(player.text)
    for data in scrape_player(player.text, 'https://www.pro-football-reference.com' + player['href']):
        all_data.append(data)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Creates this CSV:


Thank you so much! This is so efficient and works like a dream. If you don't mind, could you confirm my understanding of how this works? In scrape_player, the row variable sets which tag holds the data to pull, and then I change the variables in url and soup so that, based on the tag in the HTML, the right stat table is selected (passing, rushing, receiving, etc.). This is incredible!

@MCJNY1992 Yes, every table has a unique ID, so when you change "passing" to another table it should work the same way.

Would I want to use the value in between? Rushing & Receiving is named rushing_and_receiving, and it created a Python error for an invalid character.

@MCJNY1992 See my edit for how to parse the Rushing & Receiving table. The actual table is stored inside an HTML comment, so you have to parse it out of the comment.

I see, I'll look into that later. I switched to reading the rushing table instead of the passing table and added a try/except in scrape_player. Strangely, I still only get the quarterbacks matching the passing table in my dataframe. I can see it looping through the players on the rushing table, but somehow my dataframe ends up with only passing-table players.
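The comment trick mentioned above can be illustrated on its own. A minimal sketch with made-up HTML, assuming only that the table ships inside an HTML comment as it does on the player pages: the comment text must be re-parsed with BeautifulSoup before select() can find the table.

```python
from bs4 import BeautifulSoup, Comment

# Toy stand-in for a player page: the table is inside a comment,
# so a plain select() on the outer soup cannot see it.
html = """
<div id="all_rushing_and_receiving">
<!--
<table id="rushing_and_receiving">
  <tr><th>Year</th><td>2020</td></tr>
</table>
-->
</div>
"""
outer = BeautifulSoup(html, "html.parser")

# Find the comment node, then parse its text as a new document.
comment = outer.find(string=lambda t: isinstance(t, Comment))
inner = BeautifulSoup(comment, "html.parser")

row = inner.select_one("table#rushing_and_receiving tr")
print([c.text for c in row.select("th, td")])  # ['Year', '2020']
```

This is the same technique as the second script's find_next(text=...) line, just reduced to the smallest example that shows why the re-parse step is needed.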