Python: scraping a tennis results table, including the tournament for each match row


I want to get the match results from this page: https://www.tennisexplorer.com/player/paire-4a33b/

From the collected results I want to build a table with the following columns: tournament, date, match_player_1, match_player_2, round, score. I wrote code that works, but I don't know how to attach the tournament to each match row.

import requests
from bs4 import BeautifulSoup

u = 'https://www.tennisexplorer.com/player/paire-4a33b/'
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(u, timeout=120, headers=headers)
# print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

# iterate over the rows of the 2020 matches table
for tr in soup.select('#matches-2020-1-data tr'):
    match_date = tr.select_one('td:nth-of-type(1)').get_text(strip=True)
    match_surface = tr.select_one('td:nth-of-type(2)').get_text(strip=True)
    match = tr.select_one('td:nth-of-type(3)').get_text(strip=True)
#...
I need to create a table like this:

tournament                      date    match_player_1  match_player_2  round   score
Cincinnati Masters (New York)   22.08.  Coric B.        Paire B.        1R      6-0, 1-0
Ultimate Tennis Showdown 2      01.08.  Moutet C.       Paire B.        NaN     15-0, 15-0, 15-0, 15-0

How can I associate the tournament with each match row to get the desired dataframe?

You can do the following:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.tennisexplorer.com/player/paire-4a33b/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
# keep only match rows; the column-header rows contain <th> cells
for row in soup.select('#matches-2020-1-data tr:not(:has(th))'):
    tds = [td.get_text(strip=True, separator=' ') for td in row.select('td')]
    all_data.append({
        # the nearest preceding "head flags" row holds the tournament name
        'tournament': row.find_previous('tr', class_='head flags').find('td').get_text(strip=True),
        'date': tds[0],
        'match_player_1': tds[2].split('-')[0].strip(),
        'match_player_2': tds[2].split('-')[-1].strip(),
        'round': tds[3],
        'score': tds[4]
    })

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
This saves data.csv (screenshot from LibreOffice omitted).
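The key point is that each block of match rows is preceded by a header row (class head flags) that carries the tournament name, which is why find_previous works. An equivalent way to express the same idea, sketched below under the assumption that the #matches-2020-1-data layout described above holds, is to walk the rows in order and keep the current tournament in a loop variable:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.tennisexplorer.com/player/paire-4a33b/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
tournament = None
for row in soup.select('#matches-2020-1-data tr'):
    if 'head' in (row.get('class') or []):
        # tournament header row: remember its name for the match rows below it
        tournament = row.find('td').get_text(strip=True)
        continue
    if row.find('th') or tournament is None:
        # column-header rows carry <th> cells and hold no match data
        continue
    tds = [td.get_text(strip=True, separator=' ') for td in row.select('td')]
    all_data.append({
        'tournament': tournament,
        'date': tds[0],
        'match_player_1': tds[2].split('-')[0].strip(),
        'match_player_2': tds[2].split('-')[-1].strip(),
        'round': tds[3],
        'score': tds[4]
    })

df = pd.DataFrame(all_data)

Both variants produce the same dataframe; find_previous is just more compact.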

Try this:

import pandas as pd

url = "https://www.tennisexplorer.com/player/paire-4a33b/"

# the matches table (index found by inspecting the list returned by pd.read_html)
df = pd.read_html(url)[8]
new_data = {"tournament": [], "date": [], "match_player_1": [], "match_player_2": [],
            "round": [], "score": []}
tourn = None
for index, row in df.iterrows():
    try:
        # match rows start with a date such as "22.08." - if the first cell
        # parses as a number, treat the row as a match and attach the current tournament
        float(row.iloc[0][:-1])
        new_data["tournament"].append(tourn)
        new_data["date"].append(row.iloc[0])
        new_data["match_player_1"].append(row.iloc[2].split("-")[0])
        new_data["match_player_2"].append(row.iloc[2].split("-")[1])
        new_data["round"].append(row.iloc[3])
        new_data["score"].append(row.iloc[4])

    except Exception:
        # otherwise the row is a tournament header - remember its name
        tourn = row.iloc[0]

data = pd.DataFrame(new_data)
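For a quick check of the result you can preview and save the dataframe; the file name below is just illustrative:

print(data.head())                        # preview the first few match rows
data.to_csv('matches.csv', index=False)   # save without the index column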