Python: scraping football data from an HTML table


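I need to extract odds data from the HTML table on this website (the URL is in the code below).

I want to extract the odds for every match. The problem is that each match spans 2 rows (opening and closing odds).

I wrote this code, but it returns an empty DataFrame: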

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import pandas as pd
import copy
import numpy as np
import time

results = []


d = webdriver.Chrome(executable_path = r'C:\chromedriver.exe')

u = "http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1"

d.get(u)
WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main > div.pl_right > table")))

soup = bs(d.page_source, 'lxml')
rows = soup.select('#main > div.pl_right > table')

headers = ['Comp', 'Time', 'Match' ,'Odds', 'H','D', 'A', 'Res']
i = 1
for row in rows[1:]:    
    cols = [td.text for td in row.select('td')]

    if (i % 2 == 1):
        record = {'Comp' : cols[0],
                  'Time' : cols[1],
                  'Match' : ' v '.join([cols[2], cols[10]]),
                  'Odds' : 'op',
                  'H' : cols[3],
                  'D' : cols[4],
                  'A' : cols[5],
                  'Res' : cols[11]}
    else:
        record['Odds'] = 'cl'
        record['H'] = cols[0] 
        record['D'] = cols[1] 
        record['A'] = cols[2]
    results.append(copy.deepcopy(record))
    i+=1

df = pd.DataFrame(results, columns = headers)
d.quit()

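The bug is in the bs4 CSS selector for the table. It selects the table itself rather than its rows, so use:

soup.select('#main > div.pl_right > table > tbody > tr')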
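
To illustrate why the original selector yields an empty DataFrame, here is a minimal sketch on a small inline HTML fragment. The markup and values are invented for illustration and only mimic the `#main > div.pl_right > table` nesting used in the question; the real page may differ.

```python
from bs4 import BeautifulSoup

# Invented fragment that mirrors the nesting targeted by the selectors.
html = """
<div id="main"><div class="pl_right">
<table><tbody>
<tr><th>League</th><th>HW</th></tr>
<tr><td>KFAC</td><td>1.94</td></tr>
<tr><td></td><td>1.90</td></tr>
</tbody></table>
</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Selecting the table gives a one-element list, so [1:] is empty.
tables = soup.select("#main > div.pl_right > table")
# Selecting the rows gives one element per <tr>.
rows = soup.select("#main > div.pl_right > table > tbody > tr")

print(len(tables), len(tables[1:]))  # 1 0
print(len(rows), len(rows[1:]))      # 3 2
```

With the table-level selector, `for row in rows[1:]` never runs, which is why `results` stays empty. Note that the `tbody` step only matches if the parsed markup actually contains a `<tbody>` element, as browser-rendered `page_source` normally does.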

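I went through your code and found that a number of cases/conditions are not handled.


For example, you do not handle the date tag.

This script extracts the table and puts the information into a list: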

import re
import requests
from bs4 import BeautifulSoup


url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

all_data = []
for tr in soup.select('.schedule tr[id^="tr"]')[::2]:
    row1 = [td.get_text(strip=True) for td in tr.select('td')]
    row2 = [td.get_text(strip=True) for td in tr.find_next('tr').select('td')]
    # extract the last three showtime(...) arguments (hour, minute, second) from the <script> tag:
    row1[1] = re.findall(r'\d+,\d+,\d+(?=\))', tr.select('td')[1].script.contents[0])[0]

    row1 = row1[:3] + row1[3:10] + row2 + row1[10:-1]
    all_data.append(row1)

# print on screen:
from pprint import pprint
pprint(all_data, width=250)
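
If a full timestamp is wanted rather than the raw digits, the `showtime(...)` text in the Time cell can be parsed into a `datetime`. This is a minimal sketch: `parse_showtime` is a hypothetical helper, and it assumes the arguments are year, month written as `MM-1` (a JavaScript zero-based month expression), day, hour, minute, second, as in the string `showtime(2020,06-1,06,04,00,00)`. The time zone is not stated on the page, so the result is left naive.

```python
import re
from datetime import datetime

# Assumed layout: showtime(YYYY,MM-1,DD,hh,mm,ss)
SHOWTIME_RE = re.compile(r"showtime\((\d+),(\d+)-1,(\d+),(\d+),(\d+),(\d+)\)")


def parse_showtime(text):
    """Return a naive datetime for a showtime(...) string, or None if it does not match."""
    match = SHOWTIME_RE.search(text)
    if match is None:
        return None
    year, month, day, hour, minute, second = map(int, match.groups())
    # The "-1" belongs to the JavaScript expression; the number before it is the calendar month.
    return datetime(year, month, day, hour, minute, second)


kickoff = parse_showtime("showtime(2020,06-1,06,04,00,00)")
print(kickoff)  # 2020-06-06 04:00:00
```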

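That is a lot of work when pandas can parse the table for you with .read_html(). It uses BeautifulSoup under the hood.

Also, I am assuming the opening odds are in the first row and the closing odds in the second row of each pair. So it is just a matter of slicing by even/odd index values: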

import pandas as pd
import requests

url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]

evenRows = list(df.index)[::2]
oddRows = list(df.index)[1::2]

open_df = df.take(evenRows)
close_df = df.take(oddRows)

Output:

print (open_df.head(10).to_string())
    League                             Time              Home    HW     D     AW     HWR      DR     AWR  Return                    Away Score
0     KFAC  showtime(2020,06-1,06,04,00,00)  Gyeongju Citizen  1.94  3.48   3.57  47.60%  26.54%  25.87%  92.34%      Pyeongtaek Citizen   0-0
2     KFAC  showtime(2020,06-1,06,05,00,00)   Paju Citizen FC  2.87  3.09   2.43  32.16%  29.87%  37.98%  92.30%                Gimpo FC   2-2
4     KFAC  showtime(2020,06-1,06,06,00,00)         FC Anyang  1.26  5.18   9.07  72.35%  17.60%  10.05%  91.16%               Goyang FC   2-0
6     KFAC  showtime(2020,06-1,06,06,00,00)       Jeju United  1.09  8.40  20.12  84.46%  10.96%   4.58%  92.06%                 Songwol   4-0
8     KFAC  showtime(2020,06-1,06,06,00,00)   Jeonnam Dragons  1.14  6.76  14.22  80.08%  13.50%   6.42%  91.29%         Chungju Citizen   2-0
10    KFAC  showtime(2020,06-1,06,07,00,00)       Hwaseong FC  2.71  3.14   2.53  34.08%  29.41%  36.51%  92.36%          Daejeon Korail   2-2
12    KFAC  showtime(2020,06-1,06,07,00,00)        Suwon City  1.13  7.00  14.95  80.84%  13.05%   6.11%  91.35%             Hyochang FC  10-0
14  KOR D1  showtime(2020,06-1,06,07,30,00)          FC Seoul  4.24  3.39   1.95  22.60%  28.26%  49.14%  95.82%  Jeonbuk Hyundai Motors   1-4
16  INT CF  showtime(2020,06-1,06,08,00,00)   Bohemians1905 B  1.96  4.27   3.27  48.58%  22.30%  29.12%  95.22%         Slavia Prague B   0-5
18  INT CF  showtime(2020,06-1,06,08,00,00)             Sepsi  2.03  3.24   3.21  44.27%  27.74%  28.00%  89.87%      Chindia Targoviste   2-1
....
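
The two slices can also be joined side by side so that each match becomes a single row with opening and closing odds. The following is a minimal sketch on invented sample data (team names and odds are made up); it assumes rows strictly alternate opening/closing and reuses the `HW`, `D`, `AW` column names from the output above.

```python
import pandas as pd

# Invented sample rows standing in for the scraped table (not real odds).
df = pd.DataFrame(
    {
        "Home": ["Team A", "Team A", "Team B", "Team B"],
        "HW": [1.94, 1.90, 2.87, 2.95],
        "D": [3.48, 3.50, 3.09, 3.05],
        "AW": [3.57, 3.70, 2.43, 2.40],
    }
)

odds_cols = ["HW", "D", "AW"]

# Even positions hold opening odds, odd positions hold closing odds (assumption).
open_df = df.iloc[::2].reset_index(drop=True)
close_df = df.iloc[1::2].reset_index(drop=True)

# One row per match: match info plus *_op and *_cl odds columns.
wide = open_df.rename(columns={c: f"{c}_op" for c in odds_cols}).join(
    close_df[odds_cols].add_suffix("_cl")
)
print(wide.to_string())
```

Resetting the index on both halves is what lets `join` align the n-th opening row with the n-th closing row.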

Or, since it looks like you want to keep the full table and label each row 'op' or 'cl', that only takes a small change to the code:

import pandas as pd
import requests

url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]
df = df.drop(['Compare'],axis=1)
df['Odds'] = 'op'
df.loc[1::2,'Odds'] = 'cl'
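
One caveat worth sketching: `df.loc[1::2, 'Odds']` slices by index label, which works here because `read_html` returns a default 0..n-1 index. If rows have been filtered or re-indexed first, a position-based assignment with `iloc` may be the safer choice. The example below uses invented data and a deliberately non-default index to show this.

```python
import pandas as pd

# Invented rows with a non-default index, as might remain after filtering.
df = pd.DataFrame(
    {"Home": ["Team A", "Team A", "Team B", "Team B"], "HW": [1.94, 1.90, 2.87, 2.95]},
    index=[10, 11, 20, 21],
)

df["Odds"] = "op"
# iloc addresses rows by position, so every second row is labelled 'cl'
# regardless of what the index labels are.
df.iloc[1::2, df.columns.get_loc("Odds")] = "cl"

print(df["Odds"].tolist())  # ['op', 'cl', 'op', 'cl']
```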

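Can you help me?

@KhaledKoubaa I suggest you follow this [link] and apply some Python pdb debugging in your script, stepping through all of the conditions.

Is there a specific problem? Have you done any debugging?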