Python: scraping football data from an HTML table
I need to extract the odds data from an HTML table on a website (the URL appears in the code below). I want the odds for every match. The problem is that each match occupies 2 rows (opening and closing odds). I wrote this code, but it returns an empty DataFrame:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import pandas as pd
import copy
import numpy as np
import time
results = []
d = webdriver.Chrome(executable_path = r'C:\chromedriver.exe')
u = "http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1"
d.get(u)
WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main > div.pl_right > table")))
soup = bs(d.page_source, 'lxml')
rows = soup.select('#main > div.pl_right > table')
headers = ['Comp', 'Time', 'Match' ,'Odds', 'H','D', 'A', 'Res']
i = 1
for row in rows[1:]:
    cols = [td.text for td in row.select('td')]
    if (i % 2 == 1):
        record = {'Comp' : cols[0],
                  'Time' : cols[1],
                  'Match' : ' v '.join([cols[2], cols[10]]),
                  'Odds' : 'op',
                  'H' : cols[3],
                  'D' : cols[4],
                  'A' : cols[5],
                  'Res' : cols[11]}
    else:
        record['Odds'] = 'cl'
        record['H'] = cols[0]
        record['D'] = cols[1]
        record['A'] = cols[2]
    results.append(copy.deepcopy(record))
    i+=1
df = pd.DataFrame(results, columns = headers)
d.quit()
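The error is in the bs4 CSS selector for the table: it matches the table element itself rather than its rows, so rows[1:] is empty. Select the rows instead:
soup.select('#main > div.pl_right > table > tbody > tr')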
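To see why the original selector yields an empty DataFrame, here is a minimal sketch on a hand-written HTML fragment. The fragment and its cell values are made up for illustration and only mimic the nesting implied by the selector; the real page may differ. Note that `html.parser` does not insert a missing `tbody`, so the fragment includes one explicitly, as a browser-rendered `page_source` normally would.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the nesting used in the selector.
html = """
<div id="main"><div class="pl_right">
<table><tbody>
<tr><th>League</th><th>HW</th></tr>
<tr><td>KFAC</td><td>2.00</td></tr>
<tr><td>KFAC</td><td>2.10</td></tr>
</tbody></table>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Original selector: matches the single <table>, so [1:] leaves nothing to loop over.
tables = soup.select("#main > div.pl_right > table")
print(len(tables), len(tables[1:]))

# Corrected selector: matches every <tr>, so [1:] skips only the header row.
rows = soup.select("#main > div.pl_right > table > tbody > tr")
print(len(rows), len(rows[1:]))
```

With the corrected selector, the loop in the question receives one element per table row instead of nothing.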
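I looked at your code and found that many cases are not handled. For example, you do not handle the date tag. This script extracts the table and puts the information into a list: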
import re
import requests
from bs4 import BeautifulSoup
url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
all_data = []
for tr in soup.select('.schedule tr[id^="tr"]')[::2]:
    row1 = [td.get_text(strip=True) for td in tr.select('td')]
    row2 = [td.get_text(strip=True) for td in tr.find_next('tr').select('td')]
    # extract the date from the <script> tag:
    row1[1] = re.findall(r'\d+,\d+,\d+(?=\))', tr.select('td')[1].script.contents[0])[0]
    row1 = row1[:3] + row1[3:10] + row2 + row1[10:-1]
    all_data.append(row1)
# print on screen:
from pprint import pprint
pprint(all_data, width=250)
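That is a lot of work when pandas can parse the table for you with .read_html(). It relies on an HTML parser such as lxml or BeautifulSoup under the hood.
Also, I am assuming that the opening odds are in the first row of each pair and the closing odds in the second. So it is just a matter of slicing by even/odd index values: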
import pandas as pd
import requests
url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]
evenRows = list(df.index)[::2]
oddRows = list(df.index)[1::2]
open_df = df.take(evenRows)
close_df = df.take(oddRows)
Output:
print (open_df.head(10).to_string())
    League  Time                             Home              HW    D     AW     HWR     DR      AWR     Return  Away                    Score
0   KFAC    showtime(2020,06-1,06,04,00,00)  Gyeongju Citizen  1.94  3.48  3.57   47.60%  26.54%  25.87%  92.34%  Pyeongtaek Citizen      0-0
2   KFAC    showtime(2020,06-1,06,05,00,00)  Paju Citizen FC   2.87  3.09  2.43   32.16%  29.87%  37.98%  92.30%  Gimpo FC                2-2
4   KFAC    showtime(2020,06-1,06,06,00,00)  FC Anyang         1.26  5.18  9.07   72.35%  17.60%  10.05%  91.16%  Goyang FC               2-0
6   KFAC    showtime(2020,06-1,06,06,00,00)  Jeju United       1.09  8.40  20.12  84.46%  10.96%  4.58%   92.06%  Songwol                 4-0
8   KFAC    showtime(2020,06-1,06,06,00,00)  Jeonnam Dragons   1.14  6.76  14.22  80.08%  13.50%  6.42%   91.29%  Chungju Citizen         2-0
10  KFAC    showtime(2020,06-1,06,07,00,00)  Hwaseong FC       2.71  3.14  2.53   34.08%  29.41%  36.51%  92.36%  Daejeon Korail          2-2
12  KFAC    showtime(2020,06-1,06,07,00,00)  Suwon City        1.13  7.00  14.95  80.84%  13.05%  6.11%   91.35%  Hyochang FC             10-0
14  KOR D1  showtime(2020,06-1,06,07,30,00)  FC Seoul          4.24  3.39  1.95   22.60%  28.26%  49.14%  95.82%  Jeonbuk Hyundai Motors  1-4
16  INT CF  showtime(2020,06-1,06,08,00,00)  Bohemians1905 B   1.96  4.27  3.27   48.58%  22.30%  29.12%  95.22%  Slavia Prague B         0-5
18  INT CF  showtime(2020,06-1,06,08,00,00)  Sepsi             2.03  3.24  3.21   44.27%  27.74%  28.00%  89.87%  Chindia Targoviste      2-1
....
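The Time column in this output still holds the raw `showtime(...)` call text from the page rather than a timestamp. Below is a minimal sketch of one way to convert it, assuming the `06-1` part is a JavaScript zero-based month expression, so the calendar month is the number before `-1` (this matches `matchdate=2020-06-06` in the URL). The helper name `parse_showtime` is hypothetical, and the result is a naive datetime because the source does not state the time zone.

```python
import re
from datetime import datetime

# Matches strings such as "showtime(2020,06-1,06,04,00,00)".
SHOWTIME_RE = re.compile(r"showtime\((\d+),(\d+)-1,(\d+),(\d+),(\d+),(\d+)\)")


def parse_showtime(text):
    """Convert 'showtime(Y,MM-1,D,h,m,s)' into a naive datetime.

    Assumption: 'MM-1' is a zero-based JavaScript month expression,
    so MM itself is the calendar month.
    """
    match = SHOWTIME_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised showtime string: {text!r}")
    year, month, day, hour, minute, second = map(int, match.groups())
    return datetime(year, month, day, hour, minute, second)


print(parse_showtime("showtime(2020,06-1,06,04,00,00)"))  # 2020-06-06 04:00:00
```

With a DataFrame like the one above, something along the lines of `open_df['Time'].map(parse_showtime)` would apply the conversion to the whole column.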
Alternatively, if you want to keep the whole table and label the rows 'op' and 'cl', it only takes a small change to the code:
import pandas as pd
import requests
url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]
df = df.drop(['Compare'],axis=1)
df['Odds'] = 'op'
df.loc[1::2,'Odds'] = 'cl'
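If one row per match is wanted instead of two, the labelled frame can be reshaped. The sketch below runs on a small toy DataFrame with placeholder team names and made-up odds, not on the scraped data, and assumes the rows strictly alternate opening/closing as described above. The `match_id` column is a helper introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the scraped table: two rows per match (made-up numbers).
df = pd.DataFrame({
    "Home": ["Team A", "Team A", "Team B", "Team B"],
    "HW": [2.00, 2.10, 3.00, 2.90],
    "D": [3.50, 3.40, 3.10, 3.20],
    "AW": [3.60, 3.30, 2.40, 2.50],
})
df["Odds"] = "op"
df.loc[1::2, "Odds"] = "cl"

# Rows 0-1 belong to match 0, rows 2-3 to match 1, and so on.
df["match_id"] = df.index // 2

# Spread the op/cl rows into columns such as HW_op and HW_cl.
wide = df.pivot(index="match_id", columns="Odds", values=["HW", "D", "AW"])
wide.columns = [f"{col}_{odds}" for col, odds in wide.columns]

# Re-attach the team name taken from the opening row of each pair.
wide.insert(0, "Home", df.loc[::2, "Home"].to_numpy())
print(wide)
```

The same pattern should carry over to the real table as long as every match contributes exactly one opening and one closing row.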
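Can you help me?
@KhaledKoubaa I suggest you follow this [link] and apply some Python pdb debugging to your script, stepping through every condition.
Is there a specific problem? Have you done any debugging? Please see.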