Python: How do I exclude certain rows from a table with BeautifulSoup?
python, beautifulsoup, google-bigquery
The code below runs fine, but the table at the URL I am scraping repeats its header row throughout the table. I am not sure how to handle this and remove those rows. I am trying to import the data into BigQuery, and some of the characters in the headers are not allowed there.
import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'

chrome_options = Options()  # not shown in the original post; default options assumed
driver = webdriver.Chrome(options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

tables = soup.find_all('table', {"id": ["schedule"]})
table = tables[0]
# One list of cell texts per <tr>; this also picks up the repeated header rows
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]

json_string = ''
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    # One JSON object per line
    json_string += json.dumps(dict(zip(headers, row))) + '\n'

with open('example.json', 'w') as f:
    f.write(json_string)
print(json_string)
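You can pass class_=False when selecting the tr rows, which keeps only rows that have no class attribute, so the repeated header rows are skipped.
The code below builds a DataFrame from the table.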
from bs4 import BeautifulSoup
import requests
import pandas as pd

res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("div", {"id": "div_schedule"}).find("table")

# Column names come from the single header row in <thead>
columns = [i.get_text() for i in table.find("thead").find_all('th')]

data = []
# class_=False matches only <tr> tags without a class attribute,
# which leaves out the header rows repeated inside <tbody>
for tr in table.find('tbody').find_all('tr', class_=False):
    temp = [tr.find('th').get_text(strip=True)]  # the first cell of each row is a <th>
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)

df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS Attend. Notes
0 Sat, Aug 1, 2020 1:00p Miami Heat 125 Denver Nuggets 105 Box Score
1 Sat, Aug 1, 2020 3:30p Utah Jazz 94 Oklahoma City Thunder 110 Box Score
2 Sat, Aug 1, 2020 6:00p New Orleans Pelicans 103 Los Angeles Clippers 126 Box Score
3 Sat, Aug 1, 2020 7:00p Philadelphia 76ers 121 Indiana Pacers 127 Box Score
4 Sat, Aug 1, 2020 8:30p Los Angeles Lakers 92 Toronto Raptors 107 Box Score
.. ... ... ... ... ... ... ... .. ... ...
75 Thu, Aug 13, 2020 Portland Trail Blazers Brooklyn Nets
76 Fri, Aug 14, 2020 Philadelphia 76ers Houston Rockets
77 Fri, Aug 14, 2020 Miami Heat Indiana Pacers
78 Fri, Aug 14, 2020 Oklahoma City Thunder Los Angeles Clippers
79 Fri, Aug 14, 2020 Denver Nuggets Toronto Raptors
[80 rows x 10 columns]
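To see the filter in isolation, here is a minimal sketch that runs without any network access. The HTML snippet is a made-up miniature of the schedule table, and it assumes, as the answer's filter does, that the repeated header rows inside tbody carry a class attribute while the data rows do not. The function name extract_rows is only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the schedule table: the real header sits in
# <thead>, and a repeated header row inside <tbody> has class="thead".
SAMPLE_HTML = """
<table id="schedule">
  <thead><tr><th>Date</th><th>Visitor</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th>Sat, Aug 1, 2020</th><td>Miami Heat</td><td>125</td></tr>
    <tr class="thead"><th>Date</th><th>Visitor</th><th>PTS</th></tr>
    <tr><th>Sun, Aug 2, 2020</th><td>Utah Jazz</td><td>94</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return the cell texts of every <tr> in <tbody> that has no class attribute."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("table", id="schedule").find("tbody")
    rows = []
    for tr in body.find_all("tr", class_=False):
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Only the two data rows come back. If the page ever marks data rows with a class as well, this filter would drop them too, so it is worth checking the row count against the page.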
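To load the result into BigQuery, you can either insert the JSON directly or insert the DataFrame.

Comments:
Can you show me how to save this as JSON like in my example, and remove the incompatible characters from the headers?
Can you post the JSON structure you are looking for? I don't have selenium on my system.
Basically like this: json_string = '' headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]] for row in tab_data[2:]: json_string += json.dumps(dict(zip(headers, row))) + '\n' with open('example.json', 'w') as f: f.write(json_string). I want to remove the incompatible characters from the header.
@anthony Can you give an example? Not code, the actual JSON.
Well, BigQuery only accepts newline-delimited JSON, which means one complete JSON object per line. I don't know what it should look like!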
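For the header cleanup and the one-object-per-line layout asked about in the comments, a rough sketch using only the standard library could look like this. It assumes the usual BigQuery column-name rule (letters, digits and underscores only, not starting with a digit). The helper names sanitize_headers and to_ndjson are made up for illustration, and the sample headers and row are a shortened stand-in for the scraped table. Note that the table has two PTS columns, so the names also need to be made unique, otherwise dict(zip(...)) silently keeps only the last one.

```python
import json
import re


def sanitize_headers(raw_headers):
    """Turn raw column titles into unique names made of letters, digits and underscores."""
    seen = {}
    clean = []
    for position, raw in enumerate(raw_headers):
        name = raw.replace('%', 'pct')
        # Collapse every run of other characters into a single underscore
        name = re.sub(r'[^0-9A-Za-z_]+', '_', name).strip('_')
        if not name:
            name = f'col_{position}'  # blank header cell
        if name[0].isdigit():
            name = '_' + name  # names must not start with a digit
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Second and later duplicates get a numeric suffix, e.g. PTS_2
        clean.append(name if count == 0 else f'{name}_{count + 1}')
    return clean


def to_ndjson(headers, rows):
    """Serialize each row as one JSON object on its own line."""
    return ''.join(json.dumps(dict(zip(headers, row))) + '\n' for row in rows)


# Illustrative data only, shortened from the table in the question
raw_headers = ['Date', 'Start (ET)', 'Visitor/Neutral', 'PTS', 'Home/Neutral', 'PTS', '', 'Attend.']
rows = [['Sat, Aug 1, 2020', '1:00p', 'Miami Heat', '125', 'Denver Nuggets', '105', 'Box Score', '']]

headers = sanitize_headers(raw_headers)
ndjson = to_ndjson(headers, rows)
print(headers)
print(ndjson, end='')
```

Writing ndjson to a file gives the same kind of file as example.json in the question. If you already have the DataFrame from the answer, df.to_json(orient='records', lines=True) produces the same line-per-object layout once the column names are cleaned and made unique.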