Python: How do I exclude certain rows from a table using BeautifulSoup?


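The code runs fine, but the page I am scraping the table from repeats the header row throughout the table. I am not sure how to handle this and remove those rows. I need to, because I am trying to load the data into BigQuery and certain characters are not allowed.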

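import json

from bs4 import BeautifulSoup
from selenium import webdriver

URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'

# chrome_options was not shown in the question; a default ChromeOptions is assumed here
chrome_options = webdriver.ChromeOptions()
# Selenium 4 expects options= rather than the older chrome_options= keyword
driver = webdriver.Chrome(options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
tables = soup.find_all('table', {"id": ["schedule"]})

table = tables[0]
# One list of cell texts per <tr>, which still includes the repeated header rows
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]
json_string = ''
# Replace characters that are not allowed in the column names
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    json_string += json.dumps(dict(zip(headers, row))) + '\n'
with open('example.json', 'w') as f:
    f.write(json_string)

print(json_string)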
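You can select only the tr rows that have no class (class_=False), so the repeated header rows are not returned.

The code below creates a dataframe from the table.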

from bs4 import BeautifulSoup
import requests
import pandas as pd

res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")

soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("div", {"id":"div_schedule"}).find("table")
# Column names come from the single header row in <thead>
columns = [i.get_text() for i in table.find("thead").find_all('th')]

data = []

# class_=False keeps only <tr> tags without a class attribute,
# which skips the header rows repeated inside <tbody>
for tr in table.find('tbody').find_all('tr', class_=False):
    # The first cell of each data row is a <th> (the date); the rest are <td>
    temp = [tr.find('th').get_text(strip=True)]
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)

df = pd.DataFrame(data, columns = columns)

print(df)
Output:

                Date Start (ET)         Visitor/Neutral  PTS           Home/Neutral  PTS               Attend. Notes
0    Sat, Aug 1, 2020      1:00p              Miami Heat  125         Denver Nuggets  105  Box Score
1    Sat, Aug 1, 2020      3:30p               Utah Jazz   94  Oklahoma City Thunder  110  Box Score
2    Sat, Aug 1, 2020      6:00p    New Orleans Pelicans  103   Los Angeles Clippers  126  Box Score
3    Sat, Aug 1, 2020      7:00p      Philadelphia 76ers  121         Indiana Pacers  127  Box Score
4    Sat, Aug 1, 2020      8:30p      Los Angeles Lakers   92        Toronto Raptors  107  Box Score
..                ...        ...                     ...  ...                    ...  ...        ... ..     ...   ...
75  Thu, Aug 13, 2020             Portland Trail Blazers               Brooklyn Nets
76  Fri, Aug 14, 2020                 Philadelphia 76ers             Houston Rockets
77  Fri, Aug 14, 2020                         Miami Heat              Indiana Pacers
78  Fri, Aug 14, 2020              Oklahoma City Thunder        Los Angeles Clippers
79  Fri, Aug 14, 2020                     Denver Nuggets             Toronto Raptors

[80 rows x 10 columns]
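
The class_=False filter used in the loop above can be checked offline on a small hand-written table. This is only a minimal sketch: the HTML is made up for illustration, and it assumes the repeated header rows carry a class attribute (written here as class="thead") while the data rows have none.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the schedule table: one real header in <thead>,
# plus a repeated header row inside <tbody> that is marked with a class.
html = """
<table id="schedule">
  <thead><tr><th>Date</th><th>Visitor/Neutral</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th>Sat, Aug 1, 2020</th><td>Miami Heat</td><td>125</td></tr>
    <tr class="thead"><th>Date</th><th>Visitor/Neutral</th><th>PTS</th></tr>
    <tr><th>Sat, Aug 1, 2020</th><td>Utah Jazz</td><td>94</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("table", {"id": "schedule"}).find("tbody")

# Every <tr> in the body, repeated header included
all_rows = tbody.find_all("tr")
# Only the <tr> tags that have no class attribute at all
data_rows = tbody.find_all("tr", class_=False)

rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in data_rows]

print(len(all_rows), len(data_rows))
for row in rows:
    print(row)
```

If the data rows on a real page do carry a class of their own, this filter would drop them too; in that case, testing for the header class explicitly is the safer variant.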

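To load the result into BigQuery, you can either insert the JSON directly or insert the dataframe.

Can you show me how to save it as JSON like in my example, and remove the incompatible characters from the headers?

Can you post the JSON structure you are looking for? I don't have selenium on my system.

Basically like this: json_string = '' headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]] for row in tab_data[2:]: json_string += json.dumps(dict(zip(headers, row))) + '\n' with open('example.json', 'w') as f: f.write(json_string). I want to remove the incompatible characters from the header.

@anthony Can you give an example? Not code, the actual JSON.

Well, BigQuery only accepts newline-delimited JSON, which means one complete JSON object per line. I don't know what it should look like!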
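
For reference, the newline-delimited JSON described in the last comment could be produced along these lines. This is a sketch, not a confirmed solution from the thread: clean_headers and to_ndjson are hypothetical helper names, and the rule that column names may contain only letters, digits and underscores is an assumption about BigQuery's classic naming rules. The sample row is taken from the output shown above. The helper also renames duplicate and empty headers, because the table has two PTS columns and unnamed columns that would otherwise overwrite each other in a dict.

```python
import json
import re


def clean_headers(headers):
    """Return unique column names made of letters, digits and underscores.

    Assumption: these are the characters BigQuery accepts in column names.
    """
    seen = {}
    cleaned = []
    for i, name in enumerate(headers):
        name = name.replace("%", "pct")
        # Collapse every run of other characters into a single underscore
        name = re.sub(r"[^0-9A-Za-z_]+", "_", name).strip("_")
        if not name:
            name = f"col_{i}"      # placeholder for an empty header cell
        elif name[0].isdigit():
            name = "_" + name      # avoid a leading digit
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Second and later occurrences get a numeric suffix: PTS, PTS_2, ...
        cleaned.append(name if count == 0 else f"{name}_{count + 1}")
    return cleaned


def to_ndjson(headers, rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    keys = clean_headers(headers)
    return "".join(json.dumps(dict(zip(keys, row))) + "\n" for row in rows)


headers = ["Date", "Start (ET)", "Visitor/Neutral", "PTS",
           "Home/Neutral", "PTS", "", "Attend."]
rows = [["Sat, Aug 1, 2020", "1:00p", "Miami Heat", "125",
         "Denver Nuggets", "105", "Box Score", ""]]

ndjson = to_ndjson(headers, rows)
print(clean_headers(headers))
print(ndjson, end="")
```

With the dataframe from the answer, the same helpers could be called as to_ndjson(list(df.columns), df.values.tolist()) and the result written to example.json.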