Python: how to merge two DataFrames without duplicating the elements they share
I'm scraping some NBA data with Python. I have the following script:
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup as bs

def scrape_data():
    # Page to scrape
    url = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
    html = urlopen(url)
    soup = bs(html, 'html.parser')
    # Column names come from the first header row; drop its first cell
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    headers = headers[1:]
    # Every row after the header; the <td> cells of a row hold one player's stats
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]
    stats = pd.DataFrame(player_stats, columns=headers)
    return stats
which returns this:
Player Pos Age Tm G ... OBPM DBPM BPM VORP
0 Steven Adams C 26 OKC 43 ... 1.6 3.3 4.9 2.0
1 Bam Adebayo PF 22 MIA 47 ... 1.2 3.8 5.0 2.8
2 LaMarcus Aldridge C 34 SAS 43 ... 1.7 0.6 2.4 1.6
3 Nickeil Alexander-Walker SG 21 NOP 38 ... -3.4 -2.3 -5.6 -0.4
4 Grayson Allen SG 24 MEM 30 ... -0.7 -2.8 -3.5 -0.2
.. ... .. .. ... .. ... .. ... ... ... ...
537 Thaddeus Young PF 31 CHI 49 ... -2.2 0.9 -1.3 0.2
538 Trae Young PG 21 ATL 44 ... 7.8 -2.3 5.5 2.9
539 Cody Zeller C 27 CHO 45 ... 0.0 -0.6 -0.6 0.4
540 Ante Žižić C 23 CLE 16 ... -2.3 -1.4 -3.6 -0.1
541 Ivica Zubac C 22 LAC 48 ... 0.4 2.3 2.7 1.0
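I want to scrape a second URL, where the table is formatted exactly like this one, and append the player stats from that table to this one, if that makes sense. The problem is that some of the stats on the second URL appear in both tables. When I "merge" the two tables, I don't want to add those again. How can I do this?

I think you should use drop_duplicates(). Here is a simplified example: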
import pandas as pd
df = pd.DataFrame([["foo", "bar"],["foo2", "bar2"],["foo3", "bar3"]], columns=["first_column", "second_column"])
df2 = pd.DataFrame([["foo3", "bar4"],["foo4", "bar5"],["foo5", "bar6"]], columns=["first_column", "second_column"])
print(pd.concat([df, df2], ignore_index=True).drop_duplicates(subset="first_column"))
Output:
first_column second_column
0 foo bar
1 foo2 bar2
2 foo3 bar3
4 foo4 bar5
5 foo5 bar6
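As you can see, the "foo3" row from the second DataFrame is filtered out because it is already present in the first one.
In your case, you could use something like this: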
pd.concat([stats, stats2], ignore_index=True).drop_duplicates(subset="Player")
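If a player name can appear on more than one row (for example, once per team), deduplicating on "Player" alone would also discard rows you want to keep. A minimal sketch of the same idea with a composite key follows; the toy rows and the choice of "Player" plus "Tm" as the key are assumptions for illustration, not data from the site.

```python
import pandas as pd

# Toy stand-ins for the two scraped tables (all values are made up).
stats = pd.DataFrame(
    [["A. Smith", "OKC", 43], ["A. Smith", "MIA", 10], ["B. Jones", "SAS", 40]],
    columns=["Player", "Tm", "G"],
)
stats2 = pd.DataFrame(
    [["A. Smith", "OKC", 43], ["C. Brown", "NOP", 38]],
    columns=["Player", "Tm", "G"],
)

# Stack the tables, then keep the first occurrence of each (Player, Tm) pair.
combined = (
    pd.concat([stats, stats2], ignore_index=True)
    .drop_duplicates(subset=["Player", "Tm"])
    .reset_index(drop=True)
)
print(combined)
```

Only the repeated ("A. Smith", "OKC") row is removed; the "A. Smith" row for the other team survives because its key differs.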
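You are doing a lot of work to turn the HTML table tags into a table. Let pandas do that for you (pd.read_html parses the HTML itself, using lxml or BeautifulSoup under the hood). Then, to do the merge, there are two approaches:
1) Reduce one of the DataFrames to only the columns the other one does not contain (but keep the columns you are merging on).
2) Drop from the second DataFrame the columns that are already in the first (again, make sure you don't drop the columns you are merging on).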
import pandas as pd

def scrape_data(url):
    # read_html returns a list of DataFrames; the stats table is the first one
    stats = pd.read_html(url)[0]
    return stats

df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
# The header row is repeated inside the table body; drop those rows
df1 = df1[df1['Rk'] != 'Rk']
df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']

# Columns that exist only in df2
uniqueCols = [ col for col in df2.columns if col not in df1.columns ]
# The line below selects the same columns (returned in sorted order)
#uniqueCols = list(df2.columns.difference(df1.columns))

# Keep the df2-only columns plus the merge keys, then merge
df2 = df2[uniqueCols + ['Player', 'Tm']]
df = df1.merge(df2, how='left', on=['Player', 'Tm'])
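To try the column-filtering idea without hitting the network, here is a small sketch on hand-made DataFrames. The column names "G", "BPM", and "PTS" and all values are placeholders chosen for illustration; only the technique mirrors the code above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up).
df1 = pd.DataFrame(
    {"Player": ["A. Smith", "B. Jones"], "Tm": ["OKC", "MIA"], "G": [43, 47], "BPM": [4.9, 5.0]}
)
df2 = pd.DataFrame(
    {"Player": ["A. Smith", "B. Jones"], "Tm": ["OKC", "MIA"], "G": [43, 47], "PTS": [20.1, 25.3]}
)
keys = ["Player", "Tm"]

# Merging as-is duplicates the shared non-key column with _x/_y suffixes.
naive = df1.merge(df2, how="left", on=keys)
print(naive.columns.tolist())

# Keeping only the keys plus the df2-only columns avoids the duplicates.
unique_cols = [col for col in df2.columns if col not in df1.columns]
merged = df1.merge(df2[keys + unique_cols], how="left", on=keys)
print(merged.columns.tolist())
```

The first merge carries both "G_x" and "G_y"; the second carries a single "G" column alongside the new "PTS" column.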
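When you say some of the stats are on both tables, do you mean the two tables contain exactly the same rows (i.e. identical in every column), or that they share the same unique key (perhaps the Player column)?

@cron The same unique key. For example, both tables show a player's PPG, and I don't want it to show up twice.

So that means dropping from the second DataFrame the columns that are already in the first. You still need at least one common column to merge on, though (probably the player name, but be careful if there are players with the same name; you may need to merge on additional keys/columns).

Or: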
import pandas as pd

def scrape_data(url):
    # read_html returns a list of DataFrames; the stats table is the first one
    stats = pd.read_html(url)[0]
    return stats

df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
# The header row is repeated inside the table body; drop those rows
df1 = df1[df1['Rk'] != 'Rk']
df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']

# Columns present in both tables, excluding the merge keys
dropCols = [ col for col in df1.columns if col in df2.columns and col not in ['Player','Tm']]
df2 = df2.drop(dropCols, axis=1)
df = df1.merge(df2, how='left', on=['Player', 'Tm'])
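Since the merge relies on the key columns identifying each row, it can be worth checking that before merging. The sketch below is one way to do it, on made-up data; `has_duplicate_keys` is a hypothetical helper name, and whether "Player" plus "Tm" is actually unique in the real tables is an assumption you would need to verify.

```python
import pandas as pd

def has_duplicate_keys(df, keys):
    """Return True if any combination of the key columns occurs more than once."""
    return bool(df.duplicated(subset=keys).any())

# Made-up rows: two different players who share a name but not a team.
left = pd.DataFrame(
    {"Player": ["J. Doe", "J. Doe", "B. Jones"], "Tm": ["OKC", "MIA", "SAS"], "BPM": [1.0, 2.0, 3.0]}
)
right = pd.DataFrame(
    {"Player": ["J. Doe", "J. Doe", "B. Jones"], "Tm": ["OKC", "MIA", "SAS"], "PTS": [10.0, 20.0, 30.0]}
)

print(has_duplicate_keys(left, ["Player"]))        # the name alone repeats
print(has_duplicate_keys(left, ["Player", "Tm"]))  # the composite key does not

# validate="one_to_one" makes merge raise pandas.errors.MergeError
# if the keys are not unique on both sides.
merged = left.merge(right, how="left", on=["Player", "Tm"], validate="one_to_one")
print(merged)
```

With a unique composite key the merged result has the same number of rows as `left`; merging on the name alone would instead pair every "J. Doe" row with every other "J. Doe" row.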