How to merge two DataFrames without duplicating their shared elements (Python, pandas, web scraping)

I am scraping some NBA data with Python. I have the following script:

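from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup as bs


def scrape_data():
    # Page holding the advanced-stats table
    url = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
    html = urlopen(url)
    soup = bs(html, 'html.parser')
    # Column names come from the first <tr>; the first header is skipped
    # because the data rows below are built from <td> cells only
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    headers = headers[1:]
    # Every <tr> after the first is treated as a data row
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]
    stats = pd.DataFrame(player_stats, columns=headers)
    return stats

which returns this: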

                       Player Pos Age   Tm   G  ...     OBPM  DBPM   BPM  VORP
0                Steven Adams   C  26  OKC  43  ...      1.6   3.3   4.9   2.0
1                 Bam Adebayo  PF  22  MIA  47  ...      1.2   3.8   5.0   2.8
2           LaMarcus Aldridge   C  34  SAS  43  ...      1.7   0.6   2.4   1.6
3    Nickeil Alexander-Walker  SG  21  NOP  38  ...     -3.4  -2.3  -5.6  -0.4
4               Grayson Allen  SG  24  MEM  30  ...     -0.7  -2.8  -3.5  -0.2
..                        ...  ..  ..  ...  ..  ... ..   ...   ...   ...   ...
537            Thaddeus Young  PF  31  CHI  49  ...     -2.2   0.9  -1.3   0.2
538                Trae Young  PG  21  ATL  44  ...      7.8  -2.3   5.5   2.9
539               Cody Zeller   C  27  CHO  45  ...      0.0  -0.6  -0.6   0.4
540                Ante Žižić   C  23  CLE  16  ...     -2.3  -1.4  -3.6  -0.1
541               Ivica Zubac   C  22  LAC  48  ...      0.4   2.3   2.7   1.0

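I want to scrape a second URL, where the table is formatted exactly like this one, and append the player stats from that table to this one, if that makes sense. The problem is that some of the stats on the second URL appear in both tables. When I "merge" the two tables, I don't want to add those again. How can I do this?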

I think you should use drop_duplicates(). Here is a simplified example:

import pandas as pd

df = pd.DataFrame([["foo", "bar"],["foo2", "bar2"],["foo3", "bar3"]], columns=["first_column", "second_column"])
df2 = pd.DataFrame([["foo3", "bar4"],["foo4", "bar5"],["foo5", "bar6"]], columns=["first_column", "second_column"])

print(pd.concat([df, df2], ignore_index=True).drop_duplicates(subset="first_column"))
Output:

  first_column second_column
0          foo           bar
1         foo2          bar2
2         foo3          bar3
4         foo4          bar5
5         foo5          bar6
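As you can see, the "foo3" row from the second DataFrame is filtered out because that value is already present in the first DataFrame.

In your case, you could use something like this:

pd.concat([stats, stats2], ignore_index=True).drop_duplicates(subset="Player")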
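One caveat with deduplicating on "Player" alone: if the same name legitimately appears on more than one row (for example, once per team), those rows collapse into one. A minimal sketch of deduplicating on a pair of columns instead; the frames stats_a and stats_b and all of their rows are made up for illustration, not scraped data:

```python
import pandas as pd

# Made-up rows for illustration only; not real scraped data.
stats_a = pd.DataFrame(
    [["Player A", "AAA", 10], ["Player B", "BBB", 20]],
    columns=["Player", "Tm", "G"],
)
stats_b = pd.DataFrame(
    [["Player B", "BBB", 20], ["Player B", "CCC", 5], ["Player C", "CCC", 7]],
    columns=["Player", "Tm", "G"],
)

combined = pd.concat([stats_a, stats_b], ignore_index=True)

# Deduplicating on Player alone also drops the (Player B, CCC) row.
by_player = combined.drop_duplicates(subset="Player")

# Deduplicating on (Player, Tm) keeps one row per player-team pair.
by_player_team = combined.drop_duplicates(subset=["Player", "Tm"]).reset_index(drop=True)

print(len(by_player), len(by_player_team))
```

Which columns make a row unique depends on the table, so the subset here is an assumption to adapt rather than a rule.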

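You are doing a lot of work to get the HTML tags into a table. Let pandas do that for you (pd.read_html parses the page using lxml or BeautifulSoup under the hood). Then, to do the merge, there are two approaches:

1) Make one of the DataFrames contain only what the other DataFrame does not have (but keep the columns you want to merge on)

2) Drop from the second DataFrame the columns that are already in the first (again, make sure not to drop the columns you want to merge on)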

import pandas as pd

def scrape_data(url):
    # read_html returns a list of the tables it finds; take the first one
    stats = pd.read_html(url)[0]
    return stats


df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
# Drop the header rows that are repeated inside the table body
df1 = df1[df1['Rk'] != 'Rk']

df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']

# Approach 1: keep only the df2 columns that df1 does not already have
uniqueCols = [ col for col in df2.columns if col not in df1.columns ]

# Alternative to the line above (difference() may return the names sorted)
#uniqueCols = list(df2.columns.difference(df1.columns))

# Add the merge keys back, then merge
df2 = df2[uniqueCols + ['Player', 'Tm']]

df = df1.merge(df2, how='left', on=['Player', 'Tm'])
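The column-selection step can be tried without a network request. A small sketch with invented numbers (the frames adv and poss, the player names, and every value are hypothetical), showing that only the columns unique to the second frame end up in the result:

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped tables.
adv = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Tm": ["AAA", "BBB"],
    "G": [10, 20],          # shared column
    "PER": [15.0, 18.5],    # only in the first table
})
poss = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Tm": ["AAA", "BBB"],
    "G": [10, 20],          # shared column
    "PTS": [22.1, 30.4],    # only in the second table
})

# Columns of the second table that the first does not have.
unique_cols = [col for col in poss.columns if col not in adv.columns]

# Keep those columns plus the merge keys, then merge.
merged = adv.merge(poss[unique_cols + ["Player", "Tm"]], how="left", on=["Player", "Tm"])

print(merged.columns.tolist())  # ['Player', 'Tm', 'G', 'PER', 'PTS']
```

Because the shared column G is removed from the second frame before merging, the result has no G_x / G_y pair.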


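When you say some stats are on both tables, do you mean that the two tables contain exactly the same rows (i.e. identical in every column), or that they share the same unique key (probably the Player column)?

@cron The same unique key. For example, both tables show a player's PPG, and I don't want it to show up twice.

So you mean dropping from the second DataFrame the columns that are already in the first. You will still need at least one common column to merge on, though (probably the player name, but be careful if there are players with the same name; you may need to merge on other keys/columns).
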
import pandas as pd

def scrape_data(url):
    # read_html returns a list of the tables it finds; take the first one
    stats = pd.read_html(url)[0]
    return stats


df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
# Drop the header rows that are repeated inside the table body
df1 = df1[df1['Rk'] != 'Rk']

df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']

# Approach 2: drop from df2 every column df1 already has, except the merge keys
dropCols = [ col for col in df1.columns if col in df2.columns and col not in ['Player','Tm']]
df2 = df2.drop(dropCols, axis=1)

df = df1.merge(df2, how='left', on=['Player', 'Tm'])
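On the warning about repeated player names: one way to catch a non-unique merge key is the validate argument of DataFrame.merge, which raises pandas.errors.MergeError when the keys are not unique in the way you declared. A hedged sketch with invented data (the frames left and right and their rows are hypothetical):

```python
import pandas as pd

# Hypothetical data: "Player A" appears on two rows, one per team.
left = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "Tm": ["AAA", "BBB", "CCC"],
    "PER": [15.0, 12.0, 18.5],
})
right = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "Tm": ["AAA", "BBB", "CCC"],
    "PTS": [22.1, 19.8, 30.4],
})

# Merging on Player alone pairs every "Player A" row on the left with
# every "Player A" row on the right, so the result has extra rows.
loose = left.merge(right, how="left", on="Player")

# validate="one_to_one" turns that silent duplication into an error.
try:
    left.merge(right, how="left", on="Player", validate="one_to_one")
    name_is_unique = True
except pd.errors.MergeError:
    name_is_unique = False

# Merging on both keys is unique here, so validation passes.
strict = left.merge(right, how="left", on=["Player", "Tm"], validate="one_to_one")

print(len(loose), name_is_unique, len(strict))
```

Whether ['Player', 'Tm'] is unique in the real tables is something to check against the scraped data; the validate call is just a cheap way to find out.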