Python: combining DataFrames


I have the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

with open('JavaSafe.p', 'rb') as f:
    java = pickle.load(f)  # import 2d array
with open('PythonSafe.p', 'rb') as f:
    python = pickle.load(f)  # import 2d array

javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)
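I want to create a new DataFrame that uses the town name as its index and has one column each for Java and Python. However, some towns will only have results for one of the languages.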

                Town  Java Jobs
435          York,NY       3593
212       NewYork,NY       3585
584       Seattle,WA       2080
624       Chicago,IL       1920
301        Boston,MA       1571
...
79        Holland,MI          5
38      Manhattan,KS          5
497        Vernon,IL          5
30        Clayton,MO          5
90       Waukegan,IL          5

[653 rows x 2 columns] 

                 Town  Python Jobs
160       NewYork,NY         2949
11           York,NY         2938
349       Seattle,WA         1321
91        Chicago,IL         1312
167        Boston,MA         1117

383       Hanover,NH            5
209      Bulverde,TX            5
203     Salisbury,NC            5
67       Rockford,IL            5
256       Ventura,CA            5

[416 rows x 2 columns]

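By default, pd.merge joins two DataFrames on all the columns they share. In this case, javaFrame and pythonFrame share only the Town column, so by default pd.merge joins the two DataFrames on the Town column.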
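
To make the default key selection concrete, here is a minimal sketch. The two small frames and the names java_df, python_df and shared are made up for illustration; only pd.merge and its on parameter come from pandas itself.

```python
import pandas as pd

# Hypothetical miniature versions of javaFrame and pythonFrame.
java_df = pd.DataFrame({'Town': ['Boston,MA', 'Holland,MI'],
                        'Java Jobs': [1571, 5]})
python_df = pd.DataFrame({'Town': ['Boston,MA', 'Hanover,NH'],
                          'Python Jobs': [1117, 5]})

# Columns present in both frames; these become the join keys when `on` is omitted.
shared = java_df.columns.intersection(python_df.columns).tolist()

implicit = pd.merge(java_df, python_df)             # keys inferred from shared columns
explicit = pd.merge(java_df, python_df, on='Town')  # keys spelled out

print(shared)
print(implicit.equals(explicit))
```

Spelling out on='Town' is usually the safer habit: if the two frames later gain another column with the same name, the inferred key set would silently change.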


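how='outer' causes pd.merge to perform an outer join, that is, to use the union of the keys from both frames. In other words, it causes pd.merge to return rows whose data come from either javaFrame or pythonFrame, even if only one of the DataFrames contains the Town. Missing data is filled with NaNs.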
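
As a rough illustration of what the outer join leaves behind, the sketch below uses the indicator=True option of pd.merge to label where each row came from, then replaces the NaNs with 0. The small frames and variable names are hypothetical, and treating a missing town as zero postings is an assumption that may not fit the real data.

```python
import pandas as pd

# Hypothetical miniature versions of javaFrame and pythonFrame.
java_df = pd.DataFrame({'Town': ['Boston,MA', 'Holland,MI'],
                        'Java Jobs': [1571, 5]})
python_df = pd.DataFrame({'Town': ['Boston,MA', 'Hanover,NH'],
                          'Python Jobs': [1117, 5]})

# indicator=True adds a '_merge' column holding 'both', 'left_only' or 'right_only'.
outer = pd.merge(java_df, python_df, how='outer', indicator=True).set_index('Town')
print(outer)

# NaN forces the job columns to float; under the zero-postings assumption
# they can be filled and cast back to integers.
counts = outer.drop(columns='_merge').fillna(0).astype(int)
print(counts)
```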

Alternatively, use pd.concat (the one-liner that follows the pd.merge example below).

import pandas as pd

javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
     'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
     'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])

result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
#               Java Jobs  Python Jobs
# Town                                
# York,NY          3593.0       2938.0
# NewYork,NY       3585.0       2949.0
# Seattle,WA       2080.0       1321.0
# Chicago,IL       1920.0       1312.0
# Boston,MA        1571.0       1117.0
# Holland,MI          5.0          NaN
# Manhattan,KS        5.0          NaN
# Vernon,IL           5.0          NaN
# Clayton,MO          5.0          NaN
# Waukegan,IL         5.0          NaN
# Hanover,NH          NaN          5.0
# Bulverde,TX         NaN          5.0
# Salisbury,NC        NaN          5.0
# Rockford,IL         NaN          5.0
# Ventura,CA          NaN          5.0

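result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town') is, I think, what they are after! Given the original code, you can also name the key explicitly:
result = pd.merge(pythonFrame, javaFrame, on='Town', how='outer').set_index('Town')

The pd.concat version: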
df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)

              Java Jobs  Python Jobs
Boston,MA        1571.0       1117.0
Bulverde,TX         NaN          5.0
Chicago,IL       1920.0       1312.0
Clayton,MO          5.0          NaN
Hanover,NH          NaN          5.0
Holland,MI          5.0          NaN
Manhattan,KS        5.0          NaN
NewYork,NY       3585.0       2949.0
Rockford,IL         NaN          5.0
Salisbury,NC        NaN          5.0
Seattle,WA       2080.0       1321.0
Ventura,CA          NaN          5.0
Vernon,IL           5.0          NaN
Waukegan,IL         5.0          NaN
York,NY          3593.0       2938.0
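
The two approaches should agree once the rows are put in the same order. The sketch below checks that on small made-up frames (java_df, python_df and the other names are hypothetical). It also asserts that each Town index is unique, since pd.concat with axis=1 aligns on index labels and duplicate towns would raise an error or give ambiguous alignment.

```python
import pandas as pd

# Hypothetical miniature versions of javaFrame and pythonFrame.
java_df = pd.DataFrame({'Town': ['Boston,MA', 'Holland,MI'],
                        'Java Jobs': [1571, 5]})
python_df = pd.DataFrame({'Town': ['Boston,MA', 'Hanover,NH'],
                          'Python Jobs': [1117, 5]})

frames = [df.set_index('Town') for df in (java_df, python_df)]

# Column-wise concat aligns rows by index label, so labels must be unique.
assert all(frame.index.is_unique for frame in frames)

via_concat = pd.concat(frames, axis=1).sort_index()
via_merge = (pd.merge(java_df, python_df, on='Town', how='outer')
               .set_index('Town')
               .sort_index())

# Compare the two results after sorting both by town name.
print(via_concat.equals(via_merge))
```

If a town can appear more than once in either source, pd.merge is the more forgiving choice, or the duplicates can be aggregated first (for example with groupby('Town').sum()).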