Python 3: combining or merging columns that contain similar data
I have a DataFrame, and I am trying to update the Sex column using the Gender column.
import pandas as pd
import numpy as np
df=pd.DataFrame({'Users': [ 'Al Gore', 'Ned Flonders', 'Kim jong un', 'Al Sharpton', 'Michele', 'Richard Johnson', 'Taylor Swift', 'Alf pig', 'Dick Johnson', 'Dana Jovy'],
'Gender': [np.nan,'Male','Male','Male',np.nan,np.nan, 'Female',np.nan,'Male','Female'],
'Sex': ['M',np.nan,np.nan,'M','F',np.nan, 'F',np.nan,np.nan,'F']})
Output:
>>>
Gender Sex Users
0 NaN M Al Gore
1 Male NaN Ned Flonders
2 Male NaN Kim jong un
3 Male M Al Sharpton
4 NaN F Michele
5 NaN NaN Richard Johnson
6 Female F Taylor Swift
7 NaN NaN Alf pig
8 Male NaN Dick Johnson
9 Female F Dana Jovy
[10 rows x 3 columns]
So wherever the Gender column says Male, the Sex column should show M (and likewise F for Female).
My attempt so far:
df['Sex2']=(df.Gender.isin(['Male']).map({True:'M',False:''}) +
df.Sex.isin(['M']).map({True:'M',False:''}) +
df.Sex.isin(['F']).map({True:'F',False:''})+
df.Gender.isin(['Female']).map({True:'F',False:''}))
print(df)
Output:
Gender Sex Users Sex2
0 NaN M Al Gore M
1 Male NaN Ned Flonders M
2 Male NaN Kim jong un M
3 Male M Al Sharpton MM
4 NaN F Michele F
5 NaN NaN Richard Johnson
6 Female F Taylor Swift FF
7 NaN NaN Alf pig
8 Male NaN Dick Johnson M
9 Female F Dana Jovy FF
[10 rows x 4 columns]
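This is almost right, but values present in both columns get doubled (MM, FF), and it is probably not very efficient.
This is the output I want: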
>>>
Gender Sex Users
0 NaN M Al Gore
1 Male M Ned Flonders
2 Male M Kim jong un
3 Male M Al Sharpton
4 NaN F Michele
5 NaN NaN Richard Johnson
6 Female F Taylor Swift
7 NaN NaN Alf pig
8 Male M Dick Johnson
9 Female F Dana Jovy
[10 rows x 3 columns]
Is it possible to do this with some merge or update function?

Use map:
In [14]:
import pandas as pd
import numpy as np
df=pd.DataFrame({'Users': [ 'Al Gore', 'Ned Flonders', 'Kim jong un', 'Al Sharpton', 'Michele', 'Richard Johnson', 'Taylor Swift', 'Alf pig', 'Dick Johnson', 'Dana Jovy'],
'Gender': [np.nan,'Male','Male','Male',np.nan,np.nan, 'Female',np.nan,'Male','Female'],
'Sex': ['M',np.nan,np.nan,'M','F',np.nan, 'F',np.nan,np.nan,'F']})
In [15]:
df
Out[15]:
Gender Sex Users
0 NaN M Al Gore
1 Male NaN Ned Flonders
2 Male NaN Kim jong un
3 Male M Al Sharpton
4 NaN F Michele
5 NaN NaN Richard Johnson
6 Female F Taylor Swift
7 NaN NaN Alf pig
8 Male NaN Dick Johnson
9 Female F Dana Jovy
[10 rows x 3 columns]
In [16]:
# create a sex dict
sex_map = {'Male':'M', 'Female':'F'}
# update only those where sex is NaN, apply map to gender to fill in values
df.loc[df.Sex.isnull(),'Sex'] = df['Gender'].map(sex_map)
df
Out[16]:
Gender Sex Users
0 NaN M Al Gore
1 Male M Ned Flonders
2 Male M Kim jong un
3 Male M Al Sharpton
4 NaN F Michele
5 NaN NaN Richard Johnson
6 Female F Taylor Swift
7 NaN NaN Alf pig
8 Male M Dick Johnson
9 Female F Dana Jovy
[10 rows x 3 columns]
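Since the question asks about a merge- or update-style function, here is a possible alternative that is not part of the answer above: a minimal sketch that assumes `Series.fillna` is acceptable here. It fills only the missing Sex values from the mapped Gender column, so no boolean mask is needed. The name `from_gender` is introduced purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Users': ['Al Gore', 'Ned Flonders', 'Kim jong un', 'Al Sharpton', 'Michele',
              'Richard Johnson', 'Taylor Swift', 'Alf pig', 'Dick Johnson', 'Dana Jovy'],
    'Gender': [np.nan, 'Male', 'Male', 'Male', np.nan, np.nan,
               'Female', np.nan, 'Male', 'Female'],
    'Sex': ['M', np.nan, np.nan, 'M', 'F', np.nan, 'F', np.nan, np.nan, 'F'],
})

sex_map = {'Male': 'M', 'Female': 'F'}

# Translate Gender into the single-letter codes used by Sex (hypothetical helper name).
from_gender = df['Gender'].map(sex_map)

# Keep existing Sex values; fill only the NaN entries from the mapped Gender column.
df['Sex'] = df['Sex'].fillna(from_gender)
print(df)
```

Rows where both columns are NaN (Richard Johnson and Alf pig in the sample) stay NaN, as in the desired output.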
Comparing performance:
In [21]:
%timeit df['Sex2']=(df.Gender.isin(['Male']).map({True:'M',False:''}) + df.Sex.isin(['M']).map({True:'M',False:''}) + df.Sex.isin(['F']).map({True:'F',False:''})+ df.Gender.isin(['Female']).map({True:'F',False:''}))
100 loops, best of 3: 2.38 ms per loop
In [24]:
%timeit df.loc[df.Sex.isnull(),'Sex'] = df['Gender'].map(sex_map)
1000 loops, best of 3: 1.21 ms per loop
In [27]:
# without the NaN mask which is similar to what you are doing
%timeit df['Sex'] = df['Gender'].map(sex_map)
1000 loops, best of 3: 531 µs per loop
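So on this small sample it is faster, and on much larger DataFrames it should scale better because map uses Cython.

Thanks Ed. Is there a way to make it case-insensitive?

You could use a function instead of a dict and lowercase/uppercase the value first, or just add the different combinations to the dict, as long as you don't expect too many variations.

@ccsv I added another example where there is no boolean masking and the Sex column is simply assigned. It is nearly 5 times faster than your attempt, so I think this is the optimal approach if you can guarantee consistent text strings, or if you add extra keys to the map when you are worried about mixed case.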
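The case-insensitive idea from the comments could look something like the sketch below. It assumes lowercasing with `Series.str.lower()` is enough normalization for the data at hand; the sample values and the name `from_gender` are made up for illustration. Note that assigning the mapped column directly, without a mask, would overwrite Sex values in rows where Gender is NaN (rows 0 and 4 in the question's data), so this sketch fills only the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with mixed-case Gender values.
df = pd.DataFrame({
    'Gender': ['male', 'FEMALE', np.nan, 'Male', np.nan],
    'Sex': [np.nan, np.nan, 'F', 'M', np.nan],
})

# Keys are lowercase because the Gender values are lowercased before the lookup.
sex_map = {'male': 'M', 'female': 'F'}

# NaN stays NaN through both .str.lower() and .map().
from_gender = df['Gender'].str.lower().map(sex_map)

# Fill only the missing Sex values, keeping the ones already present.
df['Sex'] = df['Sex'].fillna(from_gender)
print(df)
```

Lowercasing once keeps the dict small, compared with adding a key for every capitalization variant.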