Python: match column names against another DataFrame and replace them with their IDs
I have a master DataFrame called master that contains the IDs of all questions. I have multiple datasets that use these questions as column headers, and I want to replace those headers with their IDs.
The master table looks like this:
Question ID
gender 1
sex 1
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
df1 looks like this:
gender marital status occupation
Male Single Doctor
Male Divorced Engineer
Expected output:
1 2 3
Male Single Doctor
Male Divorced Engineer
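In addition, if df1 contains a new variable that has no ID in the master table, it should be given a new ID, and the variable name and its ID should be added to the master table.
For example, df2 looks like this: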
gender marital status country
Male Single India
Male Divorced UK
Desired df2:
1 2 4
Male Single India
Male Divorced UK
The updated master table would be:
Question ID
gender 1
sex 1
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
country 4
Use rename with a Series to set the new column names from the other DataFrame (df is the master table here):
df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
1 2 3
0 Male Single Doctor
1 Male Divorced Engineer
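For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample tables from the question and applies the same rename. The names master and df1 follow the question; how the sample data is constructed is my own assumption.

```python
import pandas as pd

# Sample master table from the question: several question texts can share one ID.
master = pd.DataFrame({
    "Question": ["gender", "sex", "what is your gender", "sexual orientation",
                 "marital status", "occupation", "whats you job"],
    "ID": [1, 1, 1, 1, 2, 3, 3],
})

# Sample dataset whose column headers are question texts.
df1 = pd.DataFrame({
    "gender": ["Male", "Male"],
    "marital status": ["Single", "Divorced"],
    "occupation": ["Doctor", "Engineer"],
})

# Series indexed by question text; rename() looks up each column label in it
# and leaves labels without a match unchanged.
mapping = master.set_index("Question")["ID"]
renamed = df1.rename(columns=mapping)

print(renamed.columns.tolist())
```

Column labels that are missing from the mapping are left untouched, so a dataset with unknown questions keeps those headers as text.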
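EDIT:
There are duplicates in the Question values of df, so the Question values need to be made unique first. One possible solution is to remove the duplicates with drop_duplicates. Here is some sample data to show how it works: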
print (df)
Question ID
0 gender 10 <-duplicates, change ID for test
1 gender 15 <-duplicates, change ID for test
2 what is your gender 1
3 sexual orientation 1
4 marital status 2
5 occupation 3
6 whats you job 3
Remove the duplicates and keep the first duplicated row, here ID=10:
print (df.drop_duplicates('Question').set_index('Question')['ID'])
Question
gender 10
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
Name: ID, dtype: int64
df21 = df1.rename(columns=df.drop_duplicates('Question').set_index('Question')['ID'])
print (df21)
10 2 3
0 Male Single Doctor
1 Male Divorced Engineer
Remove the duplicates and keep the last duplicated row, here ID=15:
print (df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
Question
gender 15
what is your gender 1
sexual orientation 1
marital status 2
occupation 3
whats you job 3
Name: ID, dtype: int64
df22 = df1.rename(columns=df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
print (df22)
15 2 3
0 Male Single Doctor
1 Male Divorced Engineer
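Alternatively, convert the Series to a dictionary; for duplicated keys only the last value is kept:
print (df.set_index('Question')['ID'].to_dict())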
{'gender': 15, 'what is your gender': 1, 'sexual orientation': 1, 'marital status': 2, 'occupation': 3, 'whats you job': 3}
df22 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
print (df22)
15 2 3
0 Male Single Doctor
1 Male Divorced Engineer
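Both drop_duplicates and to_dict silently pick one ID when the same Question appears with different IDs. If that matters, it may help to detect such conflicts before renaming. A minimal sketch, assuming the master table has the columns Question and ID; the helper name find_conflicting_questions is hypothetical and not part of pandas:

```python
import pandas as pd


def find_conflicting_questions(master: pd.DataFrame) -> pd.Series:
    """Return, per Question, the number of distinct IDs when it is more than one.

    Hypothetical helper for illustration; assumes columns 'Question' and 'ID'.
    """
    n_ids = master.groupby("Question")["ID"].nunique()
    return n_ids[n_ids > 1]


# Test data similar to the example above: 'gender' appears with two different IDs.
master = pd.DataFrame({
    "Question": ["gender", "gender", "what is your gender", "marital status"],
    "ID": [10, 15, 1, 2],
})

conflicts = find_conflicting_questions(master)
print(conflicts)
```

An empty result means every Question maps to exactly one ID, so either way of dropping duplicates gives the same mapping.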
EDIT1: If some values do not exist in the master DataFrame and have to be appended first, use:
print (df)
Question ID
0 gender 1
1 sex 1
2 what is your gender 1
3 sexual orientation 1
4 marital status 2
5 occupation 3
6 whats you job 3
print (df1)
gender marital status country code1 code2
0 Male Single India 4 7
1 Male Divorced UK 3 5
Get all columns that do not exist in df['Question']:
cols = df1.columns.difference(df['Question'].tolist(), sort=False)
print (cols)
Index(['country', 'code1', 'code2'], dtype='object')
Assign new IDs, continuing from the current maximum ID:
df3 = pd.DataFrame({'Question':cols,
'ID': np.arange(df['ID'].max() + 1, len(cols) + df['ID'].max() + 1)})
print (df3)
Question ID
0 country 4
1 code1 5
2 code2 6
Append them to the original master DataFrame:
df = pd.concat([df, df3], ignore_index=True)
print (df)
Question ID
0 gender 1
1 sex 1
2 what is your gender 1
3 sexual orientation 1
4 marital status 2
5 occupation 3
6 whats you job 3
7 country 4
8 code1 5
9 code2 6
Finally, use the original solution:
df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
1 2 4 5 6
0 Male Single India 4 7
1 Male Divorced UK 3 5
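The EDIT1 steps can be wrapped into one function so that every new dataset is handled the same way. This is only a sketch under stated assumptions: the function name rename_with_master is hypothetical, the master table is assumed to have unique Question values and integer IDs, and new IDs simply continue from the current maximum.

```python
import numpy as np
import pandas as pd


def rename_with_master(data: pd.DataFrame, master: pd.DataFrame):
    """Rename the columns of `data` to IDs, extending `master` with unseen columns.

    Hypothetical helper; returns (renamed_data, updated_master).
    """
    # Columns of `data` that have no entry in the master table yet.
    new_cols = data.columns.difference(master["Question"].tolist(), sort=False)
    if len(new_cols) > 0:
        start = master["ID"].max() + 1
        extra = pd.DataFrame({"Question": new_cols,
                              "ID": np.arange(start, start + len(new_cols))})
        master = pd.concat([master, extra], ignore_index=True)
    # Plain dict lookup: question text -> ID.
    mapping = master.set_index("Question")["ID"].to_dict()
    return data.rename(columns=mapping), master


master = pd.DataFrame({
    "Question": ["gender", "sex", "what is your gender", "sexual orientation",
                 "marital status", "occupation", "whats you job"],
    "ID": [1, 1, 1, 1, 2, 3, 3],
})
df2 = pd.DataFrame({
    "gender": ["Male", "Male"],
    "marital status": ["Single", "Divorced"],
    "country": ["India", "UK"],
})

renamed, updated = rename_with_master(df2, master)
print(renamed.columns.tolist())
print(updated.tail(1))
```

Returning the updated master instead of modifying it in place means the caller decides when to persist it, which seems safer when several datasets are processed in sequence.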
You can rename using the ID of the matching question:
df1.columns = [int(master.loc[master.Question == c, 'ID'].iloc[0]) for c in df1.columns]
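This should work when a given column has several possible names.
What about the other questions that share the same ID, such as gender, sex, and sexual orientation?
@JayPeerachai So basically, if a variable in another dataset appears under any of these names, e.g. sex or sexual orientation, it should be replaced by id=1, because all of these variables refer to the same thing.
@HemantSain - Can you test df2 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
@HemantSain - The answer was edited with a possible solution.
@jezrael Yes, if the duplicates all have the same ID there is no problem; otherwise preprocessing is needed. +1 :)
@jezrael Thank you very much. It would be great if you could help me with the second part of the question as well.
@HemantSain - I don't understand: how does Python know that country is 4? Does it mean that, because the last value in the master table is 3, 3+1=4 is used?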