Python: match column names stored in another DataFrame and replace them with their IDs

python pandas data-manipulation

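I have a master DataFrame called master that holds the ID of every question. I also have several data sets whose column headers are these questions, and I want to replace those headers with the corresponding IDs.

The master table looks like this: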

Question               ID

gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3


df1 looks like this:

gender         marital status  occupation

Male           Single          Doctor
Male           Divorced        Engineer


Expected output:

   1            2                 3                 

   Male        Single            Doctor
   Male        Divorced          Engineer

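In addition, if a data set contains a new variable that has no ID in the master table, it should be given a new ID, and the variable name and its ID should be added to the master table.

For example, df2 looks like this: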

gender         marital status  country

Male           Single          India
Male           Divorced        UK


Desired df2:

1                 2              4

Male           Single          India
Male           Divorced        UK


The updated master table would be:

Question               ID

gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3
country                4

Use rename with a Series to set the new column names from the other DataFrame (here df is the master table):

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
      1         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer
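
For anyone who wants to reproduce this end to end, here is a minimal sketch. The construction of df and df1 is an assumption, since the question only shows the printed tables; the rename call itself is the one used above.

```python
import pandas as pd

# Master table from the question: several question wordings can share one ID.
df = pd.DataFrame({
    'Question': ['gender', 'sex', 'what is your gender', 'sexual orientation',
                 'marital status', 'occupation', 'whats you job'],
    'ID': [1, 1, 1, 1, 2, 3, 3],
})

# Data set whose headers should be replaced by IDs.
df1 = pd.DataFrame({
    'gender': ['Male', 'Male'],
    'marital status': ['Single', 'Divorced'],
    'occupation': ['Doctor', 'Engineer'],
})

# A Series indexed by Question acts as a label -> ID lookup for rename.
mapping = df.set_index('Question')['ID']
df2 = df1.rename(columns=mapping)
print(df2.columns.tolist())
```

Note that rename returns a new DataFrame, so df1 keeps its original headers.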

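EDIT:

If the Question values in df contain duplicates, they have to be made unique first. One possible solution is to remove the duplicates with drop_duplicates. Here is some sample data to show how it works: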

print (df)
              Question  ID
0               gender  10 <-duplicates, change ID for test
1               gender  15 <-duplicates, change ID for test
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

Remove duplicates and keep the first of the duplicated rows, here ID=10:

print (df.drop_duplicates('Question').set_index('Question')['ID'])
Question
gender                 10
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df21 = df1.rename(columns=df.drop_duplicates('Question').set_index('Question')['ID'])
print (df21)
     10        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer

Remove duplicates and keep the last of the duplicated rows, here ID=15:

print (df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
Question
gender                 15
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df22 = df1.rename(columns=df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
print (df22)
     15        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer


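Alternatively, convert the Series to a dictionary; for duplicated Question values the last ID is kept:

print (df.set_index('Question')['ID'].to_dict())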
{'gender': 15, 'what is your gender': 1, 'sexual orientation': 1, 'marital status': 2, 'occupation': 3, 'whats you job': 3}



df22 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
print (df22)
     15        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer
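
Which of these options is appropriate depends on whether the duplicated Question values actually disagree on their ID. Below is a small sketch of such a check, assuming the same test data as above; the names n_ids, conflicts and unique_pairs are illustrative only.

```python
import pandas as pd

# Test data from above: 'gender' appears twice with different IDs.
df = pd.DataFrame({
    'Question': ['gender', 'gender', 'what is your gender', 'sexual orientation',
                 'marital status', 'occupation', 'whats you job'],
    'ID': [10, 15, 1, 1, 2, 3, 3],
})

# Count the distinct IDs recorded for each question text.
n_ids = df.groupby('Question')['ID'].nunique()

# Questions mapped to more than one ID need a decision (keep first, keep last, ...).
conflicts = n_ids[n_ids > 1].index.tolist()
print(conflicts)

# Rows that repeat the same (Question, ID) pair are harmless and can be dropped.
unique_pairs = df.drop_duplicates(['Question', 'ID'])
```

If conflicts is empty, any of the approaches above gives the same mapping.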

EDIT1: If some column names do not exist in the master DataFrame and have to be appended first, use:

print (df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

print (df1) 
  gender marital status country  code1  code2
0   Male         Single   India      4      7
1   Male       Divorced      UK      3      5

Get all columns that do not exist in df['Question']:

cols = df1.columns.difference(df['Question'].tolist(), sort=False)
print (cols)
Index(['country', 'code1', 'code2'], dtype='object')

Assign the next IDs, continuing from the current maximum ID:

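import numpy as np
import pandas as pd

# New IDs continue from the current maximum ID in the master table.
start = df['ID'].max() + 1
df3 = pd.DataFrame({'Question': cols,
                    'ID': np.arange(start, start + len(cols))})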
print (df3) 
  Question  ID
0  country   4
1    code1   5
2    code2   6

Append them to the original master DataFrame:

df = pd.concat([df, df3], ignore_index=True)
print (df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3
7              country   4
8                code1   5
9                code2   6

Finally, apply the original solution:

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
      1         2      4  5  6
0  Male    Single  India  4  7
1  Male  Divorced     UK  3  5
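
The steps of EDIT1 can also be wrapped into a single function. This is only a sketch: the function name rename_with_master and its signature are hypothetical, it assumes the master table is non-empty with integer IDs, and it uses a plain list comprehension and range in place of Index.difference and np.arange.

```python
import pandas as pd


def rename_with_master(master, data):
    """Return (updated_master, renamed_data); neither input is modified."""
    # Columns of `data` that the master table does not know yet, in original order.
    known = set(master['Question'])
    new_cols = [c for c in data.columns if c not in known]

    if new_cols:
        # Continue numbering after the current maximum ID.
        start = int(master['ID'].max()) + 1
        extra = pd.DataFrame({'Question': new_cols,
                              'ID': range(start, start + len(new_cols))})
        master = pd.concat([master, extra], ignore_index=True)

    # A plain dict also copes with duplicated Question values (the last ID wins).
    mapping = master.set_index('Question')['ID'].to_dict()
    return master, data.rename(columns=mapping)


master = pd.DataFrame({
    'Question': ['gender', 'sex', 'what is your gender', 'sexual orientation',
                 'marital status', 'occupation', 'whats you job'],
    'ID': [1, 1, 1, 1, 2, 3, 3],
})
df2 = pd.DataFrame({
    'gender': ['Male', 'Male'],
    'marital status': ['Single', 'Divorced'],
    'country': ['India', 'UK'],
})

master, renamed = rename_with_master(master, df2)
print(renamed.columns.tolist())
```

Calling the function once per data set and passing the returned master into the next call keeps the IDs consistent across all data sets.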

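You can rename the columns using the ID of the matching question:
df1.columns = [int(master.loc[master.Question == c, 'ID'].iloc[0]) for c in df1.columns]

This should work when a given column has several possible names.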
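
To illustrate the point about several possible names, here is a short sketch with a data set whose headers use alternative wordings. The survey frame is made up for illustration, and the lookup is written with .iloc[0], which assumes every header occurs in the master table.

```python
import pandas as pd

master = pd.DataFrame({
    'Question': ['gender', 'sex', 'what is your gender', 'sexual orientation',
                 'marital status', 'occupation', 'whats you job'],
    'ID': [1, 1, 1, 1, 2, 3, 3],
})

# Hypothetical data set: headers are alternative wordings from the master table.
survey = pd.DataFrame({
    'sex': ['Male', 'Male'],
    'whats you job': ['Doctor', 'Engineer'],
})

# Look up the ID of the first matching row for each header.
# .iloc[0] raises IndexError if a header is missing from the master table.
survey.columns = [
    int(master.loc[master['Question'] == c, 'ID'].iloc[0])
    for c in survey.columns
]
print(survey.columns.tolist())
```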
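What about other questions that share the same ID, such as gender, sex, and sexual orientation?

@JayPeerachai So basically, if a variable in another data set appears under any of these names, e.g. sex or sexual orientation, it should be replaced by ID 1, because all of these variables refer to the same thing.

@HemantSain - can you test df2 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())

@HemantSain - the answer was edited with possible solutions.

@jezrael Yes, if multiple duplicates share the same ID there is no problem; otherwise preprocessing is needed. +1 :)

@jezrael Thanks a lot. It would be great if you could help me with the second part of the question as well.

@HemantSain - I don't understand: how does Python know that country is 4? Is it because the last value in the master table is 3, so 3 + 1 = 4 is used?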