
Python: match column names stored in another DataFrame and replace them with their IDs


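I have a master DataFrame called master that contains the IDs of all questions. I also have several data sets that use these questions as column headers, and I want to replace those headers with their IDs.

The master table looks like this: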

Question               ID

gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3
df1 looks like this:

gender         marital status  occupation

Male           Single          Doctor
Male           Divorced        Engineer
Desired output:

   1            2                 3                 

   Male        Single            Doctor
   Male        Divorced          Engineer
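In addition, if a data set contains a new variable that has no ID in the master table, it should be given a new ID, and the variable name and ID should be added to the master table.

For example, df2 looks like this: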

gender         marital status  country

Male           Single          India
Male           Divorced        UK
Desired df2:

1                 2              4

Male           Single          India
Male           Divorced        UK
The updated master table would be:

Question               ID

gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3
country                4
Use rename with a Series built from the master DataFrame (called df here) to set the new column names:

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
      1         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer
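
For readers who want to run the snippet above end to end, here is a minimal self-contained sketch. The frames are rebuilt by hand from the tables in the question; the construction code and the variable name mapping are assumptions added for illustration and are not part of the original answer.

```python
import pandas as pd

# Master table from the question (the answer calls it df).
df = pd.DataFrame({
    'Question': ['gender', 'sex', 'what is your gender', 'sexual orientation',
                 'marital status', 'occupation', 'whats you job'],
    'ID': [1, 1, 1, 1, 2, 3, 3],
})

# Data set whose headers should be replaced by IDs.
df1 = pd.DataFrame({
    'gender': ['Male', 'Male'],
    'marital status': ['Single', 'Divorced'],
    'occupation': ['Doctor', 'Engineer'],
})

# Series indexed by question text, with the ID as its value.
mapping = df.set_index('Question')['ID']

# rename takes a dict-like mapper; headers missing from it are left unchanged.
df2 = df1.rename(columns=mapping)
print(df2)
```

Because rename returns a new DataFrame, df1 keeps its original headers and can be inspected alongside df2.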
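EDIT:

The Question column of df contains duplicated values, so the Question values have to be made unique first. One possible solution is to remove the duplicates with drop_duplicates. Here is sample data to show how it works: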

print (df)
              Question  ID
0               gender  10 <-duplicates, change ID for test
1               gender  15 <-duplicates, change ID for test
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

Remove the duplicates and keep the first duplicated row, here ID=10:

print (df.drop_duplicates('Question').set_index('Question')['ID'])
Question
gender                 10
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df21 = df1.rename(columns=df.drop_duplicates('Question').set_index('Question')['ID'])
print (df21)
     10        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer
Remove the duplicates and keep the last duplicated row, here ID=15:

print (df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
Question
gender                 15
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df22 = df1.rename(columns=df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
print (df22)
     15        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer
Alternatively, convert the Series to a dictionary. For duplicated keys the last value is kept, here ID=15:


print (df.set_index('Question')['ID'].to_dict())
{'gender': 15, 'what is your gender': 1, 'sexual orientation': 1, 'marital status': 2, 'occupation': 3, 'whats you job': 3}



df22 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
print (df22)
     15        2         3 
0  Male    Single    Doctor
1  Male  Divorced  Engineer
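
Dropping duplicates silently picks one ID when the same question text is listed with different IDs. Before renaming, it may help to list such conflicts so they can be resolved by hand. The following is a small sketch of that check; the helper name find_conflicting_questions is hypothetical and not part of the original answer.

```python
import pandas as pd

def find_conflicting_questions(master: pd.DataFrame) -> pd.Series:
    """Return the number of distinct IDs for each question that has more than one."""
    # Count the distinct IDs attached to each question text.
    n_ids = master.groupby('Question')['ID'].nunique()
    # Keep only the questions mapped to more than one ID.
    return n_ids[n_ids > 1]

# Sample data in the spirit of the test frame above: 'gender' has two IDs.
master = pd.DataFrame({
    'Question': ['gender', 'gender', 'what is your gender', 'marital status'],
    'ID': [10, 15, 1, 2],
})

conflicts = find_conflicting_questions(master)
print(conflicts)
```

An empty result means every question text maps to exactly one ID, so the plain rename can be used without drop_duplicates.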
EDIT1: If some column names do not exist in the master DataFrame and have to be appended first, use:

print (df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

print (df1) 
  gender marital status country  code1  code2
0   Male         Single   India      4      7
1   Male       Divorced      UK      3      5
Get all columns that do not exist in df['Question']:

cols = df1.columns.difference(df['Question'].tolist(), sort=False)
print (cols)
Index(['country', 'code1', 'code2'], dtype='object')
Assign the next IDs, continuing from the current maximum ID:

import numpy as np
import pandas as pd
df3 = pd.DataFrame({'Question':cols, 
                    'ID': np.arange(df['ID'].max() + 1, len(cols) + df['ID'].max() + 1)})
print (df3) 
  Question  ID
0  country   4
1    code1   5
2    code2   6
Append them to the original master DataFrame:

df = pd.concat([df, df3], ignore_index=True)
print (df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3
7              country   4
8                code1   5
9                code2   6
Finally, use the original solution:

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print (df2)
      1         2      4  5  6
0  Male    Single  India  4  7
1  Male  Divorced     UK  3  5
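
The EDIT1 steps above can be folded into one function that is called once per data set. This is only a sketch under stated assumptions: the function name rename_with_master is hypothetical, the master table is assumed to be non-empty with integer IDs, and new headers receive consecutive IDs after the current maximum, as in the steps above.

```python
import pandas as pd

def rename_with_master(master: pd.DataFrame, data: pd.DataFrame):
    """Return (renamed data, updated master); unknown headers get fresh IDs."""
    known = set(master['Question'])
    # Headers not yet in the master table, in their original order.
    new_cols = [c for c in data.columns if c not in known]
    updated = master
    if new_cols:
        # Assumes master is non-empty, so max() is a valid integer.
        start = int(master['ID'].max()) + 1
        additions = pd.DataFrame({
            'Question': new_cols,
            'ID': range(start, start + len(new_cols)),
        })
        updated = pd.concat([master, additions], ignore_index=True)
    # Plain dict lookup; for a duplicated question the last ID wins.
    mapping = updated.set_index('Question')['ID'].to_dict()
    return data.rename(columns=mapping), updated

master = pd.DataFrame({
    'Question': ['gender', 'sex', 'what is your gender', 'sexual orientation',
                 'marital status', 'occupation', 'whats you job'],
    'ID': [1, 1, 1, 1, 2, 3, 3],
})
df2 = pd.DataFrame({
    'gender': ['Male', 'Male'],
    'marital status': ['Single', 'Divorced'],
    'country': ['India', 'UK'],
})

renamed, master = rename_with_master(master, df2)
print(renamed)
print(master)
```

Returning the updated master table and reassigning it keeps the IDs consistent when the function is applied to several data sets in turn.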

You can rename the columns using the ID of the matching question:

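# .iloc[0] takes the first matching ID; it raises IndexError if a column is missing from master
df1.columns = [int(master.loc[master.Question == c, 'ID'].iloc[0]) for c in df1.columns]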

This should also work when a given column has several possible names.
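
The list comprehension above filters the master table once per column. A variant of the same idea builds a lookup dictionary once and leaves headers without a known ID untouched. This is a sketch only: the name lookup and the shortened sample frames are assumptions made for illustration.

```python
import pandas as pd

master = pd.DataFrame({
    'Question': ['gender', 'sex', 'marital status', 'occupation'],
    'ID': [1, 1, 2, 3],
})
df1 = pd.DataFrame({
    'sex': ['Male'],
    'marital status': ['Single'],
    'country': ['India'],
})

# Build the lookup once instead of filtering master for every column.
lookup = dict(zip(master['Question'], master['ID']))

# dict.get falls back to the original header when no ID is known.
df1.columns = [lookup.get(c, c) for c in df1.columns]
print(df1.columns.tolist())
```

Here 'sex' and 'marital status' are replaced by their IDs, while 'country' keeps its name because it is not listed in master.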

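What about the other questions that share the same ID, such as gender, sex and sexual orientation?
@JayPeerachai So basically, if a variable in another data set appears under any of these names, e.g. sex or sexual orientation, it should be replaced by id=1, because all of them refer to the same thing.
@HemantSain - Can you test df2 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())?
@HemantSain - The answer was edited with a possible solution.
@jezrael Yes, if there are several duplicates with the same ID there is no problem; otherwise preprocessing is needed. +1 :)
@jezrael Thank you so much. It would be great if you could help me with the second part of the question as well.
@HemantSain - I don't understand, how does Python know that country is 4?
It means that because the last value in the master table is 3, it uses 3+1=4.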