Python: how to dynamically add missing columns when reading a batch of CSV files


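I have 12 CSV files to read into a single output DataFrame. The columns I want in the final output DataFrame are spread across several files, as shown below.

Columns in files 1-8

person_ID, Test_CODE, REGISTRATION_DATE, subject_CD, subject_DESCRIPTION, subject_TYPE

Columns in file 9

person_ID, Test_CODE, REGISTRATION_DATE, subject_Code, subject_DESCRIPTION, subject_Indicator

Columns in files 10-12

person_ID, Test_CODE, START_DATE, END_DATE, subject_Code, subject_DESCRIPTION, subject_Indicator

From my understanding of the domain, I know that the START_DATE and REGISTRATION_DATE columns mean the same thing.

Similarly, subject_CD and subject_Code mean the same thing.

So I tried to rename these columns with the code below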

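import pandas as pd

dfs = []
for f in files:
    # The inputs are CSV files, so read them with read_csv rather than read_excel
    df = pd.read_csv(f, sep=",", low_memory=False)
    print(df.columns)
    # Keep only the wanted columns that exist in this file, then unify their names
    df1 = df[df.columns.intersection(['person_ID','Test_CODE','REGISTRATION_DATE','subject_CD','subject_DESCRIPTION'])].rename(columns={'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'})
    dfs.append(df1)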
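However, I am not sure how to add a column dynamically, because files 1-9 are missing END_DATE. I just want an END_DATE column with no data in it. Only once that column exists can I append all the input DataFrames and get the final output DataFrame.

Alternatively, is it possible to append the DataFrames on their common columns and simply add END_DATE to the final output DataFrame (after appending)?

I would like my final DataFrame to have the columns shown below.

Columns in the final output DataFrame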

person_ID, Test_CODE, START_DATE, END_DATE, subject_Code, subject_DESCRIPTION

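I think you can first use rename and then reindex with the list of columns you want. Only the columns in the list are returned, and any listed column that does not exist in the DataFrame is added and filled with missing values: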

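import pandas as pd

# Map the alternative column names onto the names used in the final output
d = {'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'}
# Columns of the final output, in the desired order
cols = ['person_ID','Test_CODE','START_DATE','END_DATE',
        'subject_Code','subject_DESCRIPTION']

dfs = []
for f in files:
    # The inputs are CSV files, so read them with read_csv rather than read_excel
    df = pd.read_csv(f, sep=",", low_memory=False)
    print(df.columns)
    # Rename first, then reindex: missing columns (e.g. END_DATE) are added as NaN
    df1 = df.rename(columns=d).reindex(columns=cols)
    dfs.append(df1)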
List comprehension alternative:

dfs = [pd.read_csv(f, sep=",", low_memory=False).rename(columns=d).reindex(columns=cols)
       for f in files]
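
For reference, here is a self-contained sketch of the same rename-then-reindex step run end to end with pd.read_csv. The `sources` list and its CSV text are made up for illustration and only stand in for the real `files` list; `frames` and `combined` are likewise illustrative names.

```python
import io
import pandas as pd

d = {'subject_CD': 'subject_Code', 'REGISTRATION_DATE': 'START_DATE'}
cols = ['person_ID', 'Test_CODE', 'START_DATE', 'END_DATE',
        'subject_Code', 'subject_DESCRIPTION']

# Hypothetical stand-ins for two of the CSV files (old and new layouts)
sources = [
    "person_ID,Test_CODE,REGISTRATION_DATE,subject_CD,subject_DESCRIPTION,subject_TYPE\n"
    "id1,aa,2015-01-01,sub1,desc,type1\n",
    "person_ID,Test_CODE,START_DATE,END_DATE,subject_Code,subject_DESCRIPTION,subject_Indicator\n"
    "id3,cc,2017-01-01,2017-08-06,sub3,desc3,type3\n",
]

# Read each "file", unify the column names, then force the final column set
frames = [pd.read_csv(io.StringIO(text)).rename(columns=d).reindex(columns=cols)
          for text in sources]

# All frames now share the same columns, so they stack cleanly
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The first row comes from a file without END_DATE, so that cell is missing in the combined frame, while the second row keeps its value.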
Sample data:

print (df1)
  person_ID Test_CODE REGISTRATION_DATE subject_CD subject_DESCRIPTION  \
0       id1        aa        2015-01-01       sub1                desc   

  subject_TYPE  
0        type1 

print (df2)
  person_ID Test_CODE REGISTRATION_DATE subject_Code subject_DESCRIPTION  \
0       id2        bb        2017-01-01         sub1               desc2   

  subject_Indica  
0          type2 

print (df3)
  person_ID Test_CODE  START_DATE    END_DATE subject_Code  \
0       id3        cc  2017-01-01  2017-08-06         sub3   

  subject_DESCRIPTION subject_Indicator  
0               desc3             type3 


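If a column listed in cols does not exist in the data, I believe that column will come back as NA. Sorry, I can't try it right now because I'm not at my office desk.

@TheGreat - Sure, it returns NaNs, so reindex works here. Test data has been added to the solution so you can see it: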
d = {'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'}
cols = ['person_ID','Test_CODE','START_DATE','END_DATE',
        'subject_Code','subject_DESCRIPTION']


dfs = []
for df in [df1, df2, df3]:
    # Unify the column names, then keep exactly `cols`; missing ones become NaN
    renamed = df.rename(columns=d).reindex(columns=cols)
    dfs.append(renamed)

# Stack the aligned frames into the final output
df = pd.concat(dfs, ignore_index=True)
print (df)
  person_ID Test_CODE  START_DATE    END_DATE subject_Code subject_DESCRIPTION
0       id1        aa  2015-01-01         NaN         sub1                desc
1       id2        bb  2017-01-01         NaN         sub1               desc2
2       id3        cc  2017-01-01  2017-08-06         sub3               desc3
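
The alternative raised in the question (append first, add END_DATE afterwards) should also work, because pd.concat aligns frames on column names and fills the gaps with NaN. A minimal sketch, assuming two made-up frames `a` and `b` that stand in for files without END_DATE:

```python
import pandas as pd

d = {'subject_CD': 'subject_Code', 'REGISTRATION_DATE': 'START_DATE'}
cols = ['person_ID', 'Test_CODE', 'START_DATE', 'END_DATE',
        'subject_Code', 'subject_DESCRIPTION']

# Hypothetical inputs: neither frame has an END_DATE column
a = pd.DataFrame({'person_ID': ['id1'], 'Test_CODE': ['aa'],
                  'REGISTRATION_DATE': ['2015-01-01'], 'subject_CD': ['sub1'],
                  'subject_DESCRIPTION': ['desc'], 'subject_TYPE': ['type1']})
b = pd.DataFrame({'person_ID': ['id2'], 'Test_CODE': ['bb'],
                  'REGISTRATION_DATE': ['2017-01-01'], 'subject_Code': ['sub1'],
                  'subject_DESCRIPTION': ['desc2'], 'subject_Indicator': ['type2']})

# Rename first so equivalent columns line up; concat then takes the column union
stacked = pd.concat([a.rename(columns=d), b.rename(columns=d)], ignore_index=True)

# A single reindex at the end adds END_DATE (all NaN) and drops the extra columns
result = stacked.reindex(columns=cols)
print(result)
```

Reindexing once after the concat gives the same column layout as reindexing every frame inside the loop; the per-file version has the advantage that unwanted columns are dropped before the frames are stacked.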