Python: how to dynamically add missing columns when reading a batch of CSV files
I have 12 CSV files to read into a single output dataframe. The columns I want in the final output dataframe are spread across multiple files, as shown below.
Columns in files 1-8:
person_ID, Test_CODE, REGISTRATION_DATE, subject_CD, subject_DESCRIPTION, subject_TYPE
Columns in file 9:
person_ID, Test_CODE, REGISTRATION_DATE, subject_Code, subject_DESCRIPTION, subject_Indicator
Columns in files 10-12:
person_ID, Test_CODE, START_DATE, END_DATE, subject_Code, subject_DESCRIPTION, subject_Indicator
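Based on my understanding of the domain, I know that START_DATE and REGISTRATION_DATE mean the same thing. Similarly, subject_CD and subject_Code mean the same thing.
So I tried to rename these columns as follows: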
import pandas as pd

dfs = []
for f in files:
    # The inputs are CSV files, so read them with read_csv rather than read_excel
    df = pd.read_csv(f, sep=",", low_memory=False)
    print(df.columns)
    df1 = df[df.columns.intersection(['person_ID','Test_CODE','REGISTRATION_DATE','subject_CD','subject_DESCRIPTION'])].rename(columns={'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'})
    dfs.append(df1)
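However, I am not sure how to add a column dynamically, because END_DATE is missing from files 1-9. I just want an END_DATE column with no data in it. Only once that column exists can I append all the input dataframes and get the final output dataframe.
Alternatively, is it possible to append the dataframes on their common columns and then simply add an END_DATE column to the final output dataframe after appending?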
I would like my final dataframe to have the columns shown below.
Columns in the final output dataframe:
person_ID, Test_CODE, START_DATE, END_DATE, subject_Code, subject_DESCRIPTION
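I think you can first use rename and then reindex. reindex returns only the columns passed in the list, and any column from the list that does not exist in the DataFrame is added and filled with missing values: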
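import pandas as pd

# Map the older column names onto the names used in the final output
d = {'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'}
# Columns wanted in the final output, in order
cols = ['person_ID','Test_CODE','START_DATE','END_DATE',
        'subject_Code','subject_DESCRIPTION']
dfs = []
for f in files:
    # The inputs are CSV files, so read them with read_csv rather than read_excel
    df = pd.read_csv(f, sep=",", low_memory=False)
    print(df.columns)
    # Columns missing from this file (e.g. END_DATE) are created and filled with NaN
    df1 = df.rename(columns=d).reindex(columns=cols)
    dfs.append(df1)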
List comprehension alternative:
dfs = [pd.read_csv(f, sep=",", low_memory=False).rename(columns=d).reindex(columns=cols)
       for f in files]
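The snippets above leave `files` undefined. As a minimal end-to-end sketch, the list can be built with a glob over a directory. The temporary directory, the file names (`file_01.csv` and so on), and the one-row contents are hypothetical stand-ins for the real 12 files; only `d` and `cols` come from the answer.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Rename mapping and target column order, as in the answer.
d = {'subject_CD': 'subject_Code', 'REGISTRATION_DATE': 'START_DATE'}
cols = ['person_ID', 'Test_CODE', 'START_DATE', 'END_DATE',
        'subject_Code', 'subject_DESCRIPTION']

# Hypothetical sample files, one per column layout described in the question.
samples = {
    'file_01.csv': ('person_ID,Test_CODE,REGISTRATION_DATE,subject_CD,'
                    'subject_DESCRIPTION,subject_TYPE\n'
                    'id1,aa,2015-01-01,sub1,desc,type1\n'),
    'file_09.csv': ('person_ID,Test_CODE,REGISTRATION_DATE,subject_Code,'
                    'subject_DESCRIPTION,subject_Indicator\n'
                    'id2,bb,2017-01-01,sub1,desc2,type2\n'),
    'file_10.csv': ('person_ID,Test_CODE,START_DATE,END_DATE,subject_Code,'
                    'subject_DESCRIPTION,subject_Indicator\n'
                    'id3,cc,2017-01-01,2017-08-06,sub3,desc3,type3\n'),
}

with tempfile.TemporaryDirectory() as tmp:
    for name, text in samples.items():
        Path(tmp, name).write_text(text)

    # Collect every CSV in the directory; sorting keeps the row order stable.
    files = sorted(Path(tmp).glob('*.csv'))

    # Rename to the common names, then force the same column set on each frame.
    dfs = [pd.read_csv(f).rename(columns=d).reindex(columns=cols) for f in files]
    combined = pd.concat(dfs, ignore_index=True)

print(combined)
```

In real use, the temporary directory would be replaced by the folder that holds the CSV files.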
Test data:
print (df1)
  person_ID Test_CODE REGISTRATION_DATE subject_CD subject_DESCRIPTION  \
0       id1        aa        2015-01-01       sub1                desc
  subject_TYPE
0        type1
print (df2)
  person_ID Test_CODE REGISTRATION_DATE subject_Code subject_DESCRIPTION  \
0       id2        bb        2017-01-01         sub1               desc2
  subject_Indicator
0             type2
print (df3)
  person_ID Test_CODE START_DATE   END_DATE subject_Code  \
0       id3        cc 2017-01-01 2017-08-06         sub3
  subject_DESCRIPTION subject_Indicator
0               desc3             type3
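The test frames are shown only as printed output, so here is a sketch of how they could be constructed to reproduce the example. Storing the dates as plain strings is an assumption, since the printed output does not show the dtypes.

```python
import pandas as pd

# Layout of files 1-8
df1 = pd.DataFrame({
    'person_ID': ['id1'],
    'Test_CODE': ['aa'],
    'REGISTRATION_DATE': ['2015-01-01'],  # assumed to be a string
    'subject_CD': ['sub1'],
    'subject_DESCRIPTION': ['desc'],
    'subject_TYPE': ['type1'],
})

# Layout of file 9
df2 = pd.DataFrame({
    'person_ID': ['id2'],
    'Test_CODE': ['bb'],
    'REGISTRATION_DATE': ['2017-01-01'],
    'subject_Code': ['sub1'],
    'subject_DESCRIPTION': ['desc2'],
    'subject_Indicator': ['type2'],
})

# Layout of files 10-12
df3 = pd.DataFrame({
    'person_ID': ['id3'],
    'Test_CODE': ['cc'],
    'START_DATE': ['2017-01-01'],
    'END_DATE': ['2017-08-06'],
    'subject_Code': ['sub3'],
    'subject_DESCRIPTION': ['desc3'],
    'subject_Indicator': ['type3'],
})
```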
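If a column listed in cols does not exist in the data, I believe that column will come back as NA. Sorry, I can't try it out because I'm not at my office desk.
@TheGreat - Sure, it returns NaNs. I added the test data to the solution so you can see the result of reindex.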
d = {'subject_CD':'subject_Code','REGISTRATION_DATE':'START_DATE'}
cols = ['person_ID','Test_CODE','START_DATE','END_DATE',
        'subject_Code','subject_DESCRIPTION']
dfs = []
for df in [df1, df2, df3]:
    # df = pd.read_csv(f, sep=",", low_memory=False)
    # print(df.columns)
    df1 = df.rename(columns=d).reindex(columns=cols)
    dfs.append(df1)
# Stack the aligned frames into one dataframe with a fresh index
df = pd.concat(dfs, ignore_index=True)
print (df)
  person_ID Test_CODE START_DATE   END_DATE subject_Code subject_DESCRIPTION
0       id1        aa 2015-01-01        NaN         sub1                desc
1       id2        bb 2017-01-01        NaN         sub1               desc2
2       id3        cc 2017-01-01 2017-08-06         sub3               desc3
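On the second idea in the question (append first, add END_DATE afterwards), a possible variant is to rename each frame, let `pd.concat` align on column names (by default it keeps the union of columns and fills the gaps with NaN), and call reindex once on the result. The frames `early` and `late` below are hypothetical miniature stand-ins for the two kinds of file. This is a sketch of an alternative, not the approach used in the answer.

```python
import pandas as pd

d = {'subject_CD': 'subject_Code', 'REGISTRATION_DATE': 'START_DATE'}
cols = ['person_ID', 'Test_CODE', 'START_DATE', 'END_DATE',
        'subject_Code', 'subject_DESCRIPTION']

# Hypothetical stand-in for a file without END_DATE (files 1-8 layout).
early = pd.DataFrame({
    'person_ID': ['id1'], 'Test_CODE': ['aa'],
    'REGISTRATION_DATE': ['2015-01-01'], 'subject_CD': ['sub1'],
    'subject_DESCRIPTION': ['desc'], 'subject_TYPE': ['type1'],
})
# Hypothetical stand-in for a file with END_DATE (files 10-12 layout).
late = pd.DataFrame({
    'person_ID': ['id3'], 'Test_CODE': ['cc'],
    'START_DATE': ['2017-01-01'], 'END_DATE': ['2017-08-06'],
    'subject_Code': ['sub3'], 'subject_DESCRIPTION': ['desc3'],
    'subject_Indicator': ['type3'],
})

# Rename first so equivalent columns share one name before stacking.
renamed = [frame.rename(columns=d) for frame in (early, late)]

# concat keeps the union of all columns, including the unwanted extras.
stacked = pd.concat(renamed, ignore_index=True)

# A single reindex drops the extras and guarantees END_DATE exists.
final = stacked.reindex(columns=cols)

# Even if no input had END_DATE, reindex would still create it as all-NaN.
only_early = early.rename(columns=d).reindex(columns=cols)

print(final)
```

Reindexing each frame before concatenating, as in the answer, avoids carrying the unwanted columns through the concat, which may matter for large files.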