Python 将多个csv文件导入熊猫并将其合并到一个数据帧中

Python 将多个csv文件导入熊猫并将其合并到一个数据帧中,python,pandas,csv,dataframe,concat,Python,Pandas,Csv,Dataframe,Concat,I有多个csv文件(每个文件包含N行(例如,1000行)和43列) Client_ID Client_Name Pointer_of_Bins Date Weight C0000001 POLYGONE TI006093 12/03/2019 0.5 C0000001 POLYGONE TI006093 12/03

I有多个csv文件(每个文件包含N行(例如,1000行)和43列)

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
I希望将文件夹中的多个csv文件读取到pandas中,并将它们合并到一个数据帧中

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
不过,我还没有弄明白

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
问题在于,数据帧的最终输出(即
frame=pd.concat(li,axis=0,ignore_index=True)
)将所有列(即43列)合并为一列(见附图)

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
选定行和列的示例(文件一)

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
选定行和列的示例(文件二) 客户端ID客户端名称指针日期权重 C0000001 POLYGONE TI006093 2019年4月22日1.5 C0000001 ALDI TI006098 2019年4月22日0.7 C0000001 ALDI TI006098 2019年4月22日2.4 C0000001 ALDI TI006898 2019年4月24日1.9

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
预期的输出如下所示(合并可能包含数千行和多列的多个文件,因为附加数据只是一个示例,而实际的csv文件可能包含数千行和每个文件中超过45列)

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
以下是我迄今为止所做的工作:

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
import pandas as pd
import glob
path = r'C:\Users\alnaffakh\Desktop\doc\Data\data2\Test'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, sep='delimiter', index_col=None, header=0)
  # df = pd.read_csv(filename, sep='\t', index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)

解决方案 您可以使用递归连接
.csv
文件内容。
事实上,我看到您使用了它,您对
concat
的应用对我来说似乎很好。尝试调查您读取的各个数据帧。列可以合并为单个列的唯一方法是,如果没有提到正确的分隔符

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
将熊猫作为pd导入
dfs=列表()
对于文件名中的文件名:
df=pd.read\u csv(文件名)
dfs.append(df)
frame=pd.concat(dfs,轴=0,忽略索引=True)
df.head()
使用虚拟数据的示例 由于可用的虚拟数据还不是文本格式,所以我只使用了一些我制作的虚拟数据

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
将熊猫作为pd导入
从io导入字符串io#字符串到数据帧转换所需
file1=“”
第1列第2列第3列第4列第5列
1 ABCDE AE10 CD11 BC101F
2 GHJKL GL20 JK22 HJ202M
3 MNPKU MU30 PK33 NP303V
4 OPGHD OD40 GH44 PG404E
5 BHZKL BL50 ZK55 HZ505M
"""
file2=“”
第1列第2列第3列第4列第5列
1 AZYDE AE10 CD11 BC100F
2 GUFKL GL24 JK22 HJ207M
3 MHPRU MU77 PK39 NP309V
4 OPGBB OE90 GH41 PG405N
5 BHTGK BL70 ZK53 HZ508Z
"""
将数据作为单个数据帧加载,然后连接它们

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
df1=pd.read\u csv(StringIO(file1),sep='\t')
df2=pd.read\u csv(StringIO(文件2),sep='\t')
打印(pd.concat([df1,df2],忽略索引=True))
输出

               Client_ID    Client_Name  Pointer_of_Bins   Date        Weight
                C0000001       POLYGONE      TI006093     12/03/2019   0.5
                C0000001       POLYGONE      TI006093     12/03/2019   0.6
                C0000001       POLYGONE      TI006093     12/03/2019   1.4
                C0000001       POLYGONE      TI006897     14/03/2019   2.9   
                C0000001       POLYGONE      TI006093     22/04/2019   1.5
                C0000001       ALDI          TI006098     22/04/2019   0.7
                C0000001       ALDI          TI006098     22/04/2019   2.4
                C0000001       ALDI          TI006898     24/04/2019   1.9                                                             
   Col1   Col2  Col3  Col4    Col5
0     1  ABCDE  AE10  CD11  BC101F
1     2  GHJKL  GL20  JK22  HJ202M
2     3  MNPKU  MU30  PK33  NP303V
3     4  OPGHD  OD40  GH44  PG404E
4     5  BHZKL  BL50  ZK55  HZ505M
5     1  AZYDE  AE10  CD11  BC100F
6     2  GUFKL  GL24  JK22  HJ207M
7     3  MHPRU  MU77  PK39  NP309V
8     4  OPGBB  OE90  GH41  PG405N
9     5  BHTGK  BL70  ZK53  HZ508Z

摆脱
sep='delimeter'
。现在的代码是,将所有数据帧作为一列读取。@QuangHoang,感谢您的回复,但是如果我删除它,我会收到此错误(UnicodeDecodeError:'utf-8'编解码器无法解码位置8中的字节0xc7:无效的继续字节),请共享一些虚拟数据。我支持@QuangHoang提到的内容:您需要去掉
sep='delimiter'
,或者使用文件中使用的实际分隔符。这就是为什么我建议你共享一些虚拟数据(可能是4行只有5列),所以我们可以测试它。你可以考虑使用DASK。@ Wisamhasan感谢你使数据可用。但是,请将两个csv文件的前5列和前4行分别粘贴到问题陈述中,作为csv文件中的示例数据。然后也提供你所期望的。您的数据需要是最小的和可复制的。最好不要共享数据文件。@Wisamhasan感谢您提供的行和列。但是,我要求将数据作为文本粘贴到问题描述中。这使您的问题易于复制。请创建一个代码块,并将文件1和文件2中的数据列(子集)粘贴到该代码块中。谢谢,但附加的代码并不能解决问题。该代码用于回答您提到的问题。我留下了另一条关于检查实际使用的分隔符的评论。看起来您的问题存在于数据中。请检查使用了什么分隔符,然后再使用它。@如果满足“是”,您可以使用带有文件源标识符的多索引。但我建议不要将文件名作为多索引的一部分。文件名可能很长,当它们被命名时,您可能无法控制它们的命名逻辑。相反,如果您只想跟踪数据的来源,我建议您添加另一列“source”,并在那里填写文件名。您始终可以通过这种方式有条件地提取特定于文件的数据。但是考虑到你的索引单数尽可能长,除非它是绝对必要的。