How do I merge .csv files under certain conditions in Python?

python python-2.7 file python-3.x csv


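I have several .csv files in a directory. I want to iterate over these files and, subject to some conditions, merge them into a single .csv file.

Every file follows the same naming convention: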

Date       Name   City   Supervisor

2015-01-01_Steve__Boston_Steven.csv
2015-10-03_Michael_Dallas_Thomas.csv
2015-02-10_John_NewYork_Michael.csv

Each file contains a single column, and the number of rows differs from file to file:

2015-01-01_Steve__Boston_Steven.csv

Sales
100
20
3
100
200

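Because the header ("Sales" in this example) may be named differently in each file, I want to skip the first row and always start from the second row.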

I would like to end up with a final table containing the following information:

Sales Name     City    Supervisor
100   Steve    Boston  Steven
20    Steve    Boston  Steven
30    Steve    Boston  Steven
3     Steve    Boston  Steven
100   Steve    Boston  Steven
200   Steve    Boston  Steven
1     Michael  Dallas  Thomas
2     Michael  Dallas  Thomas
1     John     NewYork Michael
2     John     NewYork Michael
3     John     NewYork Michael

I am new to Python, so apologies for any inconvenience.

What I have tried:

import pandas as pd
from os import listdir

source_path, dst_path = '/oldpath', '/newpath'

files = [f for f in listdir(source_path) if f.endswith('.csv')]

def combining_data(files):
    df_list = []
    for filename in files:
        df_list.append(pd.read_csv(filename))

combining_data(files)

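Unfortunately, this does not produce the desired output.

This takes several steps. First, I would parse the CSV file name to get the name, city, and supervisor. From the look of it, you can use split on the file name to get those values. Then you have to read each file and append its rows to a new CSV. Using pandas is also a bit of overkill here; you can use the csv module: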

import csv
import os

source_path = '/oldpath'  # directory holding the input files, as in the question

files = [f for f in os.listdir(source_path) if f.endswith('.csv')]

# Written for Python 2.7, where the csv module expects files opened in binary mode.
with open(os.path.join(source_path, 'new_csv.csv'), 'wb') as new:
    writer = csv.writer(new)
    writer.writerow(['Sales', 'Name', 'City', 'Supervisor'])  # write the header for the new csv
    for f in files:
        # split the file name on _ after removing .csv; drop empty pieces left by a doubled "__"
        split = [part for part in f[:-4].split('_') if part]
        name = split[1]  # extract the name
        city = split[2]  # extract the city
        supervisor = split[3]  # extract the supervisor
        with open(os.path.join(source_path, f), 'rb') as readfile:
            reader = csv.reader(readfile)
            reader.next()  # skip the header of the file being read (Python 2 spelling)
            for row in reader:
                writer.writerow([row[0], name, city, supervisor])  # write to the new csv
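To make the split step concrete, here is a small sketch that parses the three file names from the question. The helper name parse_filename is hypothetical, not part of the answer. It drops empty pieces so that the doubled underscore in 2015-01-01_Steve__Boston_Steven.csv does not shift the fields by one position.

```python
import os


def parse_filename(filename):
    """Return (date, name, city, supervisor) parsed from a file name.

    Hypothetical helper: assumes the Date_Name_City_Supervisor.csv
    convention described in the question.
    """
    stem, _ext = os.path.splitext(filename)
    # Empty strings appear when two underscores are adjacent; drop them.
    parts = [p for p in stem.split('_') if p]
    if len(parts) != 4:
        raise ValueError('unexpected file name: %r' % filename)
    date, name, city, supervisor = parts
    return date, name, city, supervisor


for fn in ['2015-01-01_Steve__Boston_Steven.csv',
           '2015-10-03_Michael_Dallas_Thomas.csv',
           '2015-02-10_John_NewYork_Michael.csv']:
    print(parse_filename(fn))
```

Raising on an unexpected number of pieces is a design choice for this sketch; it surfaces a file that does not follow the convention instead of silently mislabelling its rows.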


With pandas:

import pandas as pd
import os

df=pd.DataFrame(columns=['Sales','Name','City','Supervisor'])
files = [f for f in os.listdir('.') if f.startswith('2015')]

for a in files:
    df1 = pd.read_csv(a, header=None, skiprows=1, names=['Sales'])
    len1 = len(df1.index)
    f = [b for b in a.split('_') if b]
    l2, l3 = [f[1], f[2], f[3][:-4]], ['Name','City','Supervisor']
    for b,c in zip(l2,l3):
        ser = pd.Series(data=[b for _ in range(len1)],index=range(len1))
        df1[c]=ser
    df = pd.concat([df,df1],axis=0)
df.index = range(len(df.index))
df.to_csv('new_csv.csv', index=None)
df

Output:


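Thanks for your suggestion. However, I get this error:

    TypeError Traceback (most recent call last) in ()
          1 with open(os.path.join(source_path, 'new_csv.csv'), 'wb') as new:
          2     writer = csv.writer(new)
    ----> 3     writer.writerow(['Sales','Name','City','Supervisor'])  # write the header for the new csv
          4     for f in files:
          5         split = f[:-4].split('_')  # split the filename on _, while removing the .csv
    TypeError: a bytes-like object is required, not 'str'

Where do you take the second row instead of the header?

@Mamba - Hmm, it looks like you are using Python 3.5. In that case, you need to use [b'Sales', b'Name', b'City', b'Supervisor']. You can also try opening the file without binary mode: open(os.path.join(source_path, 'new_csv.csv'), 'w'). It will then write in text mode instead of binary mode. The second row is picked up when I call reader.next(), which skips one line while reading the file. Also, if you plan to write in text mode, you can read the files in text mode as well: with open(os.path.join(source_path, f), 'r') as readfile: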
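Putting the text-mode advice from this comment together, a Python 3 version of the csv-module answer might look like the sketch below. The function name merge_sales_files and its parameters are assumptions made for illustration. On Python 3 the csv module writes str, so both files are opened in text mode, with newline='' as the csv documentation recommends, and reader.next() becomes the built-in next(reader).

```python
import csv
import os


def merge_sales_files(source_path, output_path):
    """Merge every one-column CSV in source_path into output_path (Python 3).

    Hypothetical helper: assumes file names follow the
    Date_Name_City_Supervisor.csv convention from the question.
    """
    files = sorted(f for f in os.listdir(source_path) if f.endswith('.csv'))
    with open(output_path, 'w', newline='') as new:
        writer = csv.writer(new)
        writer.writerow(['Sales', 'Name', 'City', 'Supervisor'])
        for f in files:
            full_path = os.path.join(source_path, f)
            # Do not read the output file back in if it lives in source_path.
            if os.path.abspath(full_path) == os.path.abspath(output_path):
                continue
            # Drop empty pieces left by a doubled underscore.
            parts = [p for p in f[:-4].split('_') if p]
            name, city, supervisor = parts[1], parts[2], parts[3]
            with open(full_path, 'r', newline='') as readfile:
                reader = csv.reader(readfile)
                next(reader, None)  # skip the header; the default covers an empty file
                for row in reader:
                    if row:  # ignore blank lines
                        writer.writerow([row[0], name, city, supervisor])
```

With the paths from the question, a call could look like merge_sales_files('/oldpath', '/newpath/new_csv.csv').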


@Scratch'N'Purr Yours is 16 lines. With pandas it is 17 lines :) You can still make it more compact.

Check my solution using pandas.

Thanks for providing your answer. However, using Python 3.5 I get an error:

    TypeError Traceback (most recent call last) in ()
          2 df1 = pd.read_csv(a)
          3 f = filter(None, a.split('_'))
    ----> 4 n, c, s = f[1], f[2], f[3][:-4]
          5 len1 = len(df01.index)
    TypeError: 'filter' object is not subscriptable

@Mamba I removed the filter() part that was causing the error on Python 3. Please try again.

OK, it now returns ValueError: cannot convert float NaN to integer

@Mamba OK, I modified the code. Please try again. Let it be float for now. Let's get it working first, then we can throw a party :P

:) Cool, now it returns the DataFrame. Let me check whether it is correct :-) But thanks a lot!!
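The TypeError in this exchange comes from a Python 3 change: filter() returns a lazy iterator rather than a list, so it cannot be indexed. A minimal sketch of two ways around it, using one of the file names from the question (the variable names are chosen here for illustration):

```python
a = '2015-01-01_Steve__Boston_Steven.csv'

# On Python 3 this is a lazy filter object, so indexing it fails.
lazy = filter(None, a.split('_'))
try:
    lazy[1]
except TypeError as exc:
    print('indexing failed:', exc)

# Option 1: materialise the iterator into a list.
parts_from_filter = list(filter(None, a.split('_')))

# Option 2: a list comprehension, as the pandas answer uses.
parts_from_comprehension = [b for b in a.split('_') if b]

name = parts_from_comprehension[1]
city = parts_from_comprehension[2]
supervisor = parts_from_comprehension[3][:-4]  # strip the trailing ".csv"
print(name, city, supervisor)
```

Both options behave the same on Python 2 and Python 3, which is useful for a question tagged with both versions.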
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 22.6 ms
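One of the comments notes that the pandas answer could still be made more compact. A possible sketch, not the answerer's own code: collect one DataFrame per file, let pandas broadcast the name, city, and supervisor across the rows, and concatenate once at the end. The function name merge_with_pandas and its source_dir parameter are assumptions for illustration.

```python
import os

import pandas as pd

COLUMNS = ['Sales', 'Name', 'City', 'Supervisor']


def merge_with_pandas(source_dir='.'):
    """Return one DataFrame built from the 2015*.csv files in source_dir."""
    frames = []
    for fname in sorted(os.listdir(source_dir)):
        if not (fname.startswith('2015') and fname.endswith('.csv')):
            continue
        # Drop empty pieces left by a doubled underscore in the file name.
        parts = [p for p in fname[:-4].split('_') if p]
        # skiprows=1 discards whatever header the file has; names sets our own.
        df1 = pd.read_csv(os.path.join(source_dir, fname),
                          header=None, skiprows=1, names=['Sales'])
        # Scalars are broadcast to every row of the frame.
        frames.append(df1.assign(Name=parts[1], City=parts[2], Supervisor=parts[3]))
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)
```

The result can then be written out with merge_with_pandas('.').to_csv('new_csv.csv', index=False). Concatenating once, instead of inside the loop, also avoids copying the growing DataFrame on every iteration.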