Python script to concatenate all the files in a directory into one file

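I have written the following script to concatenate all the files in a directory into one single file.

Can this be optimized in terms of:

1. idiomatic Python
2. time

Here is the snippet: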

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")

You can iterate over the lines of a file object directly, without reading the whole file into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)

Use shutil.copyfileobj to copy the data:

import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

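shutil reads from the readfile object in chunks and writes them straight to the outfile file object. Do not use readline() or an iteration buffer, since you do not need the overhead of finding line endings.

Use the same mode for both reading and writing; this is especially important when using Python 3. I have used binary mode for both here.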
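To illustrate why the modes have to match, here is a minimal sketch for Python 3. It uses in-memory io.BytesIO and io.StringIO objects as stand-ins for files opened in binary and text mode (an assumption made only to keep the example self-contained); the variable names are hypothetical.

```python
import io
import shutil

# In-memory stand-ins for files: BytesIO behaves like a binary-mode file,
# StringIO like a text-mode file.
binary_source = io.BytesIO(b"line one\nline two\n")
text_target = io.StringIO()

try:
    # copyfileobj reads bytes from the source and passes them to write(),
    # which a text-mode target rejects in Python 3.
    shutil.copyfileobj(binary_source, text_target)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# With matching modes the same call succeeds.
binary_source.seek(0)
binary_target = io.BytesIO()
shutil.copyfileobj(binary_source, binary_target)
```

After running this, mismatch_error holds the TypeError raised by the mixed-mode copy, while binary_target contains an exact copy of the source bytes.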

There is no need to use that many variables:

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")

The fileinput module provides a natural way to iterate over multiple files:

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
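The fragment above assumes that outfile is already open. A self-contained version might look like the sketch below; the function name concatenate_with_fileinput and its parameters are hypothetical, and sorting the inputs and excluding the output file are extra choices not made in the answer.

```python
import fileinput
import glob
import os


def concatenate_with_fileinput(pattern, outfilename):
    """Write every line of every file matching pattern into outfilename."""
    out_path = os.path.abspath(outfilename)
    # Collect the inputs before opening the output, and leave the output
    # file out in case an older copy matches the same pattern.
    sources = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != out_path
    )
    with open(outfilename, "w") as outfile:
        if sources:  # an empty list would make fileinput read stdin instead
            with fileinput.input(files=sources) as lines:
                for line in lines:
                    outfile.write(line)
    return sources
```

For example, concatenate_with_fileinput("*.txt", "all.txt") would merge the .txt files in the current directory. Because this goes line by line, it is convenient when each line needs processing, but it carries the line-splitting overhead that the shutil approach avoids.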

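Using Python 2.7, I did some benchmark testing of outfile.write(infile.read()) against shutil.copyfileobj(readfile, outfile).

I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

Comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: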

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The final size of the output file was 2620 MB in normal read mode versus 2578 MB in binary read mode.
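The answer does not show its timing code, so the following is only a sketch of how such a comparison could be set up in Python 3. The helper names copy_with_read, copy_with_shutil, and time_call are hypothetical, and the numbers it produces will depend on the platform, disk, and file sizes.

```python
import shutil
import time


def copy_with_read(sources, outfilename, binary=True):
    """Concatenate by reading each file whole and writing it out."""
    read_mode, write_mode = ("rb", "wb") if binary else ("r", "w")
    with open(outfilename, write_mode) as outfile:
        for name in sources:
            with open(name, read_mode) as readfile:
                outfile.write(readfile.read())


def copy_with_shutil(sources, outfilename, binary=True):
    """Concatenate by streaming each file in chunks via shutil.copyfileobj."""
    read_mode, write_mode = ("rb", "wb") if binary else ("r", "w")
    with open(outfilename, write_mode) as outfile:
        for name in sources:
            with open(name, read_mode) as readfile:
                shutil.copyfileobj(readfile, outfile)


def time_call(func, *args, repeats=3, **kwargs):
    """Return the average wall-clock seconds over `repeats` calls."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        total += time.perf_counter() - start
    return total / repeats
```

A call such as time_call(copy_with_shutil, filenames, "all.txt", binary=False) would then give the average for one combination. Repeated runs over the same files are affected by the operating system's file cache, so a single comparison should be read with caution.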

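I was curious to check the performance further, so I used the answers from Martijn Pieters and Stephen Miller.

I tried binary and text modes, both with shutil and without shutil. I was merging 270 files.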

Text mode:

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Running times for text mode:
Shutil - 20.47757601737976
Normal - 13.718038082122803

Binary mode:

def using_shutil_text(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Shutil - 20.161773920059204
Normal - 17.327500820159912

It looks like shutil performs about the same in both modes, while plain read()/write() is faster in text mode than in binary mode.

OS: macOS 10.14 Mojave, MacBook Air 2017.

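Time-optimized? Use "cat *.txt > all.txt" :)

It would be even better if it weren't restricted to reading one line at a time.

@Marcin, that is correct. I used to think this was a cool solution, until I saw Martijn Pieters's shutil.copyfileobj humdinger.

Interesting. What platform was that?

I roughly work on two platforms: Linux Fedora 16 on various nodes, or Windows 7 Enterprise SP1 with an Intel Core(TM)2 Quad CPU Q9550, 2.83 GHz. I think it was the latter.

Why is it important to use the same mode for writing and reading?

@JuanDavid: because shutil will use .read() calls on one file object and .write() calls on the other, passing the data read from one to the other. If one is opened in binary mode and the other in text mode, you are passing incompatible data (binary data to a text file, or text data to a binary file).

The code here doesn't work for CSV files, dang. But it did give me some good inspiration on how to do this with CSV. I'm relatively new to Python.

@bretts: the contents of the files should not matter; perhaps your CSV files are missing a final newline separator, or use a different separator format?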
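If a missing final newline is indeed the problem, one possible workaround is sketched below. It copies in binary mode with shutil.copyfileobj and appends a newline whenever a source file does not end with one. The function name concatenate_files is hypothetical, the check assumes lines end in "\n" (which also covers "\r\n"), and repeated CSV header rows are not handled.

```python
import glob
import os
import shutil


def concatenate_files(pattern, outfilename):
    """Concatenate files matching pattern, keeping each file's last line intact."""
    out_path = os.path.abspath(outfilename)
    sources = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != out_path
    )
    with open(outfilename, "wb") as outfile:
        for name in sources:
            if os.path.getsize(name) == 0:
                continue  # nothing to copy, and no final byte to inspect
            with open(name, "rb") as readfile:
                shutil.copyfileobj(readfile, outfile)
                # Look at the final byte; add a newline if the file lacked one
                # so the next file's first row does not join onto this one.
                readfile.seek(-1, os.SEEK_END)
                if readfile.read(1) != b"\n":
                    outfile.write(b"\n")
    return sources
```

For example, concatenate_files("*.csv", "all.csv") would merge the CSV files in the current directory while keeping the last row of each file on its own line.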