Python script to concatenate all the files in a directory into one file
I have written the following script to concatenate all the files in the directory into one single file. Can this be optimized in terms of idiomatic Python and time? Here is the snippet:
import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")
You can iterate over the lines of a file object directly, without reading the whole file into memory:
with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
Use shutil.copyfileobj to copy the data:
import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
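shutil reads from the readfile object in chunks and writes them directly to the outfile file object. Do not use readline() or iterate over the file line by line, because you do not need the overhead of finding line endings.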
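As a rough illustration of what "reading in chunks" means, the loop below does by hand approximately what shutil.copyfileobj does for you. The function name copy_in_chunks and the 64 KiB chunk size are assumptions made for this sketch; it is not shutil's actual source.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy all data from file object src to file object dst.

    Reads at most chunk_size units at a time, so memory use stays
    bounded no matter how large the source is. Hypothetical helper,
    shown only to illustrate the idea behind shutil.copyfileobj.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes/str object signals end of file
            break
        dst.write(chunk)


# Usage with in-memory file objects (real files opened with 'rb'/'wb'
# behave the same way).
source = io.BytesIO(b"line one\nline two\n")
target = io.BytesIO()
copy_in_chunks(source, target, chunk_size=4)
```

No line endings are searched for anywhere in the loop, which is the overhead the answer recommends avoiding.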
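Use the same mode for both reading and writing; this is especially important when using Python 3. I have used binary mode for both here.
No need to use that many variables: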
with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
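Both snippets above open the input and the output in matching modes. The following minimal sketch shows what happens under Python 3 when the modes are mixed, as in the question's code (text input, binary output). The helper name try_mixed_modes and the temporary file names are made up for the demonstration.

```python
import os
import shutil
import tempfile


def try_mixed_modes():
    """Copy from a text-mode file into a binary-mode file under Python 3.

    Returns the type of the exception that was raised, or None if the
    copy unexpectedly succeeded. Hypothetical helper for illustration.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "in.txt")
        dst_path = os.path.join(tmp, "out.txt")
        with open(src_path, "w") as f:
            f.write("hello\n")
        try:
            # 'r' yields str objects, but a file opened with 'wb'
            # only accepts bytes-like objects.
            with open(src_path, "r") as readfile, open(dst_path, "wb") as outfile:
                shutil.copyfileobj(readfile, outfile)
        except TypeError as exc:
            return type(exc)
    return None


result = try_mixed_modes()
```

Under Python 3 the write fails with a TypeError because str data is handed to a binary file; opening both files with 'r'/'w' or both with 'rb'/'wb' avoids this.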
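The fileinput module provides a natural way to iterate over multiple files:
import fileinput

with open(outfilename, 'w') as outfile:
    for line in fileinput.input(glob.glob("*.txt")):
        outfile.write(line)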
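A slightly fuller version of the fileinput approach might look like the sketch below. The function name concat_with_fileinput, the sorting of the file list, and the handling of an empty match are choices made for this example rather than part of the answer.

```python
import fileinput
import glob
import os


def concat_with_fileinput(pattern, outfilename):
    """Concatenate all files matching pattern into outfilename, line by line.

    Returns the number of input files that were copied.
    """
    out_abs = os.path.abspath(outfilename)
    # Build the list first and drop the output file so it is never read.
    filenames = [f for f in sorted(glob.glob(pattern))
                 if os.path.abspath(f) != out_abs]
    with open(outfilename, 'w') as outfile:
        if not filenames:
            # fileinput.input() with an empty list falls back to stdin,
            # so stop here and leave the output file empty.
            return 0
        with fileinput.input(files=filenames) as lines:
            for line in lines:
                outfile.write(line)
    return len(filenames)
```

Calling concat_with_fileinput('*.txt', 'all.txt') would then merge every .txt file in the current directory. This still reads one line at a time, so it is convenient rather than fast.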
Using Python 2.7, I ran a rough benchmark of
outfile.write(infile.read())
vs
shutil.copyfileobj(readfile, outfile)
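I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.
When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: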
outfile.write, binary mode: 43 seconds, on average.
shutil.copyfileobj, normal mode: 27 seconds, on average.
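The final size of the output file was 2620 MB in normal read mode vs 2578 MB in binary read mode.
I was curious to look further into performance, so I used the answers of Martijn Pieters and Stephen Miller. I tried binary and text modes, with shutil and without shutil, merging 270 files.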
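The answer does not show how the timings were taken. One plausible way to measure functions like the ones in this answer is sketched here; the helper time_call and its repeats parameter are assumptions, not the author's method.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls.

    Taking the minimum reduces the influence of other activity on the
    machine. Hypothetical helper for illustration only.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Example call with a cheap stand-in function; a merge function that
# takes the output file name would be passed the same way.
elapsed = time_call(sorted, list(range(1000)))
```

Note that repeated runs over the same files benefit from the operating system's file cache, so the first run is often slower than the later ones.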
Text mode -
def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())
Running times for text mode -
Shutil - 20.47757601737976
Normal - 13.718038082122803
Binary mode -
def using_shutil_text(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())
Running times for binary mode -
Shutil - 20.161773920059204
Normal - 17.327500820159912
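It looks like shutil performs about the same in both modes, while the plain outfile.write(readfile.read()) approach is faster in text mode than in binary mode.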
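OS: macOS 10.14 Mojave, MacBook Air 2017.

Time optimization? Use "cat *.txt > all.txt" :)
This would be even better if it were not limited to reading one line at a time.
@Marcin, that is correct. I used to think this was a cool solution, until I saw Martijn Pieters' shutil.copyfileobj humdinger.
Interesting. What platform was that?
I work on roughly two platforms: Linux Fedora 16 (various nodes) or Windows 7 Enterprise SP1 with an Intel Core(TM)2 Quad CPU Q9550, 2.83 GHz. I think it was the latter.
Why is it important to use the same mode for writing and reading?
@JuanDavid: because shutil uses .read() calls on one file object and .write() calls on the other, passing the data it reads from one to the other. If one is opened in binary mode and the other in text mode, you are passing incompatible data (binary data to a text file, or text data to a binary file).
The code here doesn't work for CSV files, dang. But it did give me some good inspiration on how to do this with CSV. I'm relatively new to Python.
@bretts: the contents of the files should not matter; perhaps your CSV files are missing a final newline separator, or use a different separator format?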
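If a missing final newline is indeed the cause, as the last comment suggests, one possible workaround is to append a newline after any file that does not already end with one. The sketch below assumes that diagnosis; the function name concat_ensuring_newlines is hypothetical, and it reads each file fully into memory for simplicity. It does not deal with repeated CSV header rows.

```python
import glob
import os


def concat_ensuring_newlines(pattern, outfilename):
    """Concatenate files matching pattern, keeping each file's last line separate.

    Works in binary mode for both reading and writing, and adds b"\n"
    after any non-empty file that does not end with a newline.
    """
    out_abs = os.path.abspath(outfilename)
    # Build the list before opening the output, and never read the output.
    filenames = [f for f in sorted(glob.glob(pattern))
                 if os.path.abspath(f) != out_abs]
    with open(outfilename, 'wb') as outfile:
        for filename in filenames:
            with open(filename, 'rb') as readfile:
                data = readfile.read()  # whole file in memory: fine for small files
            outfile.write(data)
            if data and not data.endswith(b"\n"):
                # Prevent the next file's first row from being glued
                # onto this file's last row.
                outfile.write(b"\n")
```

For example, concat_ensuring_newlines('*.csv', 'all.csv') would merge the CSV files in the current directory without joining rows across file boundaries.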