Python 3: concatenate multiple files and ignore header & trailer records


python python-3.x unix


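I want to concatenate multiple files, skipping the header and trailer records in every file, so that the column names (always on line 2 of each file) appear only once in the final file.
I can do the concatenation, but how do I skip the header and trailer and keep the column names only once? Each file has about 25 million records.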

File1.txt

    H,ABC,file1.txt
    Name,address,zipcode
    Rick,ABC,123
    Tom,XYZ,456
    T,2 -----------------record count

File2.txt

    H,ABC,file2.txt
    Name,address,zipcode
    Jerry,ABC,123
    T,1

File3.txt

    H,ABC,file3.txt
    Name,address,zipcode
    John,ABC,123
    Mike,XYZ,456
    T,2

***Final Output:***

    Name,address,zipcode
    Rick,ABC,123
    Tom,XYZ,456
    Jerry,ABC,123
    Harry,XYZ,456
    John,ABC,123
    Mike,XYZ,456

Code:
1) You can slightly modify what you are already doing, along these lines:

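    filenames = ['File1.txt', 'File2.txt', 'File3.txt']
    with open('output_file', 'w') as outfile:
        # Write the column names once, up front.
        outfile.write("Name,address,zipcode\n")
        for fname in filenames:
            with open(fname) as infile:
                for line in infile:
                    # Copy every line except the trailer record and the repeated column names.
                    # "Trailer record" is the marker text assumed here; adapt it to the real data.
                    if "Trailer record" not in line and "Name,address,zipcode" not in line:
                        outfile.write(line)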
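
Since the question says the header is always line 1, the column names are always line 2, and the trailer is the last line, the filtering can also be done by position rather than by matching text. The following is only a sketch under that assumption; the function name `concat_skip_header_trailer` is made up for illustration. It reads in binary mode to avoid decoding every line and keeps one line of lookahead so that the last line of each file is never written.

```python
def concat_skip_header_trailer(filenames, out_path):
    """Concatenate files, dropping line 1 (header) and the last line (trailer)
    of each file, and writing line 2 (column names) only once.

    Assumes every file follows the layout shown in the question.
    """
    wrote_columns = False
    with open(out_path, "wb") as outfile:
        for fname in filenames:
            with open(fname, "rb") as infile:
                next(infile, None)             # line 1: header record, discarded
                columns = next(infile, None)   # line 2: column names
                if columns is None:
                    continue                   # file has fewer than two lines
                if not wrote_columns:
                    outfile.write(columns)
                    wrote_columns = True
                previous = None
                for line in infile:
                    # Write each line only once its successor has been seen,
                    # so the final line (the trailer) is never written.
                    if previous is not None:
                        outfile.write(previous)
                    previous = line
```

It would be called as `concat_skip_header_trailer(['File1.txt', 'File2.txt', 'File3.txt'], 'output_file')`. Only one line per file is held in memory at a time, so file size should not matter for memory use.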

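2) Alternatively, if you are familiar with the Unix grep command, you can use that instead. You can call it directly from Python with the sh library and chain the commands.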
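
As a rough illustration of the grep idea, here is a sketch that uses the standard-library `subprocess` module instead of the sh library, so nothing extra has to be installed. The patterns `^H,` and `^T,` are assumptions based on the sample data in the question, and the function name `grep_concat` is hypothetical. Note that any data row whose first field is exactly `H` or `T` would also be dropped, so position-based filtering is safer if that can happen.

```python
import subprocess


def grep_concat(filenames, out_path, column_line="Name,address,zipcode"):
    """Write column_line once, then append every input line that grep keeps."""
    with open(out_path, "w") as outfile:
        outfile.write(column_line + "\n")
        outfile.flush()  # make sure the column names land before grep writes
        result = subprocess.run(
            [
                "grep",
                "-h",                           # no "filename:" prefixes
                "-v",                           # keep lines matching none of the patterns
                "-e", "^H,",                    # assumed header prefix
                "-e", "^T,",                    # assumed trailer prefix
                "-e", "^" + column_line + "$",  # repeated column-name line
                "--",
                *filenames,
            ],
            stdout=outfile,
        )
    # grep exits with 0 (lines selected), 1 (none selected) or >1 (error).
    if result.returncode > 1:
        raise RuntimeError(f"grep failed with exit code {result.returncode}")
```

The default `column_line` contains no regular-expression metacharacters; a different column line might need escaping before being used as a pattern.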
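Using Python: here is a very simple approach that uses pandas to concatenate the TXT files and write them to a single TXT file with to_csv:

    import pandas as pd
    from glob import glob

    # Sort for a deterministic order; glob() alone returns files in arbitrary order.
    files = sorted(glob('./addr_files/*.txt'))

    # skiprows=1 drops the header record, so line 2 becomes the column names;
    # skipfooter=1 drops the trailer record (this requires the Python engine).
    frames = [pd.read_csv(f, skiprows=1, skipfooter=1, engine='python') for f in files]

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    pd.concat(frames, ignore_index=True).to_csv('./addr_files/output.txt', index=False)

Output:

    (py35) ~/Desktop/so/addr_files $ cat output.txt
    Name,address,zipcode
    Rick,ABC,123
    Tom,XYZ,456
    Jerry,ABC,123
    Harry,XYZ,456
    John,ABC,123
    Mike,XYZ,456

Using GNU sed: here is another option, which streams the content of every file named file*.txt into a new file (all.txt), skipping the lines you want to omit: specifically the first, second and last line of each file. Given how large the files are, you may want to add a few printf statements for debugging, so you can see which file is being processed as the script loops over the files.

    #!/usr/bin/env bash
    # Print the column names (line 2 of the first file) to the output file.
    sed -n 2p file1.txt > all.txt
    # Stream every file, minus its first, second and last line, to the output file.
    for f in file*.txt; do
        sed '1d;2d;$d' "$f" >> all.txt
    done

Output:

    (base) user@host ~/Desktop/so/concat $ cat all.txt
    Name,address,zipcode
    Rick,ABC,123
    Tom,XYZ,456
    Jerry,ABC,123
    Harry,XYZ,456
    John,ABC,123
    Mike,XYZ,456

Thanks, this works, but concatenating four 5 GB files takes more than an hour to finish.
Hmm. Are you on Linux? We could look at sed and see whether each file can be streamed into a single file that way.
Please update your question with the exact header and footer records so we know how to build the regular expression.
Or... rather than storing the data in CSV files, put it in a proper database.
My solution requires concatenating the files, not loading them into tables. Thanks.
Given the size of the files, have you considered storing the data in a proper database? Even something as simple as SQLite?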