Python: opening files through the Hadoop FileSystem API and skipping the first line


python hadoop filesystems


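I am using Python to read files stored in HDFS.

Each file has a header line, and I am trying to merge the files. However, the header of every file ends up in the merged output.

Is there a way to skip the header of the second and all later files?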

    hadoop = sc._jvm.org.apache.hadoop
    conf = hadoop.conf.Configuration()
    fs = hadoop.fs.FileSystem.get(conf)

    src_dir = "/mnt/test/"
    out_stream = fs.create(hadoop.fs.Path(dst_file), overwrite)

    files = []
    for f in fs.listStatus(hadoop.fs.Path(src_dir)):
        if f.isFile():
            files.append(f.getPath())

    for file in files:
        in_stream = fs.open(file)
        hadoop.io.IOUtils.copyBytes(in_stream, out_stream, conf, False)

For now I have solved the problem with the logic below, but I would like to know whether there is a better, more efficient solution. Thanks for your help.

    for idx, file in enumerate(files):
        if debug:
            print("Appending file {} into {}".format(file, dst_file))

        # remove the header from every file after the first
        if idx > 0:
            # local path derived from the file's URI
            local_path = '/' + str(file).replace(':', '')
            file_str = ""
            with open(local_path, 'r+') as f:
                for line_idx, line in enumerate(f):
                    if line_idx > 0:
                        file_str = file_str + line

            # rewrite the source file without its header
            with open(local_path, "w+") as f:
                f.write(file_str)

        in_stream = fs.open(file)   # InputStream object; copy the stream
        try:
            # False means don't close out_stream
            hadoop.io.IOUtils.copyBytes(in_stream, out_stream, conf, False)
        finally:
            in_stream.close()

What you are doing now is repeatedly appending to a string, which is a fairly slow process. Why not write directly to the output file as you read?

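    for file_idx, file in enumerate(files):
        with open(...) as f_out, open(...) as f_in:
            for line_num, line in enumerate(f_in):
                if file_idx == 0 or line_num > 0:
                    f_out.write(line)

If each file fits in memory, you can also call readline once to skip the first line and then copy the rest with readlines: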

    for file_idx, file in enumerate(files):
        with open(...) as f_out, open(...) as f_in:
            if file_idx != 0:
                f_in.readline()
            f_out.writelines(f_in.readlines())
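The same idea can be sketched as a self-contained function for files on a local filesystem. This is only an illustration under a few assumptions: the header is exactly the first line of each file, every file ends with a newline, and the name `merge_skipping_headers` and its arguments are hypothetical. It opens the output once and uses `shutil.copyfileobj` to copy the remainder of each file in chunks, so no file has to fit in memory.

```python
import shutil


def merge_skipping_headers(src_paths, dst_path):
    """Concatenate files, keeping the header line of the first file only.

    Hypothetical helper: assumes the header is the first line of each file.
    """
    # Binary mode avoids decoding and newline translation while copying.
    with open(dst_path, "wb") as f_out:
        for file_idx, src_path in enumerate(src_paths):
            with open(src_path, "rb") as f_in:
                if file_idx != 0:
                    f_in.readline()  # discard this file's header line
                shutil.copyfileobj(f_in, f_out)  # stream the rest in chunks
```

Unlike the loops above, this does not modify the source files; it only reads them.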

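Hi Adran, I understand the solution; however, I am trying to overwrite the existing files. Does the suggested solution work in that case?

If you tell it open(..., "w"), it will overwrite the file. Note that the + character next to the w is not needed for this: "w+" also truncates the file and merely adds read access, while appending requires "a". Opening with "w" does not delete the file, it just replaces its contents completely, so if you need to, make sure to delete the old file with something like os.remove after moving it to HDFS.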
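For reference, here is a small sketch of how Python's built-in `open` modes behave on a local file. The temporary path and the variable names are made up for the demonstration. Both "w" and "w+" truncate an existing file, and only "a" appends.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:       # "w": create the file or truncate it
    f.write("first\n")

with open(path, "w+") as f:      # "w+": also truncates; "+" adds read access
    f.write("second\n")
    f.seek(0)
    after_w_plus = f.read()      # "first\n" is gone at this point

with open(path, "a") as f:       # "a": keep existing content, write at the end
    f.write("third\n")

with open(path) as f:
    after_append = f.read()

os.remove(path)                  # delete the file once it is no longer needed
```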