Splitting a large file into smaller files based on a randomly occurring pattern in Python


I have a large file with the following contents:

Column1 column2 column3
 345     367    Ramesh
 456     469    Ramesh
 300     301    Ramesh
 298     390    Naresh
 123     125    Suresh
 394     305    Suresh
 ......
 .....

Now I want to split this file into smaller files based on the names in column 3, like this:

File 1: Ramesh.txt

column1 column2 column3
345     367      Ramesh
456     469      Ramesh
300     301      Ramesh

File 2: Naresh.txt

column1 column2 column3
298     390     Naresh

File 3: Suresh.txt

Column1 column2 column3
123     125      Suresh
394     305      Suresh

And so on. I wrote the following Python code, which works as long as column 3 is sorted:

def split_file(file1):
    source=open(file1)
    l=[]  # names already seen in column 3
    header=0
    header_line=""
    file_count=0
    for line in source:
        line=line.rstrip()
        a=line.split()
        if header==0:
            # the first line is the header; remember it for every output file
            header_line=line
            header+=1
        else:
            if a[-1] not in l:
                # new name: close the previous output file and start the next one
                l.append(a[-1])
                file_count+=1
                if file_count>1:
                    dest.close()
                dest=open(a[-1],'a')
                dest.write(header_line+"\n"+line+"\n")
            else:
                dest.write(line+"\n")
    source.close()
    dest.close()

Now, my question is how to modify this code so that it also works when column 3 is not sorted. For example:

Column1 column2 column3
345     367    Ramesh
123     125    Suresh
456     469    Ramesh
298     390    Naresh
300     301    Ramesh
394     305    Suresh

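Should I build a dictionary with the names in column 3 as keys and generated variables (handles to the output files) as values, and then use this dictionary to look up the file each time the script encounters a key? Any suggestions would be appreciated.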

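Instead of opening and closing a file pointer on every line, you can keep them all open until the work is done.

First, create a dictionary for the file pointers: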

fps = {}

Then, inside the loop that iterates over the data file, create the file pointer if it does not exist yet:

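if a[-1] not in fps:
    fps[a[-1]] = open(a[-1], 'a')
    fps[a[-1]].write(header_line + "\n")  # each new file starts with the header
fps[a[-1]].write(line + "\n")  # line was rstrip()'d in the loop, so restore the newline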

Then, once the loop has finished, you can close the file pointers:

for f in fps.values():
    f.close()
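Putting those three pieces together, a minimal sketch might look like the following. It assumes whitespace-separated columns, a single header line, and output files named after the last column with a ".txt" suffix. The function name split_by_last_column and the out_dir parameter are made up for illustration and are not part of the answer above. Unlike the snippets above, it opens the output files in "w" mode so that re-running the script does not append duplicate rows.

```python
import os


def split_by_last_column(path, out_dir="."):
    """Split a whitespace-separated file into one file per last-column value.

    Hypothetical helper: the name and the out_dir parameter are assumptions.
    """
    fps = {}  # maps each name in the last column to its open output file
    try:
        with open(path) as source:
            header_line = next(source, None)  # first line is the header
            if header_line is None:  # empty input: nothing to split
                return []
            for line in source:
                fields = line.split()
                if not fields:  # skip blank lines
                    continue
                key = fields[-1]
                if key not in fps:
                    # first time this name appears: create its file, write the header
                    fps[key] = open(os.path.join(out_dir, key + ".txt"), "w")
                    fps[key].write(header_line)
                fps[key].write(line)
    finally:
        # close every output file, even if an error interrupted the loop
        for f in fps.values():
            f.close()
    return sorted(fps)
```

Because the handles stay open for the whole run, the order of the names in column 3 no longer matters. One caveat: a file with a very large number of distinct names could hit the operating system's limit on open files.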

This is a prime use case for the pandas DataFrame groupby() function:

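import pandas as pd

# r"\s+" treats any run of whitespace as the column separator
data = pd.read_csv('dat.csv', delimiter=r"\s+")
# group by the column name itself (not a one-element list) so val is a plain string
for val, df in data.groupby('column3'):
    df.to_csv(val + ".csv", sep='\t', index=False)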

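The steps are relatively simple:

1) Read the file with the correct delimiter (\s+ means any amount of whitespace).

2) Loop over the groupby object, which yields tuples of the form (common value, DataFrame for that value).

2.1) For each DataFrame, write a file with the corresponding name. (index=False just says that we do not want the index printed in the new file.)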
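To see what the loop in step 2 iterates over, here is a small sketch that reads an inline sample instead of 'dat.csv' and collects the groups into a dictionary. The sample data and the names sample and groups are assumptions made for illustration only.

```python
import io

import pandas as pd

# Inline sample standing in for 'dat.csv' (note the unsorted third column).
sample = io.StringIO(
    "Column1 column2 column3\n"
    "345 367 Ramesh\n"
    "123 125 Suresh\n"
    "456 469 Ramesh\n"
    "298 390 Naresh\n"
)

data = pd.read_csv(sample, sep=r"\s+")

# Each iteration yields (common value, DataFrame of the rows sharing that value).
groups = {name: frame for name, frame in data.groupby("column3")}

for name, frame in groups.items():
    print(name, len(frame))
```

Each frame keeps the original column headers, which is why to_csv writes the header line into every output file without extra work.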

You can create a new file handle for each value of column3 and then write all matching lines to that file, for example:

import os

def split_file(path):
    file_handles = {}  # a map of file handles based on the last param
    target_path = os.path.dirname(path)  # get the location of the passed file path
    with open(path, "r") as f:  # open our input file for reading
        header = next(f)  # reads the first line to use as a header in all files
        for line in f:
            index = line.rfind(" ")  # replace with \t if you use tab-delimited files
            value = line[index+1:].rstrip()  # get the last value
            if not value:  # invalid entry, skip
                continue
            if value not in file_handles:  # we haven't started writing to this file
                # create a new file with the value of the last column
                handle = open(os.path.join(target_path, value + ".txt"), "a")
                handle.write(header)  # write the header to our new file
                file_handles[value] = handle  # store it to our file handles list
            else:
                handle = file_handles[value]
            handle.write(line)  # write the current line to the designated handle
    for handle in file_handles.values():  # close our output file handles
        handle.close()

You can then call it with a simple:

split_file("your_file.dat")

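If you pass a path to the file, it will even respect that path and write the output files into the same directory as the input.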
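As a quick illustration of that path handling, the sketch below isolates the two os.path calls the function relies on. The helper name output_path is hypothetical and exists only for this example.

```python
import os


def output_path(input_path, value):
    """Return where the file for `value` would be written (illustrative helper)."""
    target_dir = os.path.dirname(input_path)  # "" when no directory is given
    return os.path.join(target_dir, value + ".txt")


# A bare file name keeps the output in the current working directory.
print(output_path("your_file.dat", "Ramesh"))
# A path with a directory part puts the output next to the input file.
print(output_path(os.path.join("data", "your_file.dat"), "Ramesh"))
```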

For Unix-based operating systems, I would suggest a short command-line solution