Python: splitting a large file by a field and giving each output file a unique identifier


python bash unix awk sed


I have a (huge) file like this:

test_file

a   b
a   c
a   d
b   a
b   b
a   g
a   j
c   g

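I am trying to split it into multiple files based on the first field. However, repeated but non-consecutive values should start a new file (that is, a new file should be created every time the value in field 1 changes relative to the previous line). So, in my example above, the lines: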

a   g
a   j

should go to a new file, separate from the one containing:

a   b
a   c
a   d

In the end I would have 4 files, each representing one change in field 1:

a.1, b.2, a.3, c.4

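In fact, it would also work if the identifiers were a.1, b.1, a.2, c.1, or any other kind of suffix. What I want to avoid is the second group of a values replacing or overwriting the file produced from the first group of a values. I also do not want all the a values appended to the same file.

I know that: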

awk '{print > $1; close( $1)}' test_file

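will split on the first column, but it also sends the results to the same file whenever the key is the same.

To avoid this, I was thinking of adding another field that really is different for each group. Something like:

test_file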

1    a  b
1    a  c
1    a  d
2    b  a
2    b  b
3    a  g
3    a  j
4    c  g

and then do:

 awk '{print > $1"_"$2; close( $1"_"$2) }' test_file

But I can't find a way to generate that field, because I don't think associative arrays will work in this case. Any ideas?

It sounds like you might want this:

awk '$1!=prev{ close(out); out="File_"$1"."(++cnt); prev=$1 } { print > out }' test_file

but your question isn't entirely clear.
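For readers less familiar with awk, the logic of that one-liner can be sketched in Python. This is only an illustration: the function name awk_output_names is made up here, the sample rows are taken from the question, and skipping blank lines is an assumption that the awk command itself does not make. It computes which output name each line would get rather than writing any files.

```python
def awk_output_names(lines):
    """Return (output_name, line) pairs following the awk one-liner's logic."""
    pairs = []
    prev = None  # first field of the previous line
    cnt = 0      # number of runs seen so far (awk's ++cnt)
    out = None   # name of the current output file
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if fields[0] != prev:
            # Field 1 changed, so a new output name is started
            cnt += 1
            out = "File_{}.{}".format(fields[0], cnt)
            prev = fields[0]
        pairs.append((out, line))
    return pairs


sample = ["a   b", "a   c", "a   d", "b   a", "b   b", "a   g", "a   j", "c   g"]
for name, line in awk_output_names(sample):
    print(name, line)
```

With the sample data, the second run of a lines gets its own name (File_a.3) instead of sharing File_a.1 with the first run.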

It really is easier in awk, isn't it?

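#!/usr/bin/env python3
files_count = 1
first_col = None
current_out = None
with open('maria.txt') as maria:
    for line in maria:
        line = line.rstrip()
        columns = line.split()
        if not columns:
            continue  # skip blank lines, which have no first field
        if columns[0] != first_col:
            # Field 1 changed: close the previous file and start a new one
            if current_out is not None:
                current_out.close()
            first_col = columns[0]
            current_out = open(first_col + '.' + str(files_count), 'w')
            files_count += 1
        print(line, file=current_out)
# Close the last output file, if any was opened
if current_out is not None:
    current_out.close()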

In Python 2.x, this can be done with groupby, as follows:

import csv
from itertools import groupby

with open('huge.txt', 'rb') as f_input:
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)

    for index, (k, g) in enumerate(groupby(csv_input, lambda x: x[0]), start=1):
        with open('{}.{}'.format(k, index), 'wb') as f_output:
            csv.writer(f_output, delimiter=' ').writerows(g)

If you are using Python 3.x:

import csv
from itertools import groupby

with open('huge.txt', 'r', newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)

    for index, (k, g) in enumerate(groupby(csv_input, lambda x: x[0]), start=1):
        with open('{}.{}'.format(k, index), 'w', newline='') as f_output:
            csv.writer(f_output, delimiter=' ').writerows(g)
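The question also mentions that per-key suffixes (a.1, b.1, a.2, c.1) would be acceptable. A minimal Python 3 sketch of that variant follows. It assumes whitespace-separated fields, groups the raw lines rather than parsed csv rows so the original spacing is written back unchanged, and drops blank lines. The function name split_by_runs and the output_dir parameter are hypothetical, introduced only for this example.

```python
import os
from collections import defaultdict
from itertools import groupby


def split_by_runs(input_path, output_dir="."):
    """Write each run of consecutive lines sharing field 1 to its own file.

    Files are named <key>.<n>, where n counts the runs seen so far for
    that key. Returns the file names in the order they were written.
    """
    runs_per_key = defaultdict(int)
    written = []
    with open(input_path) as f_input:
        # Drop blank lines so that line.split()[0] is always defined
        lines = (line for line in f_input if line.strip())
        for key, group in groupby(lines, key=lambda line: line.split()[0]):
            runs_per_key[key] += 1
            name = "{}.{}".format(key, runs_per_key[key])
            with open(os.path.join(output_dir, name), "w") as f_output:
                f_output.writelines(group)  # lines are written unchanged
            written.append(name)
    return written
```

Calling split_by_runs('huge.txt') on the sample data would be expected to produce a.1, b.1, a.2 and c.1. Because only one output file is open at a time and the input is read lazily, memory use should stay small even for a huge file.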

Great, thanks! :) :)
