awk: split a single file into multiple files with specific file names

file awk sed

I have a raw file that contains data in the following format:

$ cat sample.txt
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

I want to split this file into separate files at each ">" character, and I know this character appears once every 5 lines. I can do that with:

awk 'NR%5==1{x="F"++i;}{print > x}' sample.txt

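The problem is that it correctly creates multiple files, but they are named F1, F2 and F3, with no extension. I want to save these separate files under the names given in the first line of each block, i.e. RUNX1, TFAP2A and TFAP2C, with the extension .pfm, so that the final files look like this: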

$ cat RUNX1.pfm
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]

$ cat TFAP2A.pfm
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]

and so on.

Thank you for taking the time to help me.

The following awk may help you with this.

awk '/^>/{if(file){close(file)};file=$2".pfm"} {print > file}'  Input_file

Adding a non-one-liner form with an explanation here:

awk '
/^>/{             ##Check whether the line starts with ">"; if so, do the following.
  if(file){       ##If the variable file is not empty, a previous output file is open.
    close(file)   ##close() is a built-in awk function; closing the previous output avoids having too many files open at once.
  }
  file=$2".pfm"   ##Set file to the 2nd field of the ">" line plus the ".pfm" extension.
}
{
  print > file    ##Print the current line to the file named in the variable file.
}
' Input_file      ##The input file name goes here.
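To try this approach without touching real data, a minimal sketch could look like the following. The scratch directory, the shortened matrices and the extra "file" guard on the print rule are assumptions added for illustration; they are not part of the answer.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing .pfm files are overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two shortened records in the format from the question (values truncated).
cat > sample.txt <<'EOF'
>MA0002.1   RUNX1
A  [ 10 12 4 ]
C  [  2  2 7 ]
>MA0003.1   TFAP2A
A  [  0   0   0 ]
C  [  0 185 185 ]
EOF

# On each header line: close the previous output, then build the new name
# from field 2. Every line (header included) is written to the current file.
# The "file" guard skips any lines that precede the first header.
awk '/^>/{ if (file) close(file); file = $2 ".pfm" } file { print > file }' sample.txt

ls *.pfm
head -n 1 RUNX1.pfm TFAP2A.pfm
```

Note that the name passed to close() has to be exactly the string used after ">", otherwise nothing is closed. Also, once a file has been closed, writing to the same name again with ">" truncates it, so a repeated name keeps only its last record.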

awk approach:

awk 'NR%5==1{ fn=$2".pfm" }fn{ print > fn}' file

Or the same using the ">" marker:

awk '/^>/{ fn=$2".pfm" }fn{ print > fn}' file

Like this:

awk -v RS=">" 'NF{f=$2".pfm"; printf "%s%s", RS, $0 > f; close(f)}' file

To keep a new file when one with the same name has already been saved, use this:

awk -v RS=">" 'NF{a[$2]++; if(a[$2]>1) file=$2"."a[$2]; else file=$2; printf "%s%s", RS, $0 > (file".pfm"); close(file".pfm")}' file

For example, if TFAP2A.pfm was saved earlier, later files are saved as TFAP2A.2.pfm, TFAP2A.3.pfm, and so on.

Or simply

awk -v RS=">" 'NF{file=$2"."++a[$2]; printf "%s%s", RS, $0 > (file".pfm"); close(file".pfm")}' file

if you want to save every file with a version number, e.g. abc.1.pfm, abc.2.pfm.
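With RS=">" awk treats everything between two ">" characters as one record, and the default field splitting also splits on newlines, so $2 is the name from the header line. The sketch below only prints what awk sees per record; the two tiny records are made up for illustration. It also shows that the (empty) text before the first ">" arrives as a record of its own, which is worth guarding against, for example with an NF condition.

```shell
# Feed two tiny records through awk and report what each record looks like.
out=$(printf '>MA0002.1 RUNX1\nA [ 1 2 ]\n>MA0003.1 TFAP2A\nA [ 3 4 ]\n' |
    awk -v RS='>' '{ printf "record %d: NF=%d name=[%s]\n", NR, NF, $2 }')
printf '%s\n' "$out"
```

Record 1 has NF=0 and an empty name; records 2 and 3 carry RUNX1 and TFAP2A. Each record also keeps its trailing newline, so writing it with print (which appends another newline) leaves a blank line at the end of each output file, while printf "%s%s", RS, $0 does not.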

The one below takes care of names that are used more than once.

One-liner:

awk '/>/{f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; if(f!=p){ close(p); p=f}}{print >f}' file

More readable:

 awk '/>/{
           f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; 
           if(f!=p){ 
                close(p); 
                p=f
           }
          }
          {
            print >f
          }
     ' file
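The naming expression is the dense part of that script, so here is a small sketch that isolates it and prints only the generated names. The header lines are fed in directly, and the parentheses around a[$2] - 1 are added for readability; awk already evaluates the subtraction before the concatenation.

```shell
# Print the output name the script would choose for each header line.
names=$(printf '%s\n' '>MA0002.1 RUNX1' '>MA0003.3 TFAP2C' \
                      '>MA0003.1 TFAP2A' '>MA0003.3 TFAP2C' |
awk '/^>/{
    # a[$2]++ yields 0 (false) the first time a name is seen, so no suffix
    # is added; afterwards it is true and the suffix is the number of
    # earlier occurrences of that name.
    f = $2 (a[$2]++ ? "." (a[$2] - 1) : "") ".pfm"
    print f
}')
printf '%s\n' "$names"
```

For these four headers it prints RUNX1.pfm, TFAP2C.pfm, TFAP2A.pfm and TFAP2C.1.pfm, which matches the file listing in this answer.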

Input:

$ cat file
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

Execution:

$ awk '/>/{f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; if(f!=p){ close(p); p=f}}{print >f}' file

Output files:

$ ls *.pfm -1
RUNX1.pfm
TFAP2A.pfm
TFAP2C.1.pfm
TFAP2C.pfm

Contents of each file:

$ for i in *.pfm; do echo "Output File:$i"; cat "$i"; done
Output File:RUNX1.pfm
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]
Output File:TFAP2A.pfm
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
Output File:TFAP2C.1.pfm
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

Output File:TFAP2C.pfm
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

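This might work for you (GNU sed & csplit):

csplit can split the file on the pattern ^>, i.e. a ">" at the start of a line marks a new file. Two sed invocations are then used to rename the files: the first outputs each original file name together with its intended name, and the second prepends the move command and executes it. Put the files in a separate directory and check the results with head *.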
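The exact commands of this answer are not preserved here, so the following is only a rough sketch of the same idea and not the author's code. It assumes GNU csplit, uses a hypothetical "piece." prefix, and swaps the two sed calls for a plain shell loop that reads the name from each piece's header line.

```shell
#!/bin/sh
# Work in a separate scratch directory, as the answer recommends.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Two shortened records in the format from the question (values truncated).
cat > sample.txt <<'EOF'
>MA0002.1   RUNX1
A  [ 10 12 4 ]
C  [  2  2 7 ]
>MA0003.1   TFAP2A
A  [  0   0   0 ]
C  [  0 185 185 ]
EOF

# GNU csplit: -s is silent, -z drops empty pieces, -f sets the output prefix.
# '/^>/' starts a new piece at each header line; '{*}' repeats the pattern.
csplit -s -z -f piece. sample.txt '/^>/' '{*}'

for f in piece.*; do
    # Field 2 of the first line of each piece is the intended file name.
    name=$(awk 'NR==1{ print $2; exit }' "$f")
    mv -- "$f" "$name.pfm"
done

head -- *.pfm
```

As written, a repeated name would overwrite the earlier file during the mv step, so duplicate names need extra handling.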

IMHO, if there are a lot of files this could lead to a "too many open files" error, so close(filename) might add to the beauty of this.

@RavinderSingh13: Thanks Ravinder. Done that. (y)

Thank you very much for your solution. I have a question here. Sometimes the same name is used 2 to 3 times. For example, there are 3 lines starting with >MA0018.1 CREB1, >MA0018.2 CREB1 and >MA0018.3 CREB1; in that case only 1 file is saved. If a file with that name already exists, could it be saved as name.1.pfm and name.2.pfm, and so on? Thanks

@Batman Perfect. The middle solution is exactly what I wanted. Thanks a lot

@Newbie: There you go, buddy!!

This will not close even one open file, because you set file=$2 and you are writing to file".pfm", and file != file".pfm". Make it:

awk '/^>/{if(file)close(file); file=$2".pfm"}{print > file}' infile

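Thank you very much for the detailed explanation. I have a question here. Sometimes the same name is used 2 to 3 times. For example, there are 3 lines starting with >MA0018.1 CREB1, >MA0018.2 CREB1 and >MA0018.3 CREB1; in that case only 1 file is saved. If a file with that name already exists, could it be saved as name.1.pfm and name.2.pfm, and so on? Thanks

@Newbie, could you test it? I believe I tested a marker that occurs multiple times, and it appends the lines to the same file; please let me know.

@Newbie, all of your requirements should be in the question, not scattered across comments.

@RavinderSingh13 As I said earlier, I could not see that I had already accepted your answer. I did not see the green tick and thought the question still had no accepted answer, so based on the top answer and my comment on it ("this is exactly what I wanted to do"), I accepted that answer.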