Splitting a text file with awk

The sample text file looks like this:

ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   182 AA;  20675 MW;  B85D18AC3B1F0E75 CRC64;
     MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//
ID   Z4WXU8_9ACTN            Unreviewed;       203 AA.
AC   Z4WXU8;
AC   QWERDFV1;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   203 AA;  23224 MW;  35F1AE4342F6B3AC CRC64;
     MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
     MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//
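The actual file is 100 GB. Each record contains exactly one "ID" line and always starts with that "ID" line. Each record ends with "//".

There can be several "AC" lines. The first element of the first "AC" line must be used as the file name.

The file needs to be split into multiple files at each "//". Each file should be named after the text in the line that starts with AC.

So the output files look like this: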

Z4WTH3.txt

ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   182 AA;  20675 MW;  B85D18AC3B1F0E75 CRC64;
     MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//
Z4WXU8.txt

ID   Z4WXU8_9ACTN            Unreviewed;       203 AA.
AC   Z4WXU8;
AC   QWERDFV1;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   203 AA;  23224 MW;  35F1AE4342F6B3AC CRC64;
     MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//
Z9JHX1.txt

ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
     MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

The following awk may help you.

awk '/^ID/{close(filename);val=$2;sub(/_.*/,"",val);filename=val".txt"} {print > filename}'  Input_file
Solution 2: According to the OP, the file name should come from the AC line, so I am adding the following solution as well.

awk '/^ID/{close(filename);first=$0 ORS;next} /^AC/{val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}'  Input_file
Or, if the Input_file does not have an ID tag in every section, we can call the close function in the AC block instead, as follows:

awk '/^ID/{first=$0 ORS;next} /^AC/{close(filename);val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}'  Input_file
Explanation: adding an explanation of the solution as well:

awk '
/^ID/{                       ##If the line starts with ID, do the following:
  first=$0 ORS;              ##Store the current line plus ORS (output record separator) in the variable first.
  next}                      ##next skips all further statements for this line.
/^AC/{                       ##If the line starts with AC, do the following:
  close(filename);           ##Close the previously written file so that we do NOT run into "too many open files" errors.
  val=$2;                    ##Store the 2nd field of the current line in the variable val.
  sub(";","",val);           ##Use sub to remove the semicolon from val.
  filename=val".txt";        ##Build the output file name from val plus .txt.
  print first $0 > filename; ##Print the variable first and the current line to the output file.
  next                       ##next skips all further statements now.
}
{
  print > filename           ##Print every line that matches neither of the 2 conditions above to the output file.
}
'  Input_file                ##The name of the input file.
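
As written, the /^AC/ block runs for every AC line, so a record with several AC lines (like the first sample record) appears to switch to a new output file at the second AC line. The following is only a sketch of one way to name the file after the first AC line of each record, as the question asks. The file sample.dat and the variables head and out are made up for illustration and are not part of the answer above.

```shell
# Small sample: the first record has two AC lines.
cat > sample.dat <<'EOF'
ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
//
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
//
EOF

awk '
/^ID/ {                          # a new record starts here
    if (out != "") close(out)    # close the previous output file
    out = ""                     # no file name known yet for this record
    head = ""                    # buffer for lines seen before the first AC line
}
out == "" && /^AC/ {             # only the first AC line of a record picks the name
    val = $2
    sub(/;/, "", val)            # drop the semicolon after the accession
    out = val ".txt"
    printf "%s", head > out      # flush the buffered ID line first
}
out == "" { head = head $0 ORS; next }   # still waiting for the first AC line
{ print > out }                  # all remaining lines, including later AC lines
' sample.dat
```

With this sample, Z4WTH3.txt receives the whole first record and no Z12SDFG3.txt is created. Only one record's header is buffered at a time, so memory use should stay small even for a very large input.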

Another approach is to use RS to separate the records (GNU awk, because of the multi-character RS):

$ gawk '
BEGIN {
    RS=ORS="\n//\n"          # record separators
}
{
    for(i=1;i<=NF;i++)       # go thru each field in record
        if($i=="AC") {       # once AC found
            f=$(i+1) "TXT"   # next one is the filename
            sub(/;/,".",f)   # replace ; with .
            print > f        # print to file (multiple AC:s lead to multiple files)
            close(f)         # close to avoid problem with too many open files
                             # overwrites files when files with same name
        }
}' file
The resulting file:

$ cat Z9JHX1.TXT
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
     MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//
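
Since a file is overwritten when two records produce the same name, it may be worth checking for repeated accessions before splitting. The following is a rough sketch, not part of the answer above. The file dup.dat, its made-up records, and the output file duplicates.txt are purely illustrative, and it assumes the name comes from the first id on the first AC line of each record.

```shell
# Hypothetical sample: two records share the accession Z9JHX1.
cat > dup.dat <<'EOF'
ID   A_1   Unreviewed;   1 AA.
AC   Z9JHX1;
//
ID   A_2   Unreviewed;   1 AA.
AC   Z9JHX1;
//
ID   A_3   Unreviewed;   1 AA.
AC   Z4WXU8;
//
EOF

awk '
/^ID/ { seen_ac = 0 }                 # new record: no AC line seen yet
/^AC/ && !seen_ac {                   # look at the first AC line only
    seen_ac = 1
    id = $2
    sub(/;.*/, "", id)                # keep the text before the first semicolon
    if (count[id]++ == 1) print id    # report an id once, when it is seen a second time
}
' dup.dat > duplicates.txt

cat duplicates.txt
```

An empty duplicates.txt would suggest that no output file gets overwritten. Note that the count array holds every distinct id, so for a 100 GB input this check could itself need a noticeable amount of memory.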

With GNU awk for multi-char RS and RT:

awk -v RS='\n//\n' -v ORS= -F'[[:space:];]+' '{print $0 RT > ($7".txt")}' file
With any awk:

awk -F'[[:space:];]+' '
    $1 == "AC" { out = $2".txt" }
    { rec = rec $0 ORS }
    $0 == "//" {
        printf "%s", rec > out
        close(out)
        rec = ""
    }
' file

Please add the code you tried... This Q&A is close to what you need:

It works perfectly. But I need the file name from the line starting with "AC".

@SiyaDiya, please check my second solution and let me know if it helps you.

That works perfectly. Thank you very much. I would like to know one more thing. If the line starting with AC contains multiple ids separated by ";", such as "AC Z4WXU8;E9PWJ4;Q6ZQB3;Q8BWI6;", then a file has to be created for each id, all with the same content: Z4WXU8.txt, E9PWJ4.txt, Q6ZQB3.txt, Q8BWI6.txt and so on.

Would moving close to the /^AC/ block make any difference? If the order of ID and AC differs, files might stay open.

@JamesBrown, yes, right, Mr. James, that is why I asked the OP: if the actual file does not have /^ID/ lines, then we can certainly put close(filename) in the /^AC/ block.

I get an error with a 3 GB input file: "awk: program limit exceeded: maximum number of fields size=32767 FILENAME="uniprot_sprot.dat" FNR=289522 NR=289522"

It sounds like your data is not what you described. In some cases there are more fields than you showed. Also, it sounds like you are not using GNU awk, which as far as I know has no field limit. Good luck.
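
For the follow-up about one file per id, a sketch along the following lines might work. It is not from any of the answers. It assumes that every id on every AC line of a record should get its own copy of that record, and the file multi.dat and the names ids, part, names and cnt are invented for the example.

```shell
# Hypothetical sample: one record whose AC line lists three ids.
cat > multi.dat <<'EOF'
ID   Z4WXU8_9ACTN            Unreviewed;       203 AA.
AC   Z4WXU8;E9PWJ4; Q6ZQB3;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
//
EOF

awk '
/^AC/ {
    ids = $0
    sub(/^AC[ \t]+/, "", ids)                    # drop the line tag
    n = split(ids, part, /[ \t]*;[ \t]*/)        # split the rest on semicolons
    for (i = 1; i <= n; i++)
        if (part[i] != "") names[++cnt] = part[i]   # skip the empty piece after the last ;
}
{ rec = rec $0 ORS }                             # buffer the current record line by line
$0 == "//" {                                     # end of record: write one copy per id
    for (i = 1; i <= cnt; i++) {
        out = names[i] ".txt"
        printf "%s", rec > out
        close(out)                               # avoid running out of open files
    }
    rec = ""
    cnt = 0
}
' multi.dat
```

This reads the input line by line and buffers only one record at a time, so no single awk record should come anywhere near the field limit from the error message above, although that has not been tried on the real uniprot_sprot.dat. To use only the first AC line instead, the /^AC/ rule could be restricted to run while cnt is still 0.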