Bash: how to replace the comma before a specific string with \n in a csv file


I have a csv file, and I want to replace the comma after GCA.* with \n

Input:

ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1,ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio,ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio 
Desired output:

ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio 
My attempt:

sed 's/ASM*/\n&/' ordered_lines_per_genome.csv > assembly_report_table.csv
With GNU sed:

sed 's/\(GCA_[^,]*\),/\1\n/g' input.csv
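  • \(GCA_[^,]*\), : matches GCA_* followed by a comma. \(…\) defines a group that we can use later in the replacement string
  • Replacement \1\n : inserts the group from the match ("GCA_*") and appends a newline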
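To see the grouped substitution at work, here is a minimal sketch on a shortened, made-up record. The sample line and the function name split_after_gca are illustrative and not from the question, and the sketch assumes GNU sed, since \n in the replacement is a GNU extension.

```shell
# Hypothetical helper: break the line after every GCA_... field (GNU sed assumed).
split_after_gca() {
  sed 's/\(GCA_[^,]*\),/\1\n/g' "$@"
}

# Made-up, shortened record for illustration only.
printf 'ASM1,full,GCA_001.1,ASM2,full,PacBio\n' | split_after_gca
# prints:
# ASM1,full,GCA_001.1
# ASM2,full,PacBio
```

Records that lack a GCA_ field are not split by this pattern.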
To change the file in place:

sed -i 's/\(GCA_[^,]*\),/\1\n/g' input.csv
Or, to fix the command line from your comment:

sed 's/ASM[^,]*/\n&/g' input.csv
Or better, to avoid a trailing comma:

sed 's/,\(ASM[^,]*\)/\n\1/g' input.csv
This might be what you want:

$ sed 's/,/\n/16;P;D' file
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E.coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio
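  • s/,/\n/16 : replaces the 16th comma with a newline \n
  • P : prints the pattern space up to the first newline \n
  • D : deletes the printed text and restarts the cycle with the remaining text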
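The P;D loop described above can be tried on a small line with a smaller field count. This is a sketch only: the function name split_every and the sample data are made up for illustration, and GNU sed is assumed because of \n in the replacement.

```shell
# Hypothetical helper: start a new line after every N-th comma-separated field.
# Usage: split_every N [file...]   (GNU sed assumed)
split_every() {
  n=$1
  shift
  # Inside double quotes, \\n reaches sed as \n.
  sed "s/,/\\n/${n};P;D" "$@"
}

# Made-up line with 7 fields, split every 3 fields.
printf 'a,b,c,d,e,f,g\n' | split_every 3
# prints:
# a,b,c
# d,e,f
# g
```

With the question's data the count would be 16, which relies on every record having the same number of fields.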


It is based on a great answer to a different question.

You should remove the * and add g, for global:

sed 's/ASM/\n&/g' ordered_lines_per_genome.csv > assembly_report_table.csv
If you do not want the comma, you can use

sed 's/,ASM/\nASM/g' ordered_lines_per_genome.csv > assembly_report_table.csv
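Why the original attempt changes so little can be seen on a shortened, made-up line (the sample data is illustrative; GNU sed is assumed). In a regular expression, ASM* means "AS followed by zero or more M", and without the g flag only the first match on the line is replaced.

```shell
# Made-up sample line for illustration only.
line='ASM1,x,ASM2,y'

# Original attempt: only the first match (at the very start of the line)
# gets a newline in front of it, so the records stay on one line.
printf '%s\n' "$line" | sed 's/ASM*/\n&/'

# Fixed: match the comma plus the literal ASM, everywhere on the line.
printf '%s\n' "$line" | sed 's/,ASM/\nASM/g'
# prints:
# ASM1,x
# ASM2,y
```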
For fun, using awk:

awk 'BEGIN {RS="ASM"} NF {print "ASM" $0}' ordered_lines_per_genome.csv
If you do not want the comma at the end of the line, you can use

awk 'BEGIN {RS="[,]*ASM"} NF {print "ASM" $0}' ordered_lines_per_genome.csv
An awk solution:

$ awk -F, '{i=0;while((++i)<=NF)printf "%s%s", $i, ((!(i%16) || i==NF)? ORS : ",")}' mb.csv
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio 
With GNU awk (gensub is a gawk extension), this:

awk '{print gensub(",ASM","\nASM","g")}' ordered_lines_per_genome.csv > assembly_report_table.csv

should do it for you.
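For awk implementations without gensub, the same replacement can be sketched with the standard gsub function, which modifies $0 in place. The function name split_before_asm_awk and the sample line are made up for illustration.

```shell
# Hypothetical helper: put each ASM... record on its own line using plain awk.
split_before_asm_awk() {
  # gsub replaces every ",ASM" in the current line; "\n" in an awk
  # string literal is a newline.
  awk '{ gsub(/,ASM/, "\nASM"); print }' "$@"
}

# Made-up sample line for illustration only.
printf 'ASM1,x,ASM2,y\n' | split_before_asm_awk
# prints:
# ASM1,x
# ASM2,y
```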

Using Perl, and assuming the IDs start with ASM:

$ cat maryem.txt
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1,ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio,ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio
$ perl -pe ' s/([^^]ASM.+?,)/\n$1/g; s/^,//mg; ' maryem.txt
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio
$

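What have you tried? You tagged your question with awk and sed, so I would expect to see some awk or sed code in it that you need help with.

I tried the sed command, but sed 's/ASM*/\n&/' ordered_lines_per_genome.csv > assembly_report_table.csv did not work.

Please add your attempt to your question. (That may stop, and perhaps reverse, the downvotes.) We are happy to help you fix your code, but your code should be part of the question. Also, your question says you want a newline after GCA_*, but in your sample output the second newline does not follow such text. Can you clarify?

The title says you want to replace before a string (ASM, I assume), and your desired output and your attempt confirm that. The first line of the question should be changed; it mentions something unrelated.

This does not work on BSD or macOS. If you do not know whether the OP uses the same version as you, it is better to write your answer so that it is portable.

@ghoti Restricted it to "GNU sed".

@ghoti Could you provide a portable answer? I would like to learn more.

The problem is that non-GNU sed does not interpret ANSI backslash notation such as \n. The standard way to embed a newline in the replacement string is a literal newline preceded by a backslash. In bash you can do that with format quoting, for example: sed $'s/,ASM/\\\nASM/g' input.csv. You can read about $'..' under "QUOTING" in the bash man page. In this case, though, I think the OP is actually looking to split every 16 fields, as mickp suggested, so sed $'s/,/\\\n/16;P;D' input.csv would probably be my solution.

A one-line explanation of what exactly the sed command does would be great. It certainly comes from a good answer, but it is an answer to a different question. The OP did not specify that GCA always occurs at the 16th position.

@Tomalak Added :)

@TrebuchetMS It looks like the OP has one very long line that just needs to be formatted into a proper csv. I think it is a reasonable assumption that every csv line has the same number of fields, so I can use a constant such as 16.

I support that assumption. Besides, the OP's sample input already contradicts itself: there is no GCA_* in the second record.

Another way to do this 16th-comma approach: xargs -d, -l16 echo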
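The portability point raised in the comments can also be handled without bash's $'..' quoting. The sketch below puts a backslash followed by a literal newline inside ordinary single quotes, which is the form the comment describes as standard; the function name split_before_asm_posix and the sample line are made up for illustration.

```shell
# Hypothetical helper: portable variant of 's/,ASM/\nASM/g'.
# The replacement contains a backslash followed by a real newline,
# so the second line of the sed script must start in column 0.
split_before_asm_posix() {
  sed 's/,ASM/\
ASM/g' "$@"
}

# Made-up sample line for illustration only.
printf 'ASM1,x,ASM2,y\n' | split_before_asm_posix
# prints:
# ASM1,x
# ASM2,y
```

Because the script needs no special quoting, it runs unchanged under plain sh as well as bash.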