Bash: merging rows with awk based on multiple-field match/mismatch
Tags: bash, csv, awk, merge

We have a csv:
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
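Our ultimate goal is: if the "cpd_number" ($2) is the same, merge the rows whose "discipline" ($10) is "Cell" with the rows whose "discipline" ($10) is not "Cell". (There are only three options for discipline: Biochemical, Cell, Enzyme.) The ideal output is given below. (Note) The new "result_value" ($7) = the "result_value" ($7) of the row whose "discipline" ($10) is "Cell", divided by the "result_value" ($7) of the row whose "discipline" ($10) is "Biochemical" or "Enzyme".

Doing this in one go looks complicated. So I tried to merge the whole rows first: if the "cpd_number" ($2) is the same but the "discipline" ($10) differs, join the row whose "discipline" ($10) is "Cell" onto each row whose "discipline" ($10) is not "Cell". After this merge, we can use awk to further clean up and rename the headers.

Could any guru offer some ideas on how to write this one-liner? This is just a toy example. The actual csv file is quite large, so starting with /^51/ may not be ideal. Thanks!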
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
Extra example:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,8888,9999, IC50 ,,200, uM , 2006-09-01 00:00:00 , Cell ,
Output:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
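The four result_value entries in the output above follow from the division rule stated in the question (Cell result_value divided by the Biochemical or Enzyme result_value). A small sketch that only redoes this arithmetic for the sample numbers; the variable names are made up for illustration:

```shell
# Cell result_values (100, 200) divided by the non-Cell result_values (10, 5)
# taken from the extra example. Only the arithmetic is shown here, no CSV parsing.
ratios=$(awk 'BEGIN {
    n = split("100 200", cell, " ")     # result_value of the two Cell rows
    m = split("10 5", other, " ")       # result_value of the Biochemical and Enzyme rows
    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++)
            print cell[i] "/" other[j] " = " (cell[i] / other[j])
}')
printf '%s\n' "$ratios"
```

This prints 10, 20, 20 and 40 for the four Cell/non-Cell pairs, matching column 10 of the output rows.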
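Here is an awk script based on your sample input and final desired output. Feel free to tweak it to suit your needs. It should be enough to get you started.

It needs two passes over the csv file. In the first pass it builds arrays, keyed on the second column together with the two assay ID columns, for the rows whose discipline is Cell. In the second pass it formats the rows together. Since you have not stated how to handle rows that have no Cell discipline counterpart, the solution below ignores them.

Contents of script.awk: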
BEGIN {
    FS = " *, *"                                 # Input field separator: a comma with optional spaces around it
    OFS = ","                                    # Output field separator: a plain comma
}
NR==FNR {                                        # First pass over the file
    if ($(NF-1) == "Cell") {                     # The second-to-last field (discipline) is Cell
        flds[$2,$3,$4] = $3 OFS $4 OFS $5;       # Store cols 3, 4 and 5, comma-separated
        date[$2,$3,$4] = $9                      # Store the Cell date
        result[$2,$3,$4] = $7                    # Store the Cell result_value (col 7)
    }
    next                                         # Move to the next record
}
{                                                # Second pass over the file
    for (fld in flds) {                          # For every Cell entry stored in the first pass
        split (fld, tmp, SUBSEP);                # Split the composite key into its pieces
        if ($(NF-1) != "Cell" && tmp[1] == $2) { # Discipline is not Cell and cpd_number matches the key
            line = $0                            # Save the current line in a variable
            $3 = flds[fld] OFS $3                # Put the Cell cols 3-5 in front of col 3
            $7 = result[fld] / $7                # New result value: Cell result / this row's result
            $9 = date[fld] OFS $9                # Put the Cell date in front of this row's date
            $(NF-1) = "Cell" OFS $(NF-1)         # Put the Cell text in front of the discipline
            NF--                                 # Drop the empty last field (activity_flag)
            print                                # Print the merged line
            $0 = line                            # Restore the original line for the next Cell entry
        }
    }
}
$(NF-1) == "Cell" {                              # Cell rows themselves are never printed
    next
}
Run it like this:
$ awk -f script.awk file file
51,cpd-7788990,1212,2323,IC50,9999,0,IC50,,10,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,8888,9999,IC50,9999,0,IC50,,20,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,1212,2323,IC50,4444,5555,Ki,>,20,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
51,cpd-7788990,8888,9999,IC50,4444,5555,Ki,>,40,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
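The file is named twice on the command line because the script tells the two passes apart with NR==FNR: NR counts records across all inputs, while FNR restarts at 1 for each file, so the two are equal only while the first copy is being read. A minimal sketch of that idiom on a throwaway file (the temporary file and its three lines are made up for illustration):

```shell
# Hypothetical three-line demo file, created only for this illustration.
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"

# NR keeps counting across both reads; FNR resets when the second read starts,
# so the first rule fires only during the first pass.
passes=$(awk 'NR==FNR { print "pass1", $0; next }
                      { print "pass2", $0 }' "$tmp" "$tmp")
printf '%s\n' "$passes"
rm -f "$tmp"
```

Each line shows up once tagged pass1 and then once tagged pass2, which is how script.awk can collect the Cell rows before it formats any of the other rows.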
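You can include a print statement for the header in the BEGIN block.

What have you done so far?

Thanks jaypal! A very neat solution! But could you add some comments to each line? I am trying to fully understand the script so that I can modify it. The current one works very well on the current example. However, I have just added an "Extra example": the current script keeps only one Cell:Enzyme row and one Cell:Biochemical row, instead of the two Cell:Enzyme rows and two Cell:Biochemical rows expected for the extra example.

@Chubaka Your new data actually changes the whole answer. I have updated it. Please copy the script, save it in a file, and run it as shown above. I have added some comments to guide you through the process.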
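The header suggestion from the answer could look roughly like the sketch below. The column names are copied from the desired output in the question; it is an assumption here that the print would be placed after the OFS assignment in the existing BEGIN block of script.awk. The snippet runs the BEGIN block on its own, so it prints only the header:

```shell
# Stand-alone BEGIN block: awk reads no input here, it just prints the header.
# In script.awk the print statement would go inside the existing BEGIN block,
# after OFS is set, so the header is joined with commas.
header=$(awk 'BEGIN {
    OFS = ","
    print "targetID", "cpd_number", "Cell_assay_id", "Cell_alt_assay_id", "type_desc",
          "assay_id", "alt_assay_id", "type_desc", "Operator", "result_value", "unit_value",
          "Cell_experiment_date", "experiment_date", "Cell_discipline", "discipline"
}')
printf '%s\n' "$header"
```

Because BEGIN runs once before any input is read, the header is emitted a single time even though the file is passed twice.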