Bash: find duplicates in a column and append a value
Transform the file input.csv:
id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith@example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,
awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"@");$7=mail1[1]mails[$7]"@"mail1[2] } else { $0=$0 } }1' mailscsv
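Set the field separator to a comma, then build an array keyed by email address. Each time an address is encountered, increment its counter. If the address has been seen more than once, split it on "@" into another array, mail1. Set $7 to the first element of mail1 (the part before the @), followed by the counter value from mails, then "@" and the second element of mail1 (the part after the @). If the address has been seen only once, leave the line as it is. The trailing 1 prints the line.
Task specification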
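This task would be easier if it were specified differently:
Read the CSV with: -vFPAT='[^,]*|"[^"]*"'
Replace the email field: $7=firstname "." lastname domain;
Count the occurrences of each email: emailcounts[$7]++
Keep the input order with an iterator: iter
Save the non-email fields for the second loop: immutables[++iter]=$1","$2","$3","$4","$5","$6","$8
Save the email for the second loop: emails[iter]=$7
Iterate over the keys of the immutables dictionary: for (iter in immutables)
If an email occurs more than once, change it: if (emailcounts[emails[iter]] > 1)
Increment the per-email iterator: emailiter[emails[iter]]++
Append the iterator to the email: email=gensub(/@/, emailiter[emails[iter]]"@", "g", emails[iter])
Print the result: print immutables[iter], email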
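The FPAT pattern in the first step of the list above can be checked on its own before running the full script. The following is a minimal sketch, assuming gawk is installed (FPAT is a gawk extension); the function name show_fields is hypothetical and only serves the illustration.

```shell
# Print every field of each CSV line on its own line, numbered.
# Assumption: gawk is available; FPAT is not supported by other awks.
show_fields() {
    gawk -v FPAT='[^,]*|"[^"]*"' '{
        # FPAT describes what a field looks like instead of what separates fields,
        # so a quoted field may contain commas.
        for (i = 1; i <= NF; i++)
            printf "%d: %s\n", i, $i
    }'
}
```

For instance, piping the line a,"b, c",d into show_fields should list three fields, with "b, c" kept whole as field 2 rather than split at its comma.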
[...] "Manager, Commanding Officer" because it treats that as two fields). Added the result and fixed "Manager, Commanding Officer".
Related: the use of cat in the above solution is useless. Otherwise ... well explained.
@kvantour Agreed as far as efficiency goes. It is even worse than useless. However, I personally tend to prefer seeing up front which file I am about to parse.
In that case, you can use
There are no commas in the rows where the index was added. Are there?
15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com,
16 10 Dave Genesy Tester dgenesy2@abc.com
17 10 Maria Kramer Library Divisions Manager mkramer2@abc.com
18 11 Dave Genesy Head of office dgenesy3@abc.com
19,11,Elizabeth Meeks,Branch Manager,emeeks@abc.com,
I had to set the output field separator (OFS) to a comma; I have amended the answer.
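Why OFS matters here: when a field such as $7 is assigned, awk rebuilds the whole record and joins the fields with OFS, which defaults to a single space. That is why modified rows lose their commas unless OFS is set. The following is a spelled-out sketch of the same counting idea, not the answer's exact code. The function name dedupe_emails is hypothetical, and the check that skips rows with an empty email is an added assumption that the one-liner does not make.

```shell
# Append a running counter to repeated email addresses in column 7.
# Reads CSV on stdin and writes CSV on stdout; plain POSIX awk is enough.
dedupe_emails() {
    awk -F, -v OFS=, '
        # Assumption: the header row and rows without an email are left alone.
        NR > 1 && $7 != "" {
            n = ++seen[$7]                  # how often this address has occurred so far
            if (n > 1) {
                split($7, part, "@")        # part[1] is the local part, part[2] the domain
                $7 = part[1] n "@" part[2]  # assigning $7 rebuilds $0 using OFS
            }
        }
        { print }
    '
}
```

Run as dedupe_emails <mails.csv, where mails.csv stands for whatever input file is being processed. The first occurrence of an address is kept as it is and later occurrences get 2, 3, and so on before the "@".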
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy@abc.com,
14,9,,,David Genesy,Library Director,dgenesy2@abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen@abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2@abc.com,
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy@abc.com,
14,9,,,David Genesy,Library Director,dgenesy@abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen@abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2@abc.com,
<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN {
    OFS = ","
    # Visit array indices in ascending numeric order in the END loop (gawk only).
    PROCINFO["sorted_in"] = "@ind_num_asc"
}
{
    # Build an email from the name when the email field is empty.
    if ($7 == "") {
        split($5, name, " ")
        firstname = substr(tolower(name[1]), 1, 1)
        lastname = tolower(name[2])
        domain = "@abc.com"
        $7 = firstname "." lastname domain
    }
    emailcounts[$7]++
    immutables[++iter] = $1 "," $2 "," $3 "," $4 "," $5 "," $6 "," $8
    emails[iter] = $7
}
END {
    for (iter in immutables) {
        if (emailcounts[emails[iter]] > 1) {
            # Insert a running number before the "@" of a repeated address.
            emailiter[emails[iter]]++
            email = gensub(/@/, emailiter[emails[iter]] "@", "g", emails[iter])
        } else {
            email = emails[iter]
        }
        print immutables[iter], email
    }
}'
id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith@example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1@abc.com
14,9,,,David Genesy,Library Director,,d.genesy2@abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza@abc.com
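A note on row order: awk leaves the order of a for (key in array) loop unspecified unless told otherwise, so output like the above depends on the stored rows being visited in input order. The following is a minimal sketch, assuming gawk, of how PROCINFO["sorted_in"] pins that order. The function name replay_in_order is hypothetical.

```shell
# Buffer stdin in an array, then print the lines back with their index.
# Assumption: gawk is available; PROCINFO["sorted_in"] is a gawk extension.
replay_in_order() {
    gawk '
        # Make for-in loops visit indices in ascending numeric order.
        BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
        { line[++n] = $0 }
        END { for (i in line) print i ": " line[i] }
    '
}
```

In an awk without this extension, a counting loop such as for (i = 1; i <= n; i++) gives the same guarantee.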