Awk: optimizing a Bash script for large data
I wrote a bash script that tries to build a new file from two files.
File 1:
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05
File 2:
1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000835819335;0;;2021.03.21;9990;100083581;ISATD;U;TEST;1234567892;2;;0;0;0;0;2020.10.12.00:00:03;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0
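The new file should contain only the lines from file 1 that contain the pattern "U", with an extra column taken from the 10th field of file 2 (123456789X). So my final output would look like this: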
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11,1234567890
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05,1234567893
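My script is below and works fine. The only problem is that the data I am working with is very large, and generating the output file takes far too long. I put a timestamp after each step and found that the for loop needs hours to produce a few KB of output, while I am processing several hundred MB of data. I need help optimizing it.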
cat /dev/null > new_file
used_Serial_Number=`grep U file1 | awk -F "," '{print $1}'`
echo "Serial no extracted at `date`" # Till this portion is getting completed in 2-3mins
for i in $used_Serial_Number; do
msisdn=`grep $i file2 | awk -F ";" '{print $10}'`
grep $i file1 | awk -v msisdn=$msisdn -F "," 'BEGIN { OFS = "," } { print $0 , msisdn }' >> new_file
done
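Could you please try the following, written and tested in GNU awk with the shown samples. If the 9th field of Input_file1 can be either U or u, change $9=="U" to tolower($9)=="u" so that both cases match.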
awk '
BEGIN{
FS=";"
OFS=","
}
FNR==NR{
a[$1]=$10
next
}
($1 in a) && $9=="U"{
print $0,a[$1]
}
' Input_file2 FS="," Input_file1
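The case-insensitive variant mentioned above could be sketched as follows. This is only an illustration under assumptions: the file names ci_file1, ci_file2 and ci_new_file and the shortened rows (keys k1 to k3, including the lowercase "u" row) are made up for the demo and are not part of the original sample data.

```shell
set -eu

# Illustrative input (hypothetical rows): the 9th field is U, u and X respectively.
printf '%s\n' \
  'k1,9,369,9901,0,t1,t2,t3,U,t4' \
  'k2,9,369,9901,0,t1,t2,t3,u,t4' \
  'k3,9,369,9901,0,t1,t2,t3,X,t4' > ci_file1

# Illustrative lookup file (hypothetical rows): the 10th field is the value to append.
printf '%s\n' \
  'k1;0;;d;9914;x;ISATD;U;TEST;111;2' \
  'k2;0;;d;9914;x;ISATD;U;TEST;222;2' \
  'k3;0;;d;9914;x;ISATD;U;TEST;333;2' > ci_file2

# Same program as above, but the 9th field is lowercased before comparing,
# so both "U" and "u" rows are kept.
awk '
BEGIN { FS = ";"; OFS = "," }
FNR == NR { a[$1] = $10; next }
($1 in a) && tolower($9) == "u" { print $0, a[$1] }
' ci_file2 FS="," ci_file1 > ci_new_file

cat ci_new_file
```

With this input, the k1 and k2 rows are kept and the k3 row (status X) is dropped.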
Explanation: the same awk program with a detailed comment on each line.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=";" ##Setting FS as ; here.
OFS="," ##Setting OFS as , here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file2 is being read.
a[$1]=$10 ##Creating array a with index $1 and value is $10 here.
next ##next will skip all further statements from here.
}
($1 in a) && $9=="U"{ ##Checking if $1 is in a and 9th field is U then do following.
print $0,a[$1] ##Printing current line along with value of a with index of $1 here.
}
' Input_file2 FS="," Input_file1 ##Reading Input_file2 first, then setting FS to , before Input_file1 is read.
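To check the program against the sample data from the question, a minimal end-to-end sketch might look like this. The names demo_file1, demo_file2 and demo_new_file are assumptions chosen so the demo does not overwrite real files; the rows are the ones shown in the question.

```shell
set -eu

# Sample rows of file 1 from the question (comma-separated, status in field 9).
cat > demo_file1 <<'EOF'
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05
EOF

# Sample rows of file 2 from the question (semicolon-separated, value in field 10).
cat > demo_file2 <<'EOF'
1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000835819335;0;;2021.03.21;9990;100083581;ISATD;U;TEST;1234567892;2;;0;0;0;0;2020.10.12.00:00:03;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0
EOF

# One pass over each file: load field 10 of file 2 into an array keyed by field 1,
# then print the matching "U" rows of file 1 with that value appended.
awk '
BEGIN { FS = ";"; OFS = "," }
FNR == NR { a[$1] = $10; next }
($1 in a) && $9 == "U" { print $0, a[$1] }
' demo_file2 FS="," demo_file1 > demo_new_file

cat demo_new_file
```

Each file is read only once here, whereas the original loop starts several grep and awk processes and rescans both files for every serial number, which is the likely reason it takes hours on inputs of several hundred MB.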
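This worked like a gem. Thank you very much. Could you also recommend any links where I can learn more about the awk features you used?
@AmanSingh, you are welcome. I would say you could start learning from this link. Cheers, keep learning, and keep sharing on this great forum :)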