Awk: optimizing a Bash script for large data


I wrote a bash script that tries to build a new file from two existing files.

File 1:

1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05
File 2:

1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000835819335;0;;2021.03.21;9990;100083581;ISATD;U;TEST;1234567892;2;;0;0;0;0;2020.10.12.00:00:03;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0
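The new file should contain only the lines of file 1 that contain the pattern "U", with an extra column taken from the 10th field of file 2 (123456789X). So my final output would look like this: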

1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11,1234567890
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05,1234567893
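My script is below. It works, but the data I am working with is very large and generating the output file takes far too long. I put a timestamp after each step and found that the for loop takes hours to produce a few KB of output, while I am dealing with several hundred MB of data. I need help optimizing it.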

cat /dev/null > new_file

used_Serial_Number=`grep U file1 | awk -F "," '{print $1}'`

echo "Serial no extracted  at `date`"  # Till this portion is getting completed in 2-3mins

for i in $used_Serial_Number; do

msisdn=`grep $i file2 | awk -F ";" '{print $10}'`

grep $i file1 | awk -v msisdn=$msisdn -F "," 'BEGIN { OFS = "," } { print $0 , msisdn }' >> new_file

done


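Could you please try the following, written and tested with the shown samples in GNU awk. If the 9th field of file 1 could be either U or u, change $9=="U" to tolower($9)=="u" to match both cases.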

awk '
BEGIN{
  FS=";"
  OFS=","
}
FNR==NR{
  a[$1]=$10
  next
}
($1 in a) && $9=="U"{
  print $0,a[$1]
}
' file2 FS="," file1
Explanation: a detailed, line-by-line explanation of the above.

awk '                    ##Starting awk program from here.
BEGIN{                   ##Starting BEGIN section from here.
  FS=";"                 ##Setting FS as ; here.
  OFS=","                ##Setting OFS as , here.
}
FNR==NR{                 ##Checking condition if FNR==NR which will be TRUE when Input_file2 is being read.
  a[$1]=$10              ##Creating array a with index $1 and value is $10 here.
  next                   ##next will skip all further statements from here.
}
($1 in a) && $9=="U"{    ##Checking if $1 is in a and 9th field is U then do following.
  print $0,a[$1]         ##Printing current line along with value of a with index of $1 here.
}
' file2 FS="," file1     ##Mentioning Input_file2 then setting FS as , and mentioning Input_file1 here.
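
To check the approach against the sample rows from the question, here is a minimal, self-contained sketch. It wraps the same awk program in a shell function and uses the tolower($9)=="u" variant mentioned earlier. The function name join_used and the temporary directory are illustrative assumptions, not part of the original post.

```shell
#!/usr/bin/env bash
# Sketch only: join_used is a hypothetical helper name, not from the original answer.
# $1 = file2 (semicolon-separated), $2 = file1 (comma-separated)
join_used() {
  awk '
  BEGIN { FS = ";"; OFS = "," }
  FNR == NR { a[$1] = $10; next }      # first file: map serial number -> 10th field
  ($1 in a) && tolower($9) == "u" {    # second file: keep U/u rows with a known serial
    print $0, a[$1]
  }
  ' "$1" FS="," "$2"
}

# Recreate the sample data from the question in a throwaway directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

cat > "$workdir/file1" <<'EOF'
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05
EOF

cat > "$workdir/file2" <<'EOF'
1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000835819335;0;;2021.03.21;9990;100083581;ISATD;U;TEST;1234567892;2;;0;0;0;0;2020.10.12.00:00:03;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0
EOF

# Prints the two "U" rows of file1, each with the matching 10th field of file2 appended.
join_used "$workdir/file2" "$workdir/file1"
```

Each file is read exactly once here, whereas the loop in the question rescans both files for every serial number, which is where the hours went. On real data the call would be something like join_used file2 file1 > new_file.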

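This works like a gem. Thanks a lot. Could you also recommend any links where I can learn more about the awk features you used?

@AmanSingh, you're welcome. You can start learning from the links here. Cheers, keep learning and keep sharing on this great forum :)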