Linux: AWK partial string search between two files


linux bash shell unix awk


File2:

U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
|R1664236401|2018-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664546501|R1664546401|2019-04-01 00:00:00|2020-12-30 00:00:00|abc
U1774546301||2020-05-01 00:00:00|2020-12-31 00:00:00|abc

File1:

U17745463
R16645464
R16642364

Current solution:

awk 'BEGIN {print "columns"} {FS=OFS="|"} NR==FNR{a[$1]; next} {for (i in a) if($1 != "" && $2 != ""){if(index($1, i)){print $0} else {if(index($2, i)){print $0}} } else{ if((index($1, i)) || (index($2, i))){print $0}}}' file2.txt file1.txt > result.txt

Output:

|R1664236401|2018-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664546501|R1664546401|2019-04-01 00:00:00|2020-12-30 00:00:00|abc
U1774546301||2020-05-01 00:00:00|2020-12-31 00:00:00|abc

This solution produces the output, but it takes a long time on a million records and sometimes hangs. Is there a better solution to this problem?

This looks like exactly what grep was made for, so I would expect it to be more efficient.

$ grep -f file1 file2
|R1664236401|2018-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664546501|R1664546401|2019-04-01 00:00:00|2020-12-30 00:00:00|abc
U1774546301||2020-05-01  00:00:00|2020-12-31 00:00:00|abc

$ cat file1
U17745463
R16645464
R16642364

$ cat file2
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
|R1664236401|2018-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664546501|R1664546401|2019-04-01 00:00:00|2020-12-30 00:00:00|abc
U1774546301||2020-05-01  00:00:00|2020-12-31 00:00:00|abc

Note that the -f option specifies the file from which the patterns are read.

If Perl is an option, try the script listed in full at the end of this page.


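If we compare the two files line by line, every entry of one file is tested against every entry of the other, which takes O(N**2) comparisons and a very long time on large inputs.
To reduce the number of comparisons, we sort the files in advance and then walk through both as streams, comparing entries in the same manner as the merge phase of a merge sort. After sorting, each file is read only once per pass.
Because there are two columns to compare in `file2`, we prepare two sorted copies of `file2`, one for each sort key.
The same logic could also be implemented with `awk`, but let me use `perl` this time.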
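The answer mentions that awk could do the job as well. The following is a rough sketch of one awk variant, not the answerer's method: instead of sorting and merging, it stores the keys in an array and tests each record with one hash lookup per distinct key length, so it never loops over all keys for every record. It assumes that a key must match at the start of field 1 or field 2, which the thread does not confirm. The scratch directory and the `result.txt` name are only illustrative.

```shell
# Sample data from the question, written to a scratch directory (illustrative only).
tmp=$(mktemp -d)

cat > "$tmp/file1" <<'EOF'
U17745463
R16645464
R16642364
EOF

cat > "$tmp/file2" <<'EOF'
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
|R1664236401|2018-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664546501|R1664546401|2019-04-01 00:00:00|2020-12-30 00:00:00|abc
U1774546301||2020-05-01 00:00:00|2020-12-31 00:00:00|abc
EOF

# Assumption: a key counts as a match only when it is a prefix of field 1 or field 2.
awk -F'|' '
NR == FNR {                      # first file: remember each key and its length
    if ($0 != "") { keys[$0]; lens[length($0)] }
    next
}
{
    for (n in lens) {            # one hash lookup per distinct key length
        if ((substr($1, 1, n) in keys) || (substr($2, 1, n) in keys)) {
            print
            next                 # print each record at most once
        }
    }
}' "$tmp/file1" "$tmp/file2" > "$tmp/result.txt"

cat "$tmp/result.txt"
```

If keys may appear anywhere inside a field, this lookup no longer applies and a substring search such as `index` or `grep -F` is needed instead.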

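With grep I get an out-of-memory error on the large files.

@rakeshkandukuri Try adding -F to the grep options to match fixed strings instead of regular expressions; it may use less memory: grep -Ff file1 file2. I also think the order of file2.txt and file1.txt should be swapped in the awk example.

Must the entries of file1 match at the beginning of the (first two) fields of file2, or do you want a match anywhere within those fields? Can each entry of file1 have multiple matches in file2?

@RavinderSingh13 Can you help me with this problem?

Do you want to check whether the values in file1 match column 1 of file2?

The file1 column should be searched for in column 1 or column 2 of file2.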
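To illustrate the fixed-string suggestion from the comments, here is a minimal sketch that runs `grep -F -f` on the sample data from the question. `LC_ALL=C` is an extra that nobody in the thread mentions; it makes grep compare bytes instead of locale-aware characters, which is often faster, and it can be left out. Note that grep looks for a key anywhere in the line, not only in the first two fields, so this only approximates the awk logic. The scratch directory and the `result.txt` name are illustrative.

```shell
# Sample data from the question, written to a scratch directory (illustrative only).
work=$(mktemp -d)

cat > "$work/file1" <<'EOF'
U17745463
R16645464
R16642364
EOF

cat > "$work/file2" <<'EOF'
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664246201||2020-03-01 00:00:00|2020-12-31 00:00:00|abc
|R1664236401|2018-03-01 00:00:00|2020-12-31 00:00:00|abc
U1664546501|R1664546401|2019-04-01 00:00:00|2020-12-30 00:00:00|abc
U1774546301||2020-05-01 00:00:00|2020-12-31 00:00:00|abc
EOF

# -F: treat every line of file1 as a literal string, not a regular expression
# -f: read the patterns from file1
LC_ALL=C grep -F -f "$work/file1" "$work/file2" > "$work/grep_result.txt"

cat "$work/grep_result.txt"
```

Whether this avoids the out-of-memory error on the real data depends on the number of keys and on the grep implementation, so it is worth testing on a subset first.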

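Perl script:

LC_ALL=C sort file1 > file1_sorted                      # byte order, so that sort agrees with the "lt" comparison in perl
LC_ALL=C sort -t "|" -k 1,1 file2 > file2_sorted_0      # sort by the 1st column
LC_ALL=C sort -t "|" -k 2,2 file2 > file2_sorted_1      # sort by the 2nd column

perl -e '

for my $i (0 .. 1) {                                    # one pass per compared column of file2
    open(my $fh1, "<", "file1_sorted") or die "file1_sorted: $!";
    open(my $fh2, "<", "file2_sorted_$i") or die "file2_sorted_$i: $!";
    my ($l1, $l2, @ary);                                # reset the state for every pass

    while (1) {
        unless (defined $l1) {                          # if marked to update
            $l1 = <$fh1>;                               # read the next key from file1_sorted
            last unless defined $l1;                    # no keys left: this pass is done
            chomp($l1);                                 # remove the newline
            if ($l1 eq "") { undef $l1; next }          # a blank key would match every line
        }
        unless (defined $l2) {                          # if marked to update
            $l2 = <$fh2>;                               # read the next record from file2_sorted_n
            last unless defined $l2;                    # no records left: this pass is done
            chomp($l2);
            @ary = split(/\|/, $l2);                    # split on "|"
        }
        my $field = defined $ary[$i] ? $ary[$i] : "";
        if (index($field, $l1) == 0) {                  # the key is a literal prefix of the field
            print "$l2\n";                              # print the matched line
            undef $l2;                                  # keep the key: later records may match it too
        } elsif ($field lt $l1) {                       # the record sorts before the key
            undef $l2;                                  # set mark to update the record
        } else {
            undef $l1;                                  # otherwise move on to the next key
        }
    }
    close($fh1);
    close($fh2);
}
'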