Awk: exclude by regular expression and process very large files


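I have a text file that needs to be corrected. Every line whose block-list:name value is one of the words listed in the file "exclude.txt" should be removed from the original text.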

original.txt

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
exclude.txt

tart
wrok

The expected output would look like this:

final.txt

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
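This is fine if I have only 2 or 3 words in the exclude file. The problem is that both the original file and the exclude file contain millions of words.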


Update:

I forgot to mention that original.txt also contains this line:

<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>

The awk and grep+sed commands get killed. I would prefer to use an include file rather than an exclude file, if possible.

You can use this grep+sed solution in bash:

grep -vFf …

Using awk with " as the delimiter, so basically every even-numbered field is a word (blablabla "word" blalbla "another word" …):

Output:

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

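grep -v -F -w -f exclude.txt original.txt ?

Yes. Correct. Please post it as an answer.

-v -f alone works. Why add -F -w?

See: with -w, lines that merely contain the string, e.g. start, are not removed. See: man grep

This solution is better than awk because non-standard lines are preserved. I don't need to learn a new awk solution to use this command. :)

This works for the exclude list. But if I replace the file with an include list and drop the -v flag, the command gets killed. That is because the include file has 15 million lines while the exclude file has only 15,000 lines. Is it possible to use the include file (15 million lines) instead of the exclude file (15,000 lines)?

When using the include file instead of the exclude file, change if(!(t in a)) to if((t in a)). It works as expected.
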
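The effect of -w discussed in the comments above can be checked on a small sample. This is a minimal sketch, assuming a grep that supports -w (GNU grep does); the file names sample.txt and words.txt and the "start" line are made up for illustration and are not part of the question's data.

```shell
# Hypothetical sample data in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  '<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>' \
  '<block-list:block block-list:abbreviated-name="stert" block-list:name="start" />' \
  '<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>' \
  > sample.txt
printf '%s\n' 'tart' > words.txt

# -F: fixed strings, -f: read patterns from a file, -v: invert the match.
# Without -w, "tart" also matches inside "start", so all three lines are dropped.
# (grep exits non-zero when it prints nothing, hence the "|| true".)
grep -v -F -f words.txt sample.txt > without_w.txt || true

# With -w only whole-word matches count, so the "start" line survives.
# The third line is still dropped: "tart" is a whole word in abbreviated-name.
grep -v -F -w -f words.txt sample.txt > with_w.txt || true

cat with_w.txt
```

The third sample line shows the limitation mentioned in the question's update: grep cannot tell which attribute the word sits in, whereas the awk solutions compare only the block-list:name value.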
include.txt
test
work
table
total
exit
$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
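To see why $4 is the field being tested, it can help to print how awk splits one line when the double quote is the field separator. A small sketch; the sample line is taken from original.txt, and the variable names line and fields are only for illustration.

```shell
line='<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />'

# With -F'"' the text between quotes lands in the even-numbered fields:
# $2 is the abbreviated-name value and $4 is the name value.
fields=$(printf '%s\n' "$line" |
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

This only holds while every line has the two attributes in the same order; a line with a different layout would put something else in $4.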
$ awk '
NR==FNR {                             # process the exclude file
    a[$1]                             # hash word
    next
}
{                                     # process the original file
    for(i=1;i<=NF;i++)                # loop over every space-separated string
        if($i~/^block-list:name=/) {  # when we meet the desired string
            t=$i                      # copy string to temp var
            gsub(/^[^"]+"|".*/,"",t)  # extract the word
            if(!(t in a))             # if the word is not to be excluded
                print                 # output record
            next                      # move to the next record anyway
        }
}' exclude original