Awk/grep: unable to remove patterns from a CSV file

awk grep

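I have a file that needs to be cleaned of certain URLs. The URLs to remove are listed in one file, say fileA, and the data is in a CSV file, fileB (these are large files, 6-10 GB in size). I tried the grep command below, but it does not work on the newer fileB: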

grep -vwF -f patterns.txt fileB.csv > result.csv

fileA is structured as a list of URLs, like this:

URLs (header, single column)
bwin.hu
paradisepoker.li

And fileB:

type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com 
2|||www.bwin.hu|||1524024324|||bwin.hu

The field delimiter in fileB is |||

I am open to any solution, including awk. Thanks.

Edit: the expected output is a CSV file that keeps every row whose domain does not match any of the domain patterns in fileA:

type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com

Could you please try the following:

awk 'FNR==NR{a[$0];next} !($NF in a)' Input_filea FS="\\|\\|\\|" Input_fileb

The output is as follows:

type|||URL|||Date|||Domain
1|||https://www.google.com|||1524024000|||google.com

Explanation: here is the same code with an explanation added.

awk '                                          ##Starting awk program here.
FNR==NR{                                       ##Checking condition FNR==NR which will be TRUE when first Input_file named filea is being read.
  a[$0]                                        ##Creating an array named a whose index is $0(current line).
  next                                         ##next keyword will skip all further statements.
}                                              ##Closing block for condition FNR==NR here.
!($NF in a)                                    ##Checking condition if last field of current line is NOT present in array a for Input_fileb only.
                                               ##if condition is TRUE then no action is mentioned so by default print of current line will happen.
' filea FS="\\|\\|\\|" fileb                   ##Mentioning Input_file names and for fileb mentioning FS should be ||| escaped it here so that awk will consider it as a literal character.
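
To try the command end to end, here is a minimal sketch that recreates the two sample files from the question in a temporary directory and runs the same filter. It assumes a POSIX shell and awk. The file names fileA.txt and fileB.csv and the array name blocked are placeholders, and the separator is written as the bracket expression [|][|][|], a spelling swapped in here because it matches a literal ||| without any backslash escaping; it is not the form used in the answer above.

```shell
#!/bin/sh
# Recreate the sample data from the question (file names are placeholders).
tmp=$(mktemp -d)
printf '%s\n' 'URLs' 'bwin.hu' 'paradisepoker.li' > "$tmp/fileA.txt"
printf '%s\n' \
  'type|||URL|||Date|||Domain' \
  '1|||https://www.google.com|||1524024000|||google.com' \
  '2|||www.bwin.hu|||1524024324|||bwin.hu' > "$tmp/fileB.csv"

# First file: remember every listed domain as an array key.
# Second file: print only rows whose last field is not a remembered domain.
# The FS assignment sits between the file names, so it applies to fileB only.
result=$(awk 'FNR == NR { blocked[$0]; next } !($NF in blocked)' \
    "$tmp/fileA.txt" FS='[|][|][|]' "$tmp/fileB.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

On the sample data this should print the header row and the google.com row only. For the real files, redirect the awk output to result.csv instead of capturing it in a variable.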

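You said the delimiter of your file B is |||, but I cannot see it in the sample. Please post the samples clearly so that we can fully understand the question. A sample of the expected output is needed here as well.

Please add an example of the expected output and some more details. Is this a diff between the two files?

It is not a diff. I need the rows that do not match the domain patterns in fileA.

Thanks RavinderSingh13. Is this awk one-liner as efficient as it can be? My CSV is large.

@MallikKumar, it should be fine. It builds a single array a that is held in memory, but I don't think that is too bad. In my code, filea is the "URLs (header, single column)" file and fileb is your other file with the content. I don't have large data, so I have not tested it at that scale, but it should work. Let me know?

@RavinderSingh13, while simplifying the data set I forgot to mention that there are more columns, although the order of the columns I gave is correct. Could you change the code to use column 4 (Domain)? I tried ($4 in a), but got the same number of rows as in the CSV, and I am sure the CSV contains some matching domains.

@MallikKumar, then the field you are testing may not actually be the fourth one. Try running awk -F'\\|\\|\\|' '{for(i=1;i…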
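
For the follow-up in the comments, where the real file has more columns and ($4 in a) removed nothing, a sketch along these lines may help. It tests field 4 explicitly and, as an assumption about the cause that the thread does not confirm, strips carriage returns and surrounding blanks before comparing, since an invisible \r or trailing space makes a field differ from the stored key. The fifth column, the Windows line endings in fileA, the helper name clean, and the file names are all invented for illustration.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical data: fileA saved with CRLF line endings, fileB with an
# extra fifth column and a trailing space after one domain.
printf '%s\r\n' 'URLs' 'bwin.hu' 'paradisepoker.li' > "$tmp/fileA.txt"
printf '%s\n' \
  'type|||URL|||Date|||Domain|||Extra' \
  '1|||https://www.google.com|||1524024000|||google.com |||x' \
  '2|||www.bwin.hu|||1524024324|||bwin.hu|||y' > "$tmp/fileB.csv"

result=$(awk '
  # Remove carriage returns and leading/trailing blanks from a value.
  function clean(s) { gsub(/\r/, "", s); gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
  FNR == NR { blocked[clean($0)]; next }   # first file: collect domains
  !(clean($4) in blocked)                  # second file: test column 4 only
' "$tmp/fileA.txt" FS='[|][|][|]' "$tmp/fileB.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Rows are printed unchanged; only the comparison uses the cleaned value. If the cleaned version removes rows where the plain ($4 in a) test did not, stray whitespace or line endings in one of the files are the likely culprit.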