Bash: Keep all rows that have a duplicate value in column X


bash awk


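I have a file with a few thousand rows and more than 20 columns. I now want to keep only the rows whose email address in column 3 also appears in at least one other row.

File: (first name; last name; email; ...)

I want to keep every row that has a matching email address. In this case, the expected output is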

Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Andre;Agassi;tom@boyden.com

If I use

awk -F';' 'seen[$3]++' file

I lose the first instance of each email address, in this case lines 1 and 2, and keep only the later duplicates.

Is there a way to keep all of these lines?
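The two closely related seen[] idioms are easy to mix up, so here is a small sketch of how each behaves. The sample rows and the function names first_occurrences and later_occurrences are assumptions made for illustration, not the asker's real data. Neither idiom keeps the whole group of rows that share an address.

```bash
# Hypothetical sample data; field 3 holds the email address.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
EOF

# Prints a row only the first time its address is seen
# (one row per distinct address, duplicated or not).
first_occurrences() { awk -F';' '!seen[$3]++' "$1"; }

# Prints a row only if its address was already seen
# (every first occurrence is dropped).
later_occurrences() { awk -F';' 'seen[$3]++' "$1"; }

echo "first occurrences:"; first_occurrences "$sample"
echo "later occurrences:"; later_occurrences "$sample"
```

On this sample the first function prints the Tyson, Boyden and Lopez rows, and the second prints only the Cruise and Agassi rows.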

This awk one-liner will help you:

awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file

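It reads the file twice: the first pass counts how often each value of column 3 occurs, and the second pass checks that count and prints the rows whose value occurs more than once.

For the given input example, it prints: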

Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Andre;Agassi;tom@boyden.com
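The two-pass idea can be written out in a more verbose form. This is only a sketch: the function name keep_dups, its optional column argument, and the unmatched Lopez row in the sample file are assumptions added for illustration.

```bash
# Hypothetical sample file; the row with a unique address is an assumption.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
EOF

# keep_dups FILE [COL]: print every row whose COL value (default 3)
# occurs more than once in FILE. The file is passed to awk twice.
keep_dups() {
  awk -F';' -v col="${2:-3}" '
    NR == FNR { count[$col]++; next }   # first pass: count each key
    count[$col] > 1                     # second pass: print rows with a repeated key
  ' "$1" "$1"
}

keep_dups "$sample"
```

Because the rows are printed during the second pass, the original row order is preserved. The input has to be a real file (not a pipe), since it is read twice.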

Find the duplicate email addresses:

sed -s 's/^.*;/;/;s/$/$/' < file.csv | sort | uniq -d > dups.txt

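Update:

As Ed Morton pointed out, the command above fails when an email address contains characters that have a special meaning in regular expressions. The email addresses therefore have to be escaped.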
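A minimal illustration of the failure mode, assuming the addresses end up as regular-expression patterns for grep. The addresses are the ones used in the comments further down, and the variable names are made up for this sketch.

```bash
# A dot in a regular expression matches any character, so the pattern
# mr.jones@foo.com also matches the unrelated address mrsjones@foo.com.
regex_hits=$(printf '%s\n' 'mrsjones@foo.com' | grep -c 'mr.jones@foo.com')

# With -F the pattern is taken as a fixed string, so the dot is literal.
# grep exits non-zero when nothing matches, hence the "|| true".
fixed_hits=$(printf '%s\n' 'mrsjones@foo.com' | grep -cF 'mr.jones@foo.com' || true)

echo "regex: $regex_hits, fixed string: $fixed_hits"
```

This should report one hit for the regex form and none for the fixed-string form.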

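One way to do this is to use Perl-compatible regular expressions. In PCRE, the escape sequences \Q and \E mark the beginning and end of a string that should not be interpreted as a regular expression. GNU grep supports PCRE with the option -P, but that option cannot be combined with the option -f. This makes it necessary to use something like xargs. However, xargs interprets backslashes and ruins the regular expression. To prevent that, the option -0 has to be used.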

Lesson learned: it is quite hard to get this right without programming it in AWK.

sed -s 's/^.*;/;\\Q/;s/$/\\E$/' < file.csv | sort | uniq -d | tr '\n' '\0' > dups.txt
xargs -0 -i < dups.txt grep -P '{}' file.csv


If the output order doesn't matter, here is a one-pass approach:

$ awk -F';' '$3 in first{print first[$3] $0; first[$3]=""; next} {first[$3]=$0 ORS}' file
Mike;Tyson;mike@tyson.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Tom;Boyden;tom@boyden.com
Andre;Agassi;tom@boyden.com

I think @ceving just needs to go one step further.

Assuming the selected column is not the first or the last column:

cut -f"$col" -d\; file           |      # slice out the right column
  tr '[:upper:]' '[:lower:]'     |      # standardize case
  sort | uniq -d                 |      # sort and output only the dups
  sed 's/^/;/; s/$/;/;'          > dups # save the lowercased keys
grep -iFf dups file > subset.csv        # pull matching records

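This breaks if the selected column is the first or the last one, but it should preserve the case and order of the original file.

If the column might be the first or the last, pad the stream going into the final grep and strip the padding afterwards: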

sed 's/^/;/; s/$/;/;' file       |            # pad with leading/trailing delims
  grep -iFf dups                 |            # grab relevant records
sed 's/^;//; s/;$//;'            > subset.csv # strip the padding
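Put together, the two steps could be wrapped up roughly as follows. This is a sketch under stated assumptions: the function name dup_rows, the temporary file, and the mixed-case sample rows are invented for illustration. As with the commands above, an address that also appears in a different column can still cause a false match.

```bash
# dup_rows FILE COL: print the rows of a ;-separated FILE whose COL value
# occurs more than once, comparing case-insensitively. Hypothetical helper.
dup_rows() {
  file=$1 col=$2
  dups=$(mktemp)
  # Collect the lowercased keys that occur more than once, wrapped in delimiters.
  cut -d';' -f"$col" "$file" |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -d |
    sed 's/^/;/; s/$/;/' > "$dups"
  # Pad each row with delimiters so first/last columns can match, then unpad.
  sed 's/^/;/; s/$/;/' "$file" |
    grep -iFf "$dups" |
    sed 's/^;//; s/;$//'
  rm -f "$dups"
}

# Hypothetical sample with one address written in a different case.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Mike;Tyson;Mike@Tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
EOF

dup_rows "$sample" 3
```

The matching rows come out in their original order and with their original capitalization, because only the key list is lowercased.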

Could you please try the following, which reads the input file only once in a single awk:

awk '
BEGIN{
  FS=";"
}
{
  mail[$3]++
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0
}
END{
  for(i in mailVal){
    if(mail[i]>1){ print mailVal[i] }
  }
}' Input_file

Explanation: a detailed explanation of the above.

awk '                                                  ## Start of the awk program.
BEGIN{                                                 ## BEGIN block, run before any input is read.
  FS=";"                                               ## Set the field separator to ";".
}
{
  mail[$3]++                                           ## Count how often each 3rd-field value occurs.
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0     ## Append the current line to the lines stored for this 3rd-field value, newline-separated.
}
END{                                                   ## END block, run after all input is read.
  for(i in mailVal){                                   ## Loop over every stored 3rd-field value (order is unspecified).
    if(mail[i]>1){ print mailVal[i] }                  ## Print the stored lines if the value occurred more than once.
  }
}
' Input_file                                           ## Name of the input file.

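That grep will produce false matches, e.g. bar@example.com would match foobar@example.com.

@EdMorton Try it and you will see that it does not produce false matches.

Try it when the input file contains a duplicated mr.jones@foo.com and a single mrsjones@foo.com. You're right, the first example I gave would not be a problem.

@thanasisp No, it would not be a problem, because you need an anchor (or something else) to avoid a different problem. The OP also says that their real data has more than 20 columns, so Paul's comment about matching in only one column becomes very relevant.

Right. Changing s/$/$/ to s/$// should also solve that, but we are peeling an onion here.

This is noisy and could be simplified, I'm sure, but it does not break on embedded metacharacters, case differences, or stray hits. It still has the problem that false matches are possible if the email address can also appear in another column.

This problem is more complicated than the spec makes it look at first, lol. Looking forward to having enough free time to work on it. :)

Thanks, works great. Sorting (again) afterwards is no problem.

I know Stack Overflow doesn't like this kind of question, but I'll try anyway: is there a way to change the code so that the output shows only the first (rather than every) instance of the rows with a duplicated mail address?

Yes. There is a huge clue in the name of the array as to where the first instance can be found. Think carefully about what each of the 4 statements above does. If necessary, add extra debug prints to understand the code better. Try removing the text you think you don't want printed. See my comment below on how to get help if you are still stuck after trying to make the change yourself.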
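For the follow-up about showing only the first row of each duplicated address, one possible variant of the one-pass command (the one built around the first array) is sketched below. The function name first_of_dups and the sample rows are assumptions, and this is not necessarily the change the commenter had in mind.

```bash
# first_of_dups FILE: for each address in field 3 that occurs more than once,
# print only the first row that carried it.
first_of_dups() {
  awk -F';' '
    # Repeat of a known address: emit the stored first row (once), skip this row.
    $3 in first { printf "%s", first[$3]; first[$3] = ""; next }
    # First sighting of an address: remember the row, print nothing yet.
    { first[$3] = $0 ORS }
  ' "$1"
}

# Hypothetical sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
EOF

first_of_dups "$sample"
```

Each first row is printed at the moment its second occurrence is read, so the output order follows the position of the second occurrences rather than of the first rows.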
