Is there any method in Unix to extract all duplicate records based on a specific column? (Unix, Awk, Ksh)

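I am trying to extract all (and only) the duplicate values from a pipe-delimited file.

My data file has 800,000 rows and multiple columns, and I am particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all of the duplicated lines from that file.

I am able to achieve this as shown below: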

cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt

while read dup
do
   grep "$dup" Report.txt >>only_dup.txt
done <dup.txt

My expected output, which does not include the unique records:

1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements

This may be what you want:

$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
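The one-liner names the file twice so that awk reads it twice: while `NR==FNR` holds (the first pass) it only counts how often each `$3` occurs, and on the second pass it prints the lines whose count exceeds 1. A commented sketch of the same idea follows; the scratch directory and rows 7 and 8 are made up here purely to have some unique records to filter out.

```shell
# Scratch copy of the sample data; rows 7 and 8 are hypothetical unique records.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# The file is passed twice, so awk makes two passes over it.
awk -F'|' '
    NR == FNR { cnt[$3]++; next }   # pass 1: count each column-3 value
    cnt[$3] > 1                     # pass 2: print lines whose value repeats
' "$tmp/file" "$tmp/file" > "$tmp/out.txt"

cat "$tmp/out.txt"
```

Only the distinct `$3` values are held in memory, and the output keeps the input order.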

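Or, if the file is too large to fit all of the keys (the $3 values) in memory (which should not be a problem for the unique $3 values from just 800,000 lines):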
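A minimal sketch of one approach that fits this description, not necessarily the answer's own script: sort on column 3 so that equal keys become adjacent, then let awk compare each line only with the previous one. The names `prevKey`, `prevRec` and `printed` are made up for this example, rows 7 and 8 are hypothetical unique records, and the output comes out grouped by key rather than in input order.

```shell
# Scratch copy of the sample data; rows 7 and 8 are hypothetical unique records.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# sort brings equal column-3 values together; awk then keeps one record at a time.
LC_ALL=C sort -t'|' -k3,3 "$tmp/file" | awk -F'|' '
    NR > 1 && $3 == prevKey {           # same key as the previous line
        if (!printed++) print prevRec   # emit the first line of the group once
        print                           # emit the current line
        next
    }
    { prevKey = $3; prevRec = $0; printed = 0 }   # a new key starts a new group
' > "$tmp/out.txt"

cat "$tmp/out.txt"
```

The memory cost moves from awk to `sort`, which can spill to temporary files when the input is large.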

EDIT2: Following Ed sir's suggestion, I fine-tuned my solution to use (IMO) more meaningful array names.

awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!unique_check_count[val]++){
    numbered_indexed_array[++count]=val
  }
  actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
  line_count_array[val]++
}
END{
  for(i=1;i<=count;i++){
    if(line_count_array[numbered_indexed_array[i]]>1){
      print actual_valued_array[numbered_indexed_array[i]]
    }
  }
}
'  Input_file

Could you please try the following; it gives the output in the same order in which the lines occur in Input_file.

awk '
match($0,/[^ ]* /){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
'  Input_file

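Explanation of the above code:

awk '                                 ## Start the awk program.
match($0,/[^ ]* /){                   ## Match everything up to and including the first space.
  val=substr($0,RSTART+RLENGTH)       ## val is the rest of the line after that match.
  if(!a[val]++){                      ## True only the first time val is seen; a[val] counts occurrences.
    b[++count]=val                    ## Record val in array b in first-seen order.
  }                                   ## End of the if block.
  c[val]=(c[val]?c[val] ORS:"")$0     ## Append the current line to c[val], separating lines with ORS.
  d[val]++                            ## Count how many lines share this val.
}                                     ## End of the match block.
END{                                  ## END block, run after all input has been read.
  for(i=1;i<=count;i++){              ## Loop over the distinct values in first-seen order.
    if(d[b[i]]>1){                    ## Only values that occurred more than once.
      print c[b[i]]                   ## Print every line stored for that value.
    }
  }
}
'  Input_file                         ## Input file name.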

Another one in awk:

$ awk -F\| '{                  # set delimiter
    n=$1                       # store number
    sub(/^[^|]*/,"",$0)        # remove number from string
    if($0 in a) {              # if $0 in a
        if(a[$0]==1)           # if $0 seen the second time
            print b[$0] $0     # print first instance
        print n $0             # also print current
    }
    a[$0]++                    # increase match count for $0
    b[$0]=n                    # number stored to b and only needed once
}' file

Output for the sample data:

2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team

Also, would this work:

$ sort -k 2 file | uniq -D -f 1

or with -k2,5 or something like that? No, because the delimiter changed from space to pipe.
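`uniq -f` skips blank-separated fields only, which is why the pipeline stops working on pipe-delimited data. One possible workaround is to swap the delimiter around the pipeline. This is only a sketch and rests on two assumptions: no field contains a space, and `uniq` supports the non-POSIX `-D` option (GNU coreutils does). Like the command above, it compares everything after the first field, not column 3 alone. Rows 7 and 8 are hypothetical unique records.

```shell
# Scratch copy of the sample data; rows 7 and 8 are hypothetical unique records.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# Turn pipes into spaces, sort from field 2 on, keep all repeated lines
# while ignoring field 1, then restore the pipes.
tr '|' ' ' < "$tmp/file" | LC_ALL=C sort -k 2 | uniq -D -f 1 | tr ' ' '|' > "$tmp/out.txt"

cat "$tmp/out.txt"
```

Record 1 is dropped here because its fields 2 to 5 are unique, even though its column 3 value repeats.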

Two improvement steps.

First step: after

awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt

you can use

grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
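Because `.*` can also match `|`, the generated patterns are tied to column 3 only when every line has exactly five fields. A slightly stricter sketch uses `[^|]*` so that each placeholder stays inside one field. It still assumes that the column 3 values contain no regex metacharacters and that column 3 is not the last field; the file name `dup.re` and rows 7 and 8 are made up for this example.

```shell
# Scratch copy of the sample data; rows 7 and 8 are hypothetical unique records.
tmp=$(mktemp -d)
cat > "$tmp/Report.txt" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# Collect the column-3 values that occur more than once.
awk -F'|' '{print $3}' "$tmp/Report.txt" | LC_ALL=C sort | uniq -d > "$tmp/dup.txt"

# Turn each value into a pattern anchored on exactly the third field.
sed 's/.*/^[^|]*|[^|]*|&|/' "$tmp/dup.txt" > "$tmp/dup.re"

# One grep pass over the report instead of one grep per value.
grep -f "$tmp/dup.re" "$tmp/Report.txt" > "$tmp/out.txt"

cat "$tmp/out.txt"
```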

Second step: use awk as given in the other, better answers.

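So the delimiter changed from space to pipe. Maybe I'll fix it later, I have to go.

How about my version, sir? What is your view on that one, sir :)

Fixed for the changed delimiter.

Cool. I also fixed the delimiter after Ed sir mentioned it to me; just as I asked you to check my code, I asked him too :)

@RavinderSingh13 your variable names are too long. I'm an a, b, c, d guy myself :D :D :D

Lol

The first :D would probably have been enough, though.

@JamesBrown yes, that should be all they need, but I started out thinking the file was huge, since I don't know what a lac is, so I came up with the second script above :-)

@EdMorton, ++ve for your nice code. I also thought of the first one, but then the OP said there are too many lines in Input_file, so I thought reading Input_file twice might take more time (though I am not sure, I have not tested anything here). I would like you to take a look at my version, sir; I have not tested it with a large data set yet.

Look into why your shell loop calling awk is so slow (and its other problems). For other issues with your shell code, see [link] and [link]. Include in your question how many unique $3 values you expect among those 800,000+ lines.

If you give your arrays more meaningful names than a, b, c, and d, so the code is easier to understand, then I will take a closer look if you still want me to.

When naming variables, name them for what they are used for, not for how they are implemented. For example, numbered_indexed_array tells me the array is indexed by numbers but tells me nothing about how it is used. Just by glancing at the code numbered_indexed_array[++count]=val I already know it is indexed by numbers, so the name is not useful. Think about what the array contains and name it accordingly. You have a variable named val, which tells me it is a value. OK, everything is a value, so that is not useful. What is it a value of? It looks like it is a key value used to determine uniqueness, so you could name it key, which is a more useful name than val. So what about !unique_check_count[val]++? It looks like that would be more useful as !seen[key]++. Now, what about numbered_indexed_array[++count]=val? Well, count is not useful: count of what? It looks like it is the count of unique key values, so the whole line should be keys[++numKeys]=key. It looks like actual_valued_array[val] should be key2recs[key], since it appears to map keys to their associated records. And so on.

Hope you don't mind, but I edited your answer to add a version of the script using IMHO more useful, meaningful variable names to demonstrate what I mean.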
