Unix: remove duplicate rows from a CSV based on two columns

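I have a CSV file with several columns. Some rows may have duplicate values in the fourth column, col4.

I need to delete every row in which a duplicate occurs and keep only one row per value. The row to keep is the one with the largest value in col1.

Here is an example:

Rows 1, 2, and 3 are duplicates, and only the third row should be kept, because col1 of row 3 > col1 of row 2 > col1 of row 1.

Right now, this code removes the duplicates in col4 without looking at col1: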

awk '!seen[$4]++' myfile.csv

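I want to add a condition that checks col1 for each duplicate, removes the rows with the lower col1 values, and keeps the row with the highest value in col1.

The output should be:

col1,col2,col3,col4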

3,y,b,123

1,z,c,999

Thanks, everyone!

@Mr. Smith: could you please try the following and let me know if it helps?

awk -F"[[:space:]]+,[[:space:]]+"  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}'   Input_file  Input_file

EDIT: Try:

awk -F","  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file   Input_file

EDIT2: Here is an explanation, as requested by the OP:
awk -F","                               ##### Start awk and set the field delimiter to a comma (,).
'FNR==NR{                               ##### FNR==NR is TRUE only while Input_file is being read for the first time.
                                              In this pass we store $1 in array A, using the last field as the index.
                                              NR and FNR are awk built-in variables that both count lines. The only difference is that
                                              FNR is RESET each time a new input file is opened,
                                              whereas NR keeps increasing until all input files have been read. So this condition is
                                              TRUE only while the first Input_file is being read.
A[$NF]=                                 ##### Assign to the element of array A whose index is $NF (the last field of the current line). The value depends on a condition:
$1>A[$NF]                               ##### is the current line's $1 greater than the value already stored in A[$NF]? (Only lines with the same last field
                                              share an array element, so only those lines are compared with each other.)
?                                       ##### ? is the ternary (conditional) operator: if the condition is TRUE, the expression after it is used,
$1                                      ##### which sets A[$NF] to $1 (because the requirement is to keep the HIGHEST value).
:                                       ##### If the condition is FALSE, the expression after : is used instead,
A[$NF];                                 ##### which keeps the current value of A[$NF] unchanged.
next}                                   ##### next is a built-in awk statement: it skips all remaining statements and starts again from the very first statement
                                              with the next line. It is used here so that the statements below do not run during the first pass over Input_file.
(($NF) in A) && $1 == A[$NF] && A[$NF]{ ##### These conditions are evaluated only while Input_file is being read for the second time. They check that
                                              $NF (the last field of the current line) is an index in array A, that the stored value equals the first field, and that the stored value is neither empty nor zero.
print                                   ##### If all of the above conditions are TRUE, print the current line of Input_file.
}' Input_file   Input_file              ##### Name Input_file twice so that it is read in two passes.
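
The two passes described above can be tried on a small file. This is only a sketch: the header and the two kept rows are taken from the expected output in the question, while the rows `1,x,a,123` and `2,x,a,123`, the temporary file, and the `result` variable are invented here for illustration.

```shell
# Hypothetical sample data: only the header and the two kept rows come from the
# question; the rows "1,x,a,123" and "2,x,a,123" are made up for this demo.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
col1,col2,col3,col4
1,x,a,123
2,x,a,123
3,y,b,123
1,z,c,999
EOF

# Pass 1 (FNR==NR): remember the largest col1 seen for each value of the last field.
# Pass 2: print only the lines whose col1 equals that stored maximum.
result=$(awk -F',' 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' "$tmp" "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With this sample, the command prints the header line followed by `3,y,b,123` and `1,z,c,999`, which matches the output requested in the question.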

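No, this is not clear. Could you add more information, a sample input file, and the expected output to your post, so that everyone can help?

There is an example of the input and the output; please read it carefully.

I did, and the result is the same. Nothing changed; the duplicates are still there.

Of course they are still there. When you posted, I guess you did not use code tags or similar, so there were spaces between the fields at the time, and I wrote my solution accordingly. Could you try my edited solution?

Why is the file given twice in the code? Can you explain?

One of the main reasons is to keep the output lines in the same order as the lines of the input file. I have edited my solution to include an explanation; if you have any questions about it, please let me know.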
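
For comparison with the point about line order, here is a single-pass variant. It is not the answer's method, only an alternative sketch: the array names `best`, `row`, and `order` are made up, it assumes the key is in field 4 and that col1 is numeric, and the sample rows `1,x,a,123` and `2,x,a,123` are invented.

```shell
# Hypothetical sample data (same made-up rows as a quick check).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
col1,col2,col3,col4
1,x,a,123
2,x,a,123
3,y,b,123
1,z,c,999
EOF

single_pass=$(awk -F',' '
  # First time a col4 value is seen: record its position and remember the row.
  !($4 in best) { order[++n] = $4; best[$4] = $1; row[$4] = $0; next }
  # Later rows with the same col4 replace the stored row only if col1 is strictly larger.
  ($1 + 0) > (best[$4] + 0) { best[$4] = $1; row[$4] = $0 }
  # Print one row per col4 value, in the order the values first appeared.
  END { for (i = 1; i <= n; i++) print row[order[i]] }
' "$tmp")
printf '%s\n' "$single_pass"
rm -f "$tmp"
```

This reads the file once, but it holds one row per key in memory and prints the rows in the order in which each col4 value first appeared. That can differ from the input order of the kept lines, which is what the two-pass command preserves. On the other hand, it keeps exactly one row per key when col1 values tie, and it still prints a group whose largest col1 is 0, which the `&& A[$NF]` test in the two-pass command would drop.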
