Unix: correctly joining two files based on two shared columns

Tags: unix, join, awk

I have two files that I am trying to join/merge based on columns 1 and 2. They look like this; file1 (58210 lines) is much shorter than file2 (815530 lines), and I would like to find the intersection of the two files using fields 1 and 2 as the index:

file1:

2L      25753   33158
2L      28813   33158
2L      31003   33158
2L      31077   33161
2L      31279   33161
3L      32124   45339
3L      33256   45339
...

file2:

2L      20242   0.5     0.307692307692308
2L      22141   0.32258064516129        0.692307692307692
2L      24439   0.413793103448276       0.625
2L      24710   0.371428571428571       0.631578947368421
2L      25753   0.967741935483871       0.869565217391304
2L      28813   0.181818181818182       0.692307692307692
2L      31003   0.36    0.666666666666667
2L      31077   0.611111111111111       0.931034482758621
2L      31279   0.75    1
3L      32124   0.558823529411765       0.857142857142857
3L      33256   0.769230769230769       0.90625
...

I have been using the following two commands, but they give different final line counts:

awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file1 file2 | wc -l
awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file2 file1 | wc -l

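I am not sure why this happens. I tried sorting before comparing, in case there are duplicate rows (based on columns 1 and 2) in either file, but that did not seem to help. (Any insight into the cause would also be appreciated.)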

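How can I merge the files so that only the file2 rows with matching columns 1 and 2 in file1 are printed, with column 3 of file1 appended, so that the result looks like this: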

2L      25753   0.967741935483871       0.869565217391304    33158
2L      28813   0.181818181818182       0.692307692307692    33158
2L      31003   0.36    0.666666666666667    33158
2L      31077   0.611111111111111       0.931034482758621    33161
2L      31279   0.75    1    33161
3L      32124   0.558823529411765       0.857142857142857    45339
3L      33256   0.769230769230769       0.90625    45339

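If you want to join the files line by line (pairing line N of file1 with line N of file2, rather than matching on columns 1 and 2), use the following command: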

join -o 1.2,1.3,2.4,2.5,1.4 <(cat -n file1) <(cat -n file2)

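You can use the join command, but you need to create a single join field in each data file. Assuming that column 1 really does contain values other than 2L, this code should work regardless of whether the two input files are sorted: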

tmp=${TMPDIR:-/tmp}/tmp.$$
trap "rm -f $tmp.?; exit 1" 0 1 2 3 13 15

awk '{print $1 ":" $2, $0}' file1 | sort > $tmp.1
awk '{print $1 ":" $2, $0}' file2 | sort > $tmp.2

join -o 2.2,2.3,2.4,2.5,1.4 $tmp.1 $tmp.2

rm -f $tmp.?
trap 0

If you have bash and "process substitution", or if you know the data is already appropriately sorted, you can simplify the processing.
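As a sketch of that simplification, assuming bash (process substitution is not available in plain POSIX sh), the temporary files and the trap can be dropped. The sample rows and the workdir and joined variable names below are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: same key-building idea as the script above, but using bash process
# substitution instead of temporary files. Sample data is illustrative only.
workdir=$(mktemp -d)
printf '2L\t25753\t33158\n3L\t32124\t45339\n2L\t28813\t33158\n' > "$workdir/file1"
printf '2L\t20242\t0.5\t0.3\n2L\t25753\t0.9\t0.8\n3L\t32124\t0.5\t0.8\n' > "$workdir/file2"

# Prefix each line with a "col1:col2" key, sort, then join on that key (field 1).
joined=$(join -o 2.2,2.3,2.4,2.5,1.4 \
    <(awk '{print $1 ":" $2, $0}' "$workdir/file1" | sort) \
    <(awk '{print $1 ":" $2, $0}' "$workdir/file2" | sort))

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Note that join writes its output fields separated by single spaces, so the original tab layout is not preserved.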

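I am not entirely sure why your code does not work, but I would probably use a[$1,$2] as the subscript; that will save you trouble if some of the values in column 1 are purely numeric and could become ambiguous when columns 1 and 2 are concatenated. That is why the "key creation" awk scripts use a colon between the fields.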
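To illustrate that ambiguity, here is a small sketch with made-up rows (not taken from the question's data). With plain concatenation the pairs (1, 23) and (12, 3) both produce the key 123, while a[$1,$2] keeps them apart because awk places SUBSEP between the parts. With column 1 values such as 2L a collision is unlikely, so this may not be the cause of the differing counts in the question.

```shell
# Made-up rows, chosen so that concatenating columns 1 and 2 collides.
rows='1 23 first
12 3 second'

# Concatenated key: both rows map to "123", so only one array entry exists.
concat_keys=$(printf '%s\n' "$rows" |
    awk '{a[$1 $2]} END {n = 0; for (k in a) n++; print n}')

# Comma subscript: awk joins the parts with SUBSEP, so the keys stay distinct.
subsep_keys=$(printf '%s\n' "$rows" |
    awk '{a[$1,$2]} END {n = 0; for (k in a) n++; print n}')

echo "distinct keys with concatenation: $concat_keys"
echo "distinct keys with SUBSEP:        $subsep_keys"
```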

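Using modified data files (the modified file1, then file2, which is unchanged from the question, then the resulting output are listed after the comments further down this page).

Another option is to do the whole thing in awk:

awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

If this is not what you want, please clarify, and perhaps post some more representative sample input/output.

A commented version of the above code, to provide the requested explanation: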

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
#    to the number of records read so far in the current file, a
#    condition which can only be true for the first file read, THEN 
NR==FNR {

   # populate array "a" such that the value indexed by the first
   # 2 fields from this record in file1 is the value of the third
   # field from the first file.
   a[$1,$2]=$3

   # Move on to the next record so we do not do any processing intended
   # for records from the second file. This is like an "else" for the
   # NR==FNR condition.
   next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
#    record exists in array a, as would occur if these same values existed
#    in file1, THEN
($1,$2) in a {

   # print the current record from file2 followed by the value from file1
   # that occurred at field 3 of the record that had the same values for
   # field 1 and field 2 in file1 as the current record from file2.
   print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT

Hope that helps.

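Comments:

Can you show an example where the first column differs?

Which fields should be used to join the rows? Or should they be joined line by line?

Your sample data has no rows that should match…

You must have one file that was created on DOS and one that was created on UNIX or something similar, because there must be some kind of control character at the end of the lines in one or both files that is affecting the output. Try "cat -v" on both files to see the control characters, and dos2unix on both files to fix them.

The join needs to be on columns 1 and 2, doesn't it? And join seems to work on one column only.

@JonathanLeffler I got no response from the OP to my question, so I simply assumed the join was line by line, which is why I join on the line numbers generated by cat.

OK, fair enough; I don't think that is what the OP had in mind, but I missed the -n in the cat command (but then I subscribe to the New Jersey school of cat design, and "cat came back from Berkeley waving flags" (a paraphrase of a quote attributed to Ken Thompson) irks me).

@JonathanLeffler I don't even understand why he would get a line of output and not be surprised, if both the data and the requirement to join on columns 1 and 2 are as stated.

@Serge I don't know sed well enough to work out what is going on, but your updated answer does not produce the desired output; it actually adds the line counts of the two input files together. I don't know whether this helps, but field 1 is a character field and field 2 is numeric. Both files are sorted by field 1 and then numerically by field 2.

In that case there are no matching lines in your input. The solution does work for the problem you described. Alternatively, you might be using old, broken awk (/usr/bin/awk on Solaris). What does awk --version tell you?

You are right, your solution works. I checked it with the files you gave as an example. I had been trying it with the files from the post above. Sorry for the mistake.

@tedee12345: please see the revised answer above.

@EdMorton I have learned a lot of tricks from you. This answer, with its complete explanation, deserves a Nice Answer badge. +1.

Nice answer! Congratulations on reaching 20K; your awk guru-ism is fully deserved :)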
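As a sketch of the line-ending check suggested in the comments: a trailing carriage return on the last field makes otherwise identical values differ. Where dos2unix is not available, awk can both count and strip them. The file names dosfile and unixfile and the sample rows are placeholders.

```shell
workdir=$(mktemp -d)
# Placeholder file with DOS (CRLF) line endings.
printf '2L\t25753\t33158\r\n2L\t28813\t33158\r\n' > "$workdir/dosfile"

# Count lines that end in a carriage return (cat -v would show them as ^M).
cr_before=$(awk '/\r$/ {n++} END {print n + 0}' "$workdir/dosfile")

# Write a cleaned copy with the trailing carriage returns removed.
awk '{sub(/\r$/, ""); print}' "$workdir/dosfile" > "$workdir/unixfile"
cr_after=$(awk '/\r$/ {n++} END {print n + 0}' "$workdir/unixfile")

echo "lines ending in CR before: $cr_before, after: $cr_after"
rm -rf "$workdir"
```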

Modified file1 used to test the join script:

2L      5753   33158
2L      8813   33158
2L      7885   33158
2L      7885   33159
2L      1279   33158
2L      5095   33158
2L      3256   33158
2L      5372   33158
2L      7088   33161
2L      5762   33161

file2 used for the same test (unchanged from the question):

2L      5095    0.666666666666667       1
2L      5372    0.5     0.925925925925926
2L      5762    0.434782608695652       0.580645161290323
2L      5904    0.571428571428571       0.869565217391304
2L      5974    0.434782608695652       0.694444444444444
2L      6353    0.785714285714286       0.84
2L      7088    0.590909090909091       0.733333333333333
2L      7885    0.714285714285714       0.864864864864865
2L      7902    0.642857142857143       0.810810810810811
2L      8263    0.833333333333333       0.787878787878788

Output of the join script for these two files:

2L 5095 0.666666666666667 1 33158
2L 5372 0.5 0.925925925925926 33158
2L 5762 0.434782608695652 0.580645161290323 33161
2L 7088 0.590909090909091 0.733333333333333 33161
2L 7885 0.714285714285714 0.864864864864865 33158
2L 7885 0.714285714285714 0.864864864864865 33159

Sample session for the awk one-liner:

$ cat file1
2L      5753   33158
2L      8813   33158
2L      7885   33159
2L      1279   33159
2L      5095   33158
$
$ cat file2
2L      8813    0.6    1.2
2L      5762    0.4    0.5
2L      1279    0.5    0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L      8813    0.6    1.2 33158
2L      1279    0.5    0.9 33159
$
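On the original question of why the two wc -l counts differ: swapping the file order changes which file's rows are printed, so the counts need not agree when a key occurs more than once in either file. This is an assumption about the cause, since the full files are not shown; the rows below are made up to demonstrate the effect.

```shell
workdir=$(mktemp -d)
# Made-up data: the key (2L, 7885) appears twice in file1 and once in file2.
printf '2L 7885 33158\n2L 7885 33159\n2L 5095 33158\n' > "$workdir/file1"
printf '2L 5095 0.66 1\n2L 7885 0.71 0.86\n' > "$workdir/file2"

# file1 first: prints the rows of file2 whose key occurs in file1.
count_12=$(awk 'NR==FNR {a[$1,$2]; next} ($1,$2) in a' \
    "$workdir/file1" "$workdir/file2" | wc -l)

# file2 first: prints the rows of file1 whose key occurs in file2.
count_21=$(awk 'NR==FNR {a[$1,$2]; next} ($1,$2) in a' \
    "$workdir/file2" "$workdir/file1" | wc -l)

echo "file1 then file2: $count_12 rows"
echo "file2 then file1: $count_21 rows"
rm -rf "$workdir"
```

Here the first command prints two rows and the second prints three, because both duplicate rows of file1 match the single 7885 row in file2.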
