How can I improve a GNUwin32 join command?

I can't get join to produce the result I need.

I'm running GNUwin32 on 64-bit Windows 7, with join version 5.3.0.1936 and gawk version 3.1.6.2962.

The input consists of the following two tables:

Table 1

UID_C   CID
C000002 31799
C000002 31800
C000386 14950
C000386 9807916
C000386 10255083
C008114 5318432
C008117 799
C008117 444150
C008117 46878464

Table 2

UID_C   CID name
C000002 31799   bevonium
C000002 31800   bevonium
C002284 24832095    hypromellose
C008117 799 indoleglycerol phosphate
C008117 444150  indoleglycerol phosphate
C008117 46878464    indoleglycerol phosphate

I use the following command in a .bat file:

C:\gnuwin32\bin\join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,2.2,2.3" C:\directory\Table_1.txt C:\directory\Table_2.txt > C:\directory\Table_3.txt
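
In my illustration here on Stack Overflow the tables are formatted with tabs for readability, but in practice I use the pipe character as both the input and the output delimiter.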

The output is the following table:

Table 3

UID_C   CID CID name
C000002 31800   31799   bevonium
C000002 31800   31800   bevonium
C000002 31799   31799   bevonium
C000002 31799   31800   bevonium
C000386 10255083    NULL    NULL
C000386 9807916 NULL    NULL
C000386 14950   NULL    NULL
C002284 NULL    24832095    hypromellose
C008114 5318432 NULL    NULL
C008117 46878464    799 indoleglycerol phosphate
C008117 46878464    444150  indoleglycerol phosphate
C008117 46878464    46878464    indoleglycerol phosphate
C008117 444150  799 indoleglycerol phosphate
C008117 444150  444150  indoleglycerol phosphate
C008117 444150  46878464    indoleglycerol phosphate
C008117 799 799 indoleglycerol phosphate
C008117 799 444150  indoleglycerol phosphate
C008117 799 46878464    indoleglycerol phosphate

The desired output is:

Table 4

UID_C   CID name
C000002 31799   bevonium
C000002 31800   bevonium
C000386 14950   NULL
C000386 9807916 NULL
C000386 10255083    NULL
C002284 24832095    hypromellose
C008114 5318432 NULL
C008117 799 indoleglycerol phosphate
C008117 444150  indoleglycerol phosphate
C008117 46878464    indoleglycerol phosphate

How can I change the join command to produce the desired output?

Alternatively, how should I use awk to post-process Table 3 into Table 4?


Thanks in advance for any suggestions.

I think you need more logic than join provides:

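awk -F"|" -v "OFS=|" '
    # First file (Table_1): remember every UID|CID pair
    NR==FNR {uid_cid[$1 OFS $2]=1; next}
    # Second file (Table_2): print each row, ticking off pairs also in Table_1
    {
        key = $1 OFS $2
        if (key in uid_cid) {
            delete uid_cid[key]
        }
        print
    }
    # Pairs left over exist only in Table_1: print them with a NULL name
    END {
        for (key in uid_cid) {
            print key, "NULL"
        }
    }
' Table_1 Table_2 | sort -k1,1 -k2,2n -t "|"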
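
One way to see why join alone falls short here: it matches on a single field, so every Table 1 row for a UID is paired with every Table 2 row for the same UID. The sketch below counts those pairings per UID for a pipe-delimited copy of the question's sample data with the header rows left out. The file names t1.txt, t2.txt, and pairs.txt are made up for illustration, and the commands assume a POSIX shell rather than cmd.exe.

```shell
# Pipe-delimited copies of the sample tables (headers omitted).
cat > t1.txt <<'EOF'
C000002|31799
C000002|31800
C000386|14950
C000386|9807916
C000386|10255083
C008114|5318432
C008117|799
C008117|444150
C008117|46878464
EOF
cat > t2.txt <<'EOF'
C000002|31799|bevonium
C000002|31800|bevonium
C002284|24832095|hypromellose
C008117|799|indoleglycerol phosphate
C008117|444150|indoleglycerol phosphate
C008117|46878464|indoleglycerol phosphate
EOF

# Count rows per UID in each file, then report how many lines a join on
# field 1 with -a1 -a2 would emit for that UID.
awk -F'|' '
    NR == FNR { n1[$1]++; seen[$1] = 1; next }
    { n2[$1]++; seen[$1] = 1 }
    END {
        for (uid in seen) {
            a = n1[uid] + 0
            b = n2[uid] + 0
            # Matched UIDs give a*b pairings; unmatched rows pass through once each.
            rows = (a && b) ? a * b : a + b
            print uid "|" rows
        }
    }
' t1.txt t2.txt | sort > pairs.txt
cat pairs.txt
```

The counts come to 4, 3, 1, 1, and 9, which adds up to the 18 data rows of Table 3. Keying on the UID and CID together, as the awk program above does, avoids the multiplication.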

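I need some help with your suggestion. Column 1 is UID (let's drop the trailing underscore for simplicity), column 2 is CID, and column 3 is name. Help me translate these awk phrases: {uid_cid[$1 OFS $2]=1; next} and key in uid_cid. I'm not making the connection yet. I'm also having trouble running it under Windows.

I ran the awk code with Table 1 and Table 2 as input and redirected the output to a new Table_3 to make sure I understand the awk processing. I get "errcount: 1".

I'm using double quotes (for the Windows .bat file) rather than single quotes to delimit the awk program.

Did you take care of the inner quotes around "NULL"?

Aha, no. I'll try with "NULL".
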
C000002|31799|bevonium
C000002|31800|bevonium
C000386|14950|NULL
C000386|9807916|NULL
C000386|10255083|NULL
C002284|24832095|hypromellose
C008114|5318432|NULL
C008117|799|indoleglycerol phosphate
C008117|444150|indoleglycerol phosphate
C008117|46878464|indoleglycerol phosphate
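
The quoting trouble described in the comments arises because cmd.exe does not treat single quotes as quoting characters. One way around it, sketched here, is to keep the awk program in its own file and pass it with -f, so that the command line carries no awk quotes at all. The names merge.awk, in1.txt, in2.txt, and merged.txt are hypothetical, the sample rows are invented for the demo, and the commands are written for a POSIX shell.

```shell
# Write the awk program to a file; nothing in it needs shell quoting.
cat > merge.awk <<'EOF'
BEGIN { FS = OFS = "|" }
# First file: remember each UID|CID pair.
NR == FNR { pending[$1 OFS $2] = 1; next }
# Second file: print every row and tick off pairs that were matched.
{ delete pending[$1 OFS $2]; print }
# Pairs never seen in the second file get a NULL name.
END { for (key in pending) print key, "NULL" }
EOF

# Small invented inputs, pipe-delimited like the real tables.
printf '%s\n' 'A1|10' 'A1|20' 'B2|30' > in1.txt
printf '%s\n' 'A1|10|alpha' 'C3|40|gamma' > in2.txt

# Merge, then sort by UID and numerically by CID.
awk -f merge.awk in1.txt in2.txt | sort -t'|' -k1,1 -k2,2n > merged.txt
cat merged.txt
```

In a .bat file the same program file could presumably be run as gawk -f merge.awk followed by the two table paths, leaving only the sort delimiter to quote. That Windows invocation is untested here.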