Awk/bash: add/append new columns from other files


I have a one-column file name.txt, for example

A
B
C
D
E
F

Then I have many files, e.g. x.txt, y.txt and z.txt

x.txt has

A 1
C 3
D 2

y.txt has

A 1
B 4
E 3

z.txt has

B 2
D 2
F 1

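The desired output is (fill in 0 where there is no mapping)

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Is it possible to do this with bash (perhaps with awk)?
Many thanks.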

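First edit: my initial effort
Since I am quite new to bash, I find it hard to work out a possible solution with awk. I am more familiar with R, where this can be done with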

namematrix[namematrix[,1]==xmatrix[,1],]

In any case, I really appreciate the help below, which taught me more about awk and join.

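Second edit: a very efficient approach

Fortunately, inspired by some of the excellent answers below, I worked out a computationally very efficient approach, shown here. It may help others who run into a similar problem, especially when they deal with a very large number of very large files.

First, create a script join_awk.bash: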

#!/bin/bash
join -o auto -e0 -a1 "$1" "$2" | awk '{print $2}'

For example, running this bash script on name.txt and x.txt

join_awk.bash name.txt x.txt

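produces

1
0
3
2
0
0

Note that I keep only the second column here to save disk space: in my dataset the first column holds very long names, which would take up a huge amount of disk space.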
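One detail the script relies on implicitly: join expects both inputs to be sorted on the join field. The following is a minimal sketch, not part of the original post, that sorts the inputs first and shows the two-column result before the awk filter. The scratch directory, the name.sorted and x.sorted file names, and the deliberately shuffled sample data are assumptions made for illustration; -o auto is a GNU join extension.

```shell
#!/bin/sh
# Sketch: sort both inputs on the key before joining (join expects sorted input).
set -e
export LC_ALL=C                        # same collation for sort and join
work=$(mktemp -d)                      # hypothetical scratch directory
cd "$work"
printf '%s\n' C A F B E D > name.txt   # deliberately unsorted names
printf '%s\n' 'D 2' 'A 1' 'C 3' > x.txt

sort name.txt > name.sorted
sort -k1,1 x.txt > x.sorted

# -a1     : also print names that have no match in x.txt
# -e0     : use 0 for the missing value
# -o auto : GNU extension, pad every line to the same number of fields
joined=$(join -o auto -e0 -a1 name.sorted x.sorted)
printf '%s\n' "$joined"
```

Because the rows now follow the sorted order, name.sorted (not name.txt) is the file that labels the rows of any single-column output derived from this.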

Then simply run

parallel join_awk.bash name.txt {} \> outdir/output.{} ::: {a,b,c}.txt

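This was inspired by the excellent answer below that uses GNU parallel and join. The difference is that the answer below has to specify -j1 for parallel because of its serial appending logic, which means it is not truly "parallel". In addition, it gets slower and slower as the serial appending goes on. Here, by contrast, each file is processed separately and in parallel. This can be very fast when many large files are processed on multiple CPUs.

Finally, merge all the single-column output files with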

cd outdir
paste output* > merged.txt

This is also very fast, because paste by its nature merges its input files in parallel, line by line.
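For readers without GNU parallel, the same per-file idea can be sketched with plain shell background jobs. This swaps parallel for & and wait, so it is a variation rather than the method described above, and it starts one job per file without any limit on concurrency. The scratch directory and sample files are assumptions for illustration, and name.txt is pasted in as the first column only to make the result easy to read.

```shell
#!/bin/sh
# Sketch: one independent join per data file in the background, one paste at the end.
set -e
export LC_ALL=C
work=$(mktemp -d)             # hypothetical scratch directory
cd "$work"
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt
mkdir outdir

for f in x.txt y.txt z.txt; do
    # each join depends only on name.txt and one data file, so the jobs can overlap
    join -o auto -e0 -a1 name.txt "$f" | awk '{print $2}' > "outdir/output.$f" &
done
wait                          # block until every background job has finished

# the glob expands in sorted order (x, y, z), which fixes the column order
merged=$(paste -d ' ' name.txt outdir/output.*)
printf '%s\n' "$merged"
```

Dropping name.txt from the paste call gives the values-only file described above.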

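Yes, you can do it, and yes, awk is the tool. Using arrays together with the per-file record number (FNR, file number of records) and the total record number (NR, number of records), you can read all the letters from name.txt into the a[] array and then keep track of the file number in the variable fno. You append everything found in x.txt, and then, before processing the first line of the next file (y.txt), you loop over all the letters, place a 0 for each one that was not seen in the previous file, and continue processing as normal. The same is repeated for each additional file.
A further line-by-line explanation is shown in the comments: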

awk '
FNR==NR {                       # first file
    a[$1] = ""                  # fill array with letters as index
    fno = 1                     # set file number counter
    next                        # get next record (line)
}
FNR == 1 { fno++ }              # first line in file, increment file count
fno > 2 && FNR == 1 {           # file no. 3+ (not run on x.txt)
    for (i in a)                # loop over letters
        if (!(i in seen))       # if not in seen array
            a[i] = a[i]" "0     # append 0
    delete seen                 # delete seen array
}
$1 in a {                       # if line begins with letter in array
    a[$1] = a[$1]" "$2          # append second field
    seen[$1]++                  # add letter to seen array
}
END {
    for (i in a)                # place zeros for last column
        if (!(i in seen))
            a[i] = a[i]" "0
    for (i in a)                # print results
        print i a[i]
}' name.txt x.txt y.txt z.txt
Example Use/Output
Simply copy the above and middle-mouse paste it into an xterm whose current directory contains your files, and you will receive:

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1
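One caveat about the command above: the iteration order of for (i in a) is unspecified in awk, so the rows are not guaranteed to come out in name.txt order. Below is a minimal sketch of one way to pin the order while keeping the same append-and-pad logic. The order array, the n counter, the started flag and the pad() helper are names introduced here for illustration, not part of the answer, and the scratch directory with its sample files is likewise an assumption.

```shell
#!/bin/sh
# Sketch: same append-and-pad idea, but rows are printed in name.txt order.
set -e
work=$(mktemp -d)                       # hypothetical scratch directory
cd "$work"
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt

ordered=$(awk '
function pad(   i) {                    # append 0 for keys missing from the file just read
    for (i in a)
        if (!(i in seen))
            a[i] = a[i] " 0"
    delete seen
}
FNR == NR { a[$1] = ""; order[++n] = $1; next }   # name.txt: keys and their positions
FNR == 1 && started { pad() }           # a later data file begins: close the previous column
{ started = 1 }
$1 in a { a[$1] = a[$1] " " $2; seen[$1] }
END {
    pad()                               # close the last column
    for (k = 1; k <= n; k++)            # print in name.txt order
        print order[k] a[order[k]]
}' name.txt x.txt y.txt z.txt)
printf '%s\n' "$ordered"
```

Like the original, this sketch assumes that every data file contains at least one line; an empty file never triggers the FNR == 1 rule and would therefore contribute no column.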

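Creating a self-contained script
If you would rather create a script to run than paste at the command line, simply put the contents (without the enclosing single quotes) in a file and make the file executable. For example, the first line names the interpreter and the contents follow: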

#!/usr/bin/awk -f

FNR==NR {                       # first file
    a[$1] = ""                  # fill array with letters as index
    fno = 1                     # set file number counter
    next                        # get next record (line)
}
FNR == 1 { fno++ }              # first line in file, increment file count
fno > 2 && FNR == 1 {           # file no. 3+ (not run on x.txt)
    for (i in a)                # loop over letters
        if (!(i in seen))       # if not in seen array
            a[i] = a[i]" "0     # append 0
    delete seen                 # delete seen array
}
$1 in a {                       # if line begins with letter in array
    a[$1] = a[$1]" "$2          # append second field
    seen[$1]++                  # add letter to seen array
}
END {
    for (i in a)                # place zeros for last column
        if (!(i in seen))
            a[i] = a[i]" "0
    for (i in a)                # print results
        print i a[i]
}

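awk will process the filenames given as arguments in the order given.
Example Use/Output
Using the script file (I put it in names.awk and then used chmod +x names.awk to make it executable), you would then do: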

$ ./names.awk name.txt x.txt y.txt z.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Let me know if you have further questions.

You can use this awk:

awk 'NF == 2 {
    map[FILENAME,$1] = $2
    next
}
{
    printf "%s", $1
    for (f=1; f ...

And with bash:

#!/bin/bash

declare -A hash                              # use an associative array
for f in "x.txt" "y.txt" "z.txt"; do         # loop over these files
    while read -r key val; do                # read key and val pairs
        hash[$f,$key]=$val                   # store val under the (file, key) index
    done < "$f"
done

while read -r key; do
    echo -n "$key"                           # print the 1st column
    for f in "x.txt" "y.txt" "z.txt"; do     # loop over the filenames
        echo -n " ${hash[$f,$key]:-0}"       # print the associated value or "0" if undefined
    done
    echo                                     # put a newline
done < "name.txt"

Adding one more way of doing it. Could you please try the following, written and tested with the shown samples. IMHO it should work in any awk, although I only have GNU awk version 3.1. This is a very simple and common approach: first build an array while reading the first (main) Input_file, then at the end of each following file append a 0 to every element of that array that was not found in that specific Input_file. Tested only with the small given samples.

awk '
function checkArray(array){
  for(i in array){
    if(!(i in found)){ array[i]=array[i] OFS "0" }
  }
}
FNR==NR{
  arr[$0]
  next
}
foundCheck && FNR==1{
  checkArray(arr)
  delete found
  foundCheck=""
}
{
  if($1 in arr){
    arr[$1]=(arr[$1] OFS $2)
    found[$1]
    foundCheck=1
    next
  }
}
END{
  checkArray(arr)
  for(key in arr){
    print key arr[key]
  }
}
' name.txt x.txt y.txt z.txt

Explanation: adding a detailed explanation of the above.

awk '                                 ##Starting awk program from here.
function checkArray(array){           ##Creating a function named checkArray from here.
  for(i in array){                    ##Traversing through array here.
    if(!(i in found)){ array[i]=array[i] OFS "0" }  ##If key is NOT in found then append a 0 to that specific value.
  }
}
FNR==NR{                              ##Checking condition FNR==NR, which is TRUE while name.txt is being read.
  arr[$0]                             ##Creating array named arr with the current line as index.
  next                                ##next will skip all further statements from here.
}
foundCheck && FNR==1{                 ##Checking if foundCheck is SET and this is the first line of an Input_file.
  checkArray(arr)                     ##Calling function checkArray, passing the arr array to it.
  delete found                        ##Deleting found array to get rid of previous values.
  foundCheck=""                       ##Nullifying foundCheck here.
}
{
  if($1 in arr){                      ##Checking if the 1st field is present in arr.
    arr[$1]=(arr[$1] OFS $2)          ##Appending the 2nd field value to arr at index $1.
    found[$1]                         ##Adding the 1st field to found as an index here.
    foundCheck=1                      ##Setting foundCheck here.
    next                              ##next will skip all further statements from here.
  }
}
END{                                  ##Starting END block of this program from here.
  checkArray(arr)                     ##Calling function checkArray, passing the arr array to it.
  for(key in arr){                    ##Traversing through arr here.
    print key arr[key]                ##Printing the index and its value (the value already starts with OFS).
  }
}
' name.txt x.txt y.txt z.txt          ##Mentioning Input_file names here.

Another approach, using GNU awk:

$ cat script.awk
NF == 1 {
    name[$1] = $1
    for (i = 1; i < ARGC - 1; i++) {
        name[$1] = name[$1] " 0"
    }
    next
}

{
    name[$1] = gensub(/ ./, " " $2, ARGIND - 1, name[$1])
}

END {
    for (k in name) {
        print name[k]
    }
}

The output comes out in the same order as name.txt, but I do not think that holds for all kinds of input.

You can use join:

join -a1 -e0 -o '0,2.2' name.txt x.txt | join -a1 -e0 -o '0,1.2,2.2' - y.txt | join -a1 -e0 -o '0,1.2,1.3,2.2' - z.txt

This might work for you (GNU parallel and join):

cp name.txt out && t=$(mktemp) && parallel -j1 join -oauto -e0 -a1 out {} \> $t \&\& mv $t out ::: {x,y,z}.txt

The output will be in the file out.

Comments:

Where do the 0s come from? (OK, if a letter is missing from a file, you add a 0.)
This should do it for you: join -e0 -j1
This is a very neat approach. Well done. (And it is POSIX.)
One day I will fully digest it and see the elegant approach first :)
Thank you very much! This is a great solution with clear logic and concise expressions (and it is relatively fast, too!). It is very helpful.
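The serial appending performed by the parallel -j1 command above can also be sketched as an ordinary shell loop, which makes it easier to see why each step depends on the previous one. This is an illustrative rewrite under stated assumptions, not the answer's own code: the scratch directory and sample files are made up for the demonstration, and -o auto requires GNU join.

```shell
#!/bin/sh
# Sketch: append one column per data file by repeatedly joining onto a growing file.
set -e
export LC_ALL=C
work=$(mktemp -d)             # hypothetical scratch directory
cd "$work"
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt

cp name.txt out
for f in x.txt y.txt z.txt; do
    t=$(mktemp)
    # out gains one column per pass; -e0 fills the value for unmatched names
    join -o auto -e0 -a1 out "$f" > "$t"
    mv "$t" out               # the next pass reads the widened file
done

result=$(cat out)
printf '%s\n' "$result"
```

Every pass rereads and rewrites the whole accumulated file, which is the reason the question's second edit reports this approach slowing down as more files are appended.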