Shell: how to convert the characters in each column into sub-columns without repetition

I have a data file. It looks like this:

input1:

1 20022 44444 44444
2 31012 22233 44444
3 31012 22233 00444
4 20022 44444 00444
5 20022 44444 00444
6 20022 44444 00444
7 31012 44444 00444 
8 31012 44444 00444
9 31012 87634 44444
10 20022 87634 44444

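I want to turn every character string (distinct value) in each column into a sub-column, and put 1 or 0 in each row to indicate whether that value was observed in that particular row:

output1: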

   c1.20022 c1.31012 c2.44444 c2.22233 c2.87634 c3.44444 c3.00444
1         1        0        1        0        0        1        0
2         0        1        0        1        0        1        0
3         0        1        0        1        0        0        1
4         1        0        1        0        0        0        1
5         1        0        1        0        0        0        1
6         1        0        1        0        0        0        1
7         0        1        1        0        0        0        1
8         0        1        1        0        0        0        1
9         0        1        0        0        1        1        0
10        1        0        0        0        1        1        0

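My real data has more than 100,000 columns and rows. I should also mention that I want to run this on Linux.

Part 2: I want to drop the values that occur fewer than one hundred times in a column, and I do not want any sub-column for them. For example, in my sample input file I want to drop the values that occur fewer than 3 times:

input2: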

 1 20022 44444 44444
 2 31012  NA   44444
 3 31012  NA   00444
 4 20022 44444 00444
 5 20022 44444 00444
 6 20022 44444 00444
 7 31012 44444 00444 
 8 31012 44444 00444
 9 31012  NA   44444
10 20022  NA   44444

And output:

output2:
   c1.20022 c1.31012 c2.44444 c3.44444 c3.00444
1         1        0        1        1        0
2         0        1        0        1        0
3         0        1        0        0        1
4         1        0        1        0        1
5         1        0        1        0        1
6         1        0        1        0        1
7         0        1        1        0        1
8         0        1        1        0        1
9         0        1        0        1        0
10        1        0        0        1        0

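To get from my first input (input1) directly to the last output (output2), what should I change in the shell script given in the answer below?

A small update: suppose that in my input every two rows represent one individual (rows 1 and 2 belong to individual 1).

In my output.txt I want each individual to appear only once. As before, every value in each column becomes a sub-column, but now I want 2, 1 or 0 in the rows, giving the number of times the individual carries the value of that sub-column. At the same time I want to drop the values that occur fewer than 3 times in a column (here, 00000 and 11112 in column 2):

output1.txt: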

   c1.20022 c1.31012 c2.44444 c3.44444 c3.00444
1         1        1        2        2        0
2         1        1        1        0        2
3         1        0        1        0        2
4         0        2        2        0        2
5         1        1        0        2        0

Here I put spaces between the numbers to make them easier to read, but the spaces are not actually needed (e.g. first line: 11220).

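As a non-Fortran solution, I wrote a (g)awk script that does what you need; your file has to be given to it twice. On the first pass it builds an array of the labels that occur in each column, which is the only memory-heavy step of the process. On the second pass the file is processed line by line, each line independently, so I guess its usefulness depends on the distribution of your values.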
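The two-pass pattern works because awk's NR counts records across all input files while FNR restarts at 1 for each file, so NR==FNR holds only while the first copy of the file is being read. A minimal sketch of just that idiom, using plain POSIX awk and a made-up helper name (two_pass_demo is not part of the answer):

```shell
#!/bin/sh
# two_pass_demo FILE: hypothetical helper illustrating the NR==FNR idiom.
# Pass 1 only counts the rows; pass 2 prints each row number together
# with the total collected in pass 1.
two_pass_demo() {
  awk '
    NR == FNR { total++; next }      # first copy of the file
    { print FNR " of " total }       # second copy of the file
  ' "$1" "$1"
}
```

Whatever is collected during the first copy (here only a row count) is fully known by the time the first row of the second copy is printed, which is what lets a header be written before any data row.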

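Important note: in order to be able to loop over the second index, the script uses true 2d array syntax rather than the pseudo-multidimensional arrays of standard awk. This works in gawk (arrays of arrays were introduced in version 4.0), but other awk flavours might not support it.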
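If only a POSIX awk is at hand, one possible workaround is to emulate the two-dimensional array with comma subscripts, which awk joins into a single string key using SUBSEP. The sketch below is an illustration under that assumption, not the script from this answer; the function name onehot_portable and the arrays seen, n and lab are made up for it. It remembers the labels of each column in first-seen order, so the header and the rows always agree, and it forces string comparison so that 00444 and 444 stay distinct.

```shell
#!/bin/sh
# onehot_portable FILE: hypothetical portable variant of the 0/1 encoding.
# Uses seen[col, value] (one string key joined with SUBSEP) instead of
# gawk arrays of arrays, and keeps labels in first-seen order.
onehot_portable() {
  awk '
    NR == FNR {                          # pass 1: collect labels per column
      for (i = 2; i <= NF; i++) {
        if (!((i, $i) in seen)) {
          seen[i, $i] = 1
          n[i]++                         # number of labels in column i
          lab[i, n[i]] = $i ""           # k-th label, stored as a string
        }
      }
      next
    }
    FNR == 1 {                           # pass 2, first row: print header
      line = ""
      for (i = 2; i <= NF; i++) {
        for (k = 1; k <= n[i]; k++) {
          line = line " c" (i - 1) "." lab[i, k]
        }
      }
      print line
    }
    {                                    # pass 2: one 0/1 flag per label
      line = FNR
      for (i = 2; i <= NF; i++) {
        for (k = 1; k <= n[i]; k++) {
          line = line " " ((($i "") == lab[i, k]) ? 1 : 0)
        }
      }
      print line
    }
  ' "$1" "$1"
}
```

Called as onehot_portable infile, it prints a header line followed by one line of 0/1 flags per input row, separated by single spaces.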

foo.awk:

#!/usr/bin/gawk -f

# Pass 1 (first copy of the file): collect the labels seen in each column
NR==FNR{
  for(i=2; i<=NF; i++){
    labels[i][$i]=1;
  }
}

# Pass 2 (second copy of the file): do the actual printing
NR!=FNR{
  if(FNR==1){   # print the header before the first data row
    printf "       ";
    for(i=2; i<=NF; i++){   # i corresponds to columns in input
      for(label in labels[i]){
        printf " c%d.%s ",i-1,label;  # note i-1
      }
    }
    print ""; # newline
  }

  printf "%10d", FNR; # column 1: line number
  for(i=2; i<=NF; i++){
    for(label in labels[i]){  # loop over every possible label in column i
      if($i==label){
        printf "    1     ";  # 1 if same
      }
      else {
        printf "    0     ";  # 0 if different
      }
    }
  }
  print ""; # newline
}

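After making bar.sh executable, run it as

./bar.sh infile

where "infile" should be replaced with the actual name of your input file. Obviously you can skip the shell script and just call

gawk -f foo.awk infile infile

but I am far too lazy to type that more than once.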

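Also note that you will probably want to remove most of the whitespace in the printf commands. It is there to make the output look pretty, but you are unlikely to inspect the output by eye; you will more likely use some automated post-processing. All that whitespace will bloat an output file that is already huge. So I suggest keeping a single space at the start of each printf format to separate the columns from each other, and removing the rest.

Output: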

        c1.20022  c1.31012  c2.44444  c2.87634  c2.22233  c3.00444  c3.44444
         1    1         0         1         0         0         0         1
         2    0         1         0         0         1         0         1
         3    0         1         0         0         1         1         0
         4    1         0         1         0         0         1         0
         5    1         0         1         0         0         1         0
         6    1         0         1         0         0         1         0
         7    0         1         1         0         0         1         0
         8    0         1         1         0         0         1         0
         9    0         1         0         1         0         0         1
        10    1         0         0         1         0         0         1

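Update regarding your latest question:

"I want to drop the values that occur fewer than one hundred times in each column, and I do not want any sub-column for them. For example, in my sample input file I want to drop the values that occur fewer than 3 times."

This is your lucky day, since the script above needs only a simple change to do this. We turn the labels[i][label] variable from an indicator into a counter, that is, we keep incrementing its value every time we meet the same label. Then during the second pass we simply skip the labels that occur at most 2 times.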
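To see which sub-columns would survive a given threshold before running the full job, the counting step can be tried on its own. This is a hedged sketch in plain POSIX awk rather than part of the answer: the helper name kept_labels and the variable min are assumptions (the script in this answer hard-codes the threshold 3), and labels are listed in first-seen order.

```shell
#!/bin/sh
# kept_labels FILE MIN: hypothetical helper that lists the sub-columns
# surviving the threshold, i.e. values seen at least MIN times in
# their column. MIN stands in for the hard-coded 3 of the script.
kept_labels() {
  awk -v min="$2" '
    {
      for (i = 2; i <= NF; i++) {
        if (!((i, $i) in count)) {       # remember first-seen order
          n[i]++
          lab[i, n[i]] = $i ""
        }
        count[i, $i]++                   # counter instead of indicator
      }
      if (NF > maxnf) maxnf = NF
    }
    END {
      for (i = 2; i <= maxnf; i++) {
        for (k = 1; k <= n[i]; k++) {
          if (count[i, lab[i, k]] >= min) {
            print "c" (i - 1) "." lab[i, k]
          }
        }
      }
    }
  ' "$1"
}
```

For the ten-row sample input with a threshold of 3, this lists c1.20022, c1.31012, c2.44444, c3.44444 and c3.00444, one per line, which are the five sub-columns of the requested output2.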

Updated foo.awk:

#!/usr/bin/gawk -f

# Pass 1 (first copy of the file): count the labels seen in each column
NR==FNR{
  for(i=2; i<=NF; i++){
    labels[i][$i]++; # counter instead of indicator
  }
}

# Pass 2 (second copy of the file): do the actual printing
NR!=FNR{
  if(FNR==1){   # print the header before the first data row
    printf "       ";
    for(i=2; i<=NF; i++){   # i corresponds to columns in input
      for(label in labels[i]){
        if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
        printf " c%d.%s ",i-1,label;  # note i-1
      }
    }
    print ""; # newline
  }

  printf "%10d", FNR; # column 1: line number
  for(i=2; i<=NF; i++){
    for(label in labels[i]){  # loop over every possible label in column i
      if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
      if($i==label){
        printf "    1     ";  # 1 if same
      }
      else {
        printf "    0     ";  # 0 if different
      }
    }
  }
  print ""; # newline
}

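Update 2, regarding your twice-updated question:

"A small update: suppose that in my input every two rows represent one individual (rows 1 and 2 belong to individual 1)."

Now your data spans two lines per individual, and you want to process them together. Note that as the problem gets more complicated, so does the solution. To avoid complications, I assume that every individual has exactly two lines, which seems to be the case. I also have to assume that the first line of the input file starts with 1. This also seems to be the case, but the solutions above did not make use of it. In fact, it is assumed that the individuals are numbered from 1 to the total number of individuals, without gaps. It could be done in a more general way, but I do not want to complicate it for no reason.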
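Since the approach depends on these assumptions, it may be worth checking them before starting a long run. The following is only a sketch with a made-up helper name (check_pairs); it verifies that the rows come in consecutive pairs sharing the same identifier in column 1, that no identifier is reused by a later pair, that the first identifier is 1, and that the number of rows is even. It does not check that the numbering is free of gaps.

```shell
#!/bin/sh
# check_pairs FILE: hypothetical sanity check for paired-row input.
# Exit status 0 means the layout looks as assumed, 1 means it does not.
check_pairs() {
  awk '
    NR == 1 && $1 != 1 { bad = 1 }       # first row must start with 1
    NR % 2 == 1 {                        # first row of a pair
      id = $1
      if (id in seen) bad = 1            # identifier already used
      seen[id] = 1
    }
    NR % 2 == 0 && $1 != id { bad = 1 }  # second row must share the id
    END {
      if (NR % 2) bad = 1                # a row without a partner
      exit bad + 0
    }
  ' "$1"
}
```

A call such as check_pairs infile && ./bar.sh infile would then run the conversion only when the layout looks as expected.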

The original bar.sh:

#!/bin/bash

infile=$1

gawk -f foo.awk "$infile" "$infile"

New bar.sh:

#!/bin/bash

infile=$1

cat "$infile" "$infile" | paste - - | gawk -f foo.awk

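This puts each pair of input lines next to each other, so that every individual is again on a single line, and then feeds the modified file twice to foo.awk.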
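The pairing step can be tried in isolation. In the sketch below the helper name pair_rows is made up, and any rows fed to it here are toy data, since the asker's paired input is not shown in the thread.

```shell
#!/bin/sh
# pair_rows FILE: joins rows 1+2, 3+4, ... of FILE into single lines.
# "paste - -" reads two lines from standard input per output line and
# separates them with a tab, which awk treats as ordinary whitespace.
pair_rows() {
  paste - - < "$1"
}
```

After pairing, the fields of the second row follow those of the first on the same line, which is why the script for paired rows compares field i with field i+NF/2. With an odd number of rows the pairs would shift across the boundary between the two concatenated copies, so the assumption of exactly two rows per individual matters.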

New foo.awk:

#!/usr/bin/gawk -f

# Keep count of the number of files read so far: the pasted stream holds
# the file twice, and the first row of each copy has 1 in its first column
{if($1==1) nfiles++;}

# First copy: count the labels seen in each column
nfiles==1{
  for(i=2; i<=NF/2; i++){ # go over the first half of the columns
    labels[i][$i]++;        # odd lines
    labels[i][$(i+NF/2)]++; # even lines
  }
}

# Second copy: do the actual printing
nfiles==2{
  if($1==1){   # print the header before the first data row
    printf "       ";
    for(i=2; i<=NF/2; i++){   # i corresponds to columns in input
      for(label in labels[i]){
        if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times
        printf " c%d.%s ",i-1,label;  # note i-1
      }
    }
    print ""; # newline
  }

  printf "%10d ", $1; # column 1: number of the individual
  for(i=2; i<=NF/2; i++){
    for(label in labels[i]){  # loop over every possible label in column i
      if(labels[i][label]<3) continue;  # skip labels which appear at most 2 times

      multi=0; # multiplicity of label "label" in column i for this individual
      if($i==label) multi++;
      if($(i+NF/2)==label) multi++;

      printf " %3d    ", multi;
    }
  }
  print ""; # newline
}

Output of the script from the first update (labels occurring fewer than 3 times dropped):

        c1.20022  c1.31012  c2.44444  c3.00444  c3.44444
         1    1         0         1         0         1
         2    0         1         0         0         1
         3    0         1         0         1         0
         4    1         0         1         1         0
         5    1         0         1         1         0
         6    1         0         1         1         0
         7    0         1         1         1         0
         8    0         1         1         1         0
         9    0         1         0         0         1
        10    1         0         0         0         1

Output of the new script for paired rows:

        c1.20022  c1.31012  c2.44444  c3.00444  c3.44444
         1   1       1       2       0       2
         2   1       1       1       2       0
         3   2       0       2       2       0
         4   0       2       2       2       0
         5   1       1       0       0       2

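Note that you can change

printf " %3d    ", multi;

to

Also note that my example output differs from yours, but judging from your specification my version seems to be the correct one (e.g. for individual 3 there should be a "2" in the first column).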

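I think I understand what you are asking. Do the numbers have to be treated as strings? In your example I see that one value is not reduced to another; is there any difference between the two?

@Ross Yes. It has to stay 00444, because it is a character string and not a value.

Related

That solution is written in R, not in Fortran. My data is huge, and R does not have enough memory to handle it.

Still related. Maybe a small modification would be enough. The same limits apply in Fortran and in R. Have you tried anything? This is not a code-writing service.

@Andras Deak: your solution is simply awesome!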