Bash: remove lines with more than 30% lowercase letters

bash awk sed

I am trying to process some data but can't find a working solution to my problem. I have a file that looks like this:

>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg

and many lines more....

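I want to filter out all lines, together with their corresponding headers (headers start with >), where the sequence string (the lines that do not start with >) contains 30 percent or more lowercase letters. A sequence string can span multiple lines.

So after command xy the output should look like this: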

>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct

I tried some mixture of a while loop to read the input file and then awk, grep and sed, but got no good result.

Here's one idea: set the record separator to ">" so that each header and its sequence lines are treated as a single record.

Because the input begins with a ">", the initial record is empty, so we guard the computation with NR > 1 (record number greater than 1).
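The empty first record can be seen with a small sketch like the one below. The helper name show_records and the two-record input are made up for illustration.

```shell
# Print each record number and its first field when ">" is the record separator.
# show_records is a hypothetical helper name, not part of the answer.
show_records() {
    awk 'BEGIN { RS = ">" } { printf "%d:[%s]\n", NR, $1 }'
}

# Record 1 is the empty text before the first ">"; the headers follow.
printf '>ram\ncaca\n>pam\nGAAT\n' | show_records
# 1:[]
# 2:[ram]
# 3:[pam]
```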

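To count the total number of characters, we add up the lengths of all lines after the header. To count the lowercase characters, we save the string in another variable and use gsub to replace all lowercase letters with nothing. Since gsub returns the number of substitutions it made, this is a convenient way to count.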
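The counting trick on its own might look like this. It is a minimal sketch: count_lower is a hypothetical helper name, and LC_ALL=C is an added assumption so that [a-z] matches ASCII lowercase only.

```shell
# Print "<lowercase count> <total length>" for each input line.
# The substitution runs on a copy (s), so $0 itself is left untouched.
count_lower() {
    LC_ALL=C awk '{ s = $0; n = gsub(/[a-z]/, "", s); print n, length($0) }'
}

printf 'AAAAAAAcct\n' | count_lower
# 3 10
```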

Finally, we check the ratio and print or not (adding back the initial ">" when we do print).

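BEGIN { RS = ">" }

NR > 1 {
    total = 0
    lower_cnt = 0
    # The listing was cut off after "i" in the source; the rest of the loop
    # and the final test are reconstructed from the description above.
    for (i = 2; i <= NF; ++i) {
        total += length($i)                 # field 1 is the header name, so start at 2
        s = $i
        lower_cnt += gsub(/[a-z]/, "", s)   # gsub returns the number of replacements
    }
    if (total > 0 && lower_cnt / total < 0.3) printf ">%s", $0
}

Output for the sample input:

>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
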
awk '/^>/{b=B; gsub(/[A-Z]/,"",b);
          if( length(b) < length(B) * 0.3) print H "\n" B
          H=$0; B=""; next}
     {B=( (B != "") ? B "\n" : "" ) $0}
     END { b=B; gsub(/[A-Z]/,"",b);
          if( length(b) < length(B) * 0.3) print H "\n" B
          }' YourFile

Quick and dirty; a function would better serve the repeated printing logic.

Or:
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file

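RS='>[a-z]+\n'
- sets the record separator to the line containing '>' and the name
RT
- this value is set to the text matched by the RS above
a=RT
- saves the previous RT value
n=length(gensub(/[A-Z]/,"","g"));
- gets the number of lowercase characters (what is left after the uppercase letters are removed)
if(NF && n/length*100 < 30)print a $0;
- checks that we have a value and that the percentage of lowercase characters is less than 30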
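RT and gensub are gawk extensions, so the one-liner will not run in other awk implementations. A portable variant could buffer each record and put the check in a function, roughly as sketched below. The names filter_fasta and emit are made up for illustration, and LC_ALL=C is an added assumption so that [a-z] matches ASCII lowercase only. Unlike the one-liner, this sketch does not count line breaks toward the total.

```shell
# Keep only records whose sequence is less than 30% lowercase.
filter_fasta() {
    LC_ALL=C awk '
        # Print the buffered record if under 30% of its sequence is lowercase.
        function emit() {
            if (header != "" && total > 0 && lower < total * 0.3)
                printf "%s\n%s", header, body
        }
        /^>/ { emit(); header = $0; body = ""; total = 0; lower = 0; next }
        {
            body = body $0 "\n"             # keep line breaks for printing
            total += length($0)             # but do not count them
            s = $0
            lower += gsub(/[a-z]/, "", s)   # gsub returns the replacement count
        }
        END { emit() }
    '
}
```

Assuming the data sits in a file called input.fa (a made-up name), it could be run as filter_fasta < input.fa.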
These days I no longer use sed or awk for anything longer than 2 lines.

#! /usr/bin/perl
use strict;                                # Force variable declaration.
use warnings;                              # Warn about dangerous language use.

sub filter                                 # Declare a sub-routine, a function called `filter`.
{
  my ($header, $body) = @_;                # Give the first two function arguments the names header and body.
  my $lower = $body =~ tr/a-z//;           # Count the translation of the characters a-z to nothing.
  print $header, $body, "\n"               # Print header, body and newline,
    unless $lower / length ($body) > 0.3;  # unless lower characters have more than 30%.
}

my ($header, $body);                       # Declare two variables for header and body.
while (<>) {                               # Loop over all lines from stdin or a file given in the command line.
  if (/^>/) {                              # If the line starts with >,
    filter ($header, $body)                # call filter with header and body,
      if defined $header;                  # if header is defined, which is not the case at the beginning of the file.
    ($header, $body) = ($_, '');           # Assign the current line to header and an empty string to body.
  } else {
    chomp;                                 # Remove the newline at the end of the line.
    $body .= $_;                           # Append the line to body.
  }
}
filter ($header, $body);                   # Filter the last record.

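You tried and failed? Show us your effort. Also, bash is not suitable for this, since it cannot calculate or compare floating-point values. You might as well remove the bash tag.

Do you include the header in the count? Note that lines starting with > should not be taken into account, since they are not part of the sequence string.

No, I neither look at nor count the characters in the header when computing the ratio. (That's why the for loop starts with i=2.)

Thanks for your answer :)

gawk only, but nicely done. Might add some explanation.

Thank you very much. This works well, and your description is good too; it helped me understand what is going on.

Thanks for your answer, but sadly I have never used Perl, so I can't really understand your code. Something in awk or sed would suit my needs better.

@JFS31 I added some comments. Maybe this is a good example for learning something new.

That's nice. I will have a look and try to take something from it :)

Thanks for your answer. This looks really fancy; I will try to understand what is going on there.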
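On the floating-point limitation raised in the comments: the 30% test can also be written with integers only, by comparing lower * 100 against total * 30, which shell arithmetic can handle. This is a minimal sketch under that assumption; the function name is made up for illustration.

```shell
# Succeeds (exit status 0) when fewer than 30% of the characters in $1 are
# lowercase ASCII letters. is_below_30_percent_lower is a hypothetical name.
is_below_30_percent_lower() {
    seq=$1
    total=${#seq}
    # Delete everything except a-z and count what is left.
    lower=$(printf '%s' "$seq" | LC_ALL=C tr -cd 'a-z' | wc -c | tr -d ' ')
    # Integer comparison: lower/total < 0.3  <=>  lower*100 < total*30
    [ "$total" -gt 0 ] && [ $((lower * 100)) -lt $((total * 30)) ]
}

if is_below_30_percent_lower 'GAATGTCAAAct'; then
    echo keep
else
    echo drop
fi
```

The sequence still has to be assembled from its lines before the check, which is where awk or Perl remain more convenient.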