Bash: keeping only the entries shared across multiple files


bash perl


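I have hundreds of files, each with a different number of entries (>xxxx), and I want to keep only the entries that are shared by all of the files, in each file separately. I'm not sure what the best way to do this is, maybe Perl! I tried bash's sort and uniq but didn't get the right answer. In all files, the ID format is > followed by 4 characters.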

1.fa

2.fa

3.fa

The final result for this example would be:

1.fa

2.fa

3.fa

Here is a Perl solution that may help you:

use feature qw(say);
use strict;
use warnings;

my $file_dir = 'files';
chdir $file_dir;
my @files = <*.fa>;

my $num_files = scalar @files;
my %ids;
for my $file (@files) {
    open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
    while (my $id = <$fh>) {
        chomp $id;
        chomp (my $sequence = <$fh>);
        $ids{$id}++;
    }
    close $fh;
}

for my $file (@files) {
    open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
    my $new_name = $file . '.new';
    open ( my $fh_write, '>', $new_name ) or die "Could not open file '$new_name': $!";
    while (my $id = <$fh>) {
        chomp $id;
        chomp (my $sequence = <$fh>);
        if ( $ids{$id} == $num_files ) {
            say $fh_write $id;
            say $fh_write $sequence;
        }
    }
    close $fh_write;
    close $fh;
}

This Perl program will do as you ask. It uses Perl's built-in in-place edit facility and renames the original files to 1.fa.bak etc. Blank lines in the data are not a problem as long as the sequence is always on the line immediately after the ID.

use strict;
use warnings 'all';

my @files = glob '*.fa';

printf "Processing %d file%s\n", scalar @files, @files == 1 ? "" : "s";

exit if @files < 2;

my %ids;

{
    local @ARGV = @files;

    while ( <> ) {
        ++$ids{$1} if /^>(\S+)/;
    }
}

# remove keys that aren't in all files
delete @ids{ grep { $ids{$_} < @files } keys %ids };
my $n = keys %ids;
printf "%d ID%s common to all files\n", $n, $n == 1 ? '' : "s";

exit unless $n;

{
    local @ARGV = @files;
    local $^I = '.bak';

    while ( <> ) {

        next unless /^>(\S+)/ and $ids{$1};

        print;
        print scalar <>;
    }
}
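
Since the question mentions sort and uniq, the same two-pass idea can also be sketched in shell. This is only a sketch under the assumptions stated in the thread: every ID line starts with > and contains no whitespace, and the sequence is on the line right after it. The function name keep_shared and the .new suffix are chosen here for illustration. IDs are de-duplicated per file before counting, so an ID repeated inside one file is counted once.

```shell
#!/bin/sh
# keep_shared FILE...
# Hypothetical helper: for each FILE, write FILE.new containing only the
# records whose ID line occurs in every FILE. Originals are left untouched.
keep_shared() {
    n=$#
    ids=$(mktemp) || return 1

    # Pass 1: list the distinct IDs of each file, then count how many
    # files each ID appears in and keep those that appear in all n.
    for f in "$@"; do
        awk '/^>/ { print $1 }' "$f" | sort -u
    done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }' > "$ids"

    # Pass 2: copy an ID line and the line after it only for shared IDs.
    for f in "$@"; do
        awk -v ids="$ids" '
            BEGIN { while ((getline id < ids) > 0) keep[id] = 1 }
            /^>/ && ($1 in keep) { print; if ((getline seq) > 0) print seq }
        ' "$f" > "$f.new"
    done

    rm -f "$ids"
}

# Usage (in the directory holding the files):
#   keep_shared *.fa
```

Writing to .new files rather than editing in place mirrors the first Perl answer; the results can be inspected before anything is overwritten.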

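So all the files are identical? Then why not keep just one file?
No, the entries (>xxxx) will be the same but the lines below them are not, and I need to analyze them separately.
OK, I see the pattern now! It can be done in Perl using a hash. But it will need more than one line.
What have you tried?
How many IDs are there across all files? Is there always exactly one line of data between ID lines?
@Borodin Between 22 and 30. Yes, there is always one line between IDs.
Thanks for your script. I got this error: "Use of uninitialized value $sequence in chomp at extract.pl line 27, <$fh> line 175." Any idea? I think $sequence is not defined?! Or $fh, $FILE?
Maybe there are blank lines in the file, or the file has an odd number of lines? I assumed that there is always an ID on one line and then the sequence on the next line. If you test the script on the 3 sample files given in your question, do you still get the warning?
I have problems with my files, I'm fixing them.
The first chomp (my $sequence = <$fh>) could just be scalar <$fh>, since you don't use the value.
@HåkonHægland After creating all the files in the correct format (ID on the first line, then the characters on the second line, then another ID) I still get the error: Use of uninitialized value $sequence in chomp at ...
If the same >abcd appears more than once in the same file, your code won't work. (Not hard to fix, though. I don't know whether that can happen, and the OP should probably be asked; just mentioning it in case.)
@Dada: If an ID appears twice then it's not much of an ID, is it!
I tend to read ID as identifier, and I think it's fine to use the same ID several times (unless it's HTML/XML or the like)... Is "unique" part of the meaning of "ID"?
@Dada: Yes, especially in computing. An identifier identifies something, and if two or more things have the same name then you can't tell which one is being referred to, so it can't serve as an identifier.
@Borodin I got this message: "0 IDs common to all files". FYI, each ID appears no more than once in each file, and all files start with an ID (>xxxx) followed by a line of text, then another ID.
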
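
The two concerns raised in the comments (a sequence line missing after an ID, and an ID repeated within one file) can be checked before running either script. Below is a minimal sketch, assuming the layout described in the question: an ID line starting with >, followed by exactly one sequence line. The function name check_fa and the message wording are made up for this example.

```shell
#!/bin/sh
# check_fa FILE...
# Hypothetical helper: report lines that break the "ID line, then one
# sequence line" layout. Returns 0 if every file is clean, 1 otherwise.
check_fa() {
    awk '
        function report(msg) { printf "%s:%d: %s\n", FILENAME, FNR, msg; bad = 1 }
        FNR == 1 {
            # A new file starts: close off the previous one and reset state.
            if (expect_seq) { printf "%s: last ID has no sequence line\n", prev; bad = 1 }
            expect_seq = 0; prev = FILENAME
            split("", seen)        # forget the IDs of the previous file
        }
        /^[ \t\r]*$/ { report("blank line"); next }
        /^>/ {
            if (expect_seq) report("previous ID has no sequence line")
            if (seen[$1]++) report("duplicate ID " $1)
            expect_seq = 1; next
        }
        {
            if (!expect_seq) report("sequence line without an ID")
            expect_seq = 0
        }
        END {
            if (expect_seq) { printf "%s: last ID has no sequence line\n", prev; bad = 1 }
            exit bad + 0
        }
    ' "$@"
}

# Usage:
#   check_fa *.fa && echo "all files look consistent"
```

A file with an odd number of lines or a stray blank line shows up here with its file name and line number, which is easier to act on than the chomp warning from the first script.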
Sample files from the question:

1.fa
>abcd
ATGCAATA
>efgh
TAACGTAA
>ijkl
TGCAA

2.fa
>abcd
CTGAATGCC

3.fa
>abcd
AAATGCGCG

Expected 1.fa:
>abcd
ATGCAATA
