Python: find a pattern and print the next column (two files)
Tags: python, perl, bash, awk, bioinformatics
I want to look up the patterns from column 1 of file 1 in file 2, and then print column 2 of file 1 next to the matching line of file 2. File 1 and file 2 each have two tab-separated columns; the desired output has three tab-separated columns. Temporary bash code (not working), but I am open to other languages:
while read vl; do grep "$vl" File2 ; done < File1
Thanks, Bernardo
Hope this helps:
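# Read file 2 once, stripping surrounding whitespace from each line.
with open("file2.txt") as file2:
    file2_contents = [line.strip() for line in file2]

with open("file1.txt") as file1:
    for each_line in file1:
        fields = each_line.split('\t')
        code = fields[0].strip()
        part = fields[1].strip()
        # Print every file 2 line that contains the code, followed by the file 1 label.
        for each_file2_contents in file2_contents:
            if code in each_file2_contents:
                print(each_file2_contents + '\t' + part)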
Something like this sounds like what you want:
awk '
BEGIN { FS=OFS="\t" }
NR==FNR { map[$1] = $2; next }
{
    for (key in map)
        if ($0 ~ key)
            $0 = $0 OFS map[key]
    print
}
' file1 file2
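For readers following the python tag, the same idea (load file 1 into a map, then scan each file 2 line for every key) can be sketched in Python. This is only an illustration under assumptions: the function name `annotate` is hypothetical, the sample rows are borrowed from the data in the Perl answer, and the key is tested as a plain substring rather than as a regular expression, which is what the awk `~` operator uses.

```python
def annotate(file1_lines, file2_lines):
    """Append the file 1 label to every file 2 line that contains a file 1 key."""
    # Build key -> label from the two tab-separated columns of file 1.
    mapping = {}
    for line in file1_lines:
        key, label = line.rstrip("\n").split("\t", 1)
        mapping[key] = label

    out = []
    for line in file2_lines:
        line = line.rstrip("\n")
        for key, label in mapping.items():
            if key in line:  # plain substring test, not a regex
                line += "\t" + label
        out.append(line)  # like the awk script, unmatched lines are kept as-is
    return out


# Sample rows borrowed from the Perl answer's data (an assumption for this demo).
file1 = ["APBX\tlung\n", "CP001321\tvirulent\n", "KK020\t-\n"]
file2 = [
    "HS372_00243\tgi|219690483|gb|CP001321.1|\n",
    "HS372_00436\tgi|529264994|gb|APBX01000055.1|\n",
]
for row in annotate(file1, file2):
    print(row)
```

Because every key is tested against every line, the cost grows with the product of the two file sizes, which should be fine for small annotation tables like these.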
In Perl:
use warnings;
use strict;
use Data::Dumper;
my $str1 = <<"EOS1";
APBW lung
APCA non virulent
ABKM lung
APBX lung
KK020 -
APBZ non virulent
AOSU lung
APBY non virulent
APBV joint; lung; CNS
CP001321 virulent
APBT virulent
APBU non-virulent
APCB moderadamente virulenta (nose)
CP005384 -
EOS1
my $str2 = <<"EOS2";
HS372_00243 gi|219690483|gb|CP001321.1|
HS372_00436 gi|529264994|gb|APBX01000055.1|
HS372_00445 gi|529256455|gb|APBT01000061.1|
HS372_00544 gi|529259149|gb|APBV01000035.1|
HS372_00545 gi|529259149|gb|APBV01000035.1|
HS372_00546 gi|529259149|gb|APBV01000035.1|
EOS2
#HS372_00243 gi|219690483|gb|CP001321.1| virulent
my %data1; my %data2;
# Map each file 1 key (column 1) to its label (column 2).
foreach my $line ( split(/\n+/, $str1) ) {
    my @f = split(/\t/, $line);
    $data1{$f[0]} = $f[1];
}
my @data2 = split(/\n+/, $str2);
# Index file 2 lines by accession (4th "|"-separated piece of column 2).
foreach my $line ( split(/\n+/, $str2) ) {
    my @f  = split(/\t/, $line);
    my @sf = split(/\|/, $f[1]);
    $data2{$sf[3]} = $line;
}
#print Dumper(\%data2);
# Try longer keys first; a key matches when it is a prefix of the accession.
foreach my $s1 ( sort { length($b) <=> length($a) } keys %data1 ) {
    foreach my $d2 (@data2) {
        my @f  = split(/\t/, $d2);
        my @sf = split(/\|/, $f[1]);
        if ($sf[3] =~ m!^$s1!is) {
            #warn "found $s1 in $d2\n";
            print "$d2\t$data1{$s1}\n";
        }
    }
}
output:
HS372_00243 gi|219690483|gb|CP001321.1| virulent
HS372_00445 gi|529256455|gb|APBT01000061.1| virulent
HS372_00544 gi|529259149|gb|APBV01000035.1| joint; lung; CNS
HS372_00545 gi|529259149|gb|APBV01000035.1| joint; lung; CNS
HS372_00546 gi|529259149|gb|APBV01000035.1| joint; lung; CNS
HS372_00436 gi|529264994|gb|APBX01000055.1| lung
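A rough Python rendering of the Perl logic above may help readers who do not know Perl. It is a sketch, not the answer author's code: the names `accession` and `prefix_join` are hypothetical, the data is a subset of the sample rows, and the case-insensitive regex `m!^$s1!is` is approximated with a lowercased `str.startswith`.

```python
def accession(file2_line):
    """Return the 4th '|'-separated piece of column 2, e.g. 'CP001321.1'."""
    column2 = file2_line.split("\t")[1]
    return column2.split("|")[3]


def prefix_join(labels, file2_lines):
    """Pair each file 2 line with the label whose key prefixes its accession."""
    rows = []
    # Longest keys first, mirroring the sort in the Perl loop.
    for key in sorted(labels, key=len, reverse=True):
        for line in file2_lines:
            if accession(line).lower().startswith(key.lower()):
                rows.append(line + "\t" + labels[key])
    return rows


# Subset of the sample data from the Perl answer.
labels = {"APBX": "lung", "APBV": "joint; lung; CNS", "CP001321": "virulent"}
file2 = [
    "HS372_00243\tgi|219690483|gb|CP001321.1|",
    "HS372_00436\tgi|529264994|gb|APBX01000055.1|",
    "HS372_00544\tgi|529259149|gb|APBV01000035.1|",
]
for row in prefix_join(labels, file2):
    print(row)
```

As in the Perl version, the output follows the order of the keys (longest first), not the order of the lines in file 2.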
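It's unclear what pattern is used to go from file 1 to file 2. Sometimes it is identical, sometimes it is the first four letters... The patterns in file 1 have different lengths.
I mean that, given the example file 1, nothing should be written when APBX01000055 is found in file 2, because APBX01000055 is not in file 1.
OK, I understand. The problem is that I want to print "lung" on finding just the partial string "APBX".
That is inconsistent. If we accept that, it means we have to check 4 characters. But then what happens with CP001321?
Is that Perl? I'm new here!
It's Python, since you posted the python tag.
No need, it's fast enough; if I wanted to speed it up, I certainly wouldn't use a break statement.
Since this answer has been accepted, the output is no longer an issue.
Because the OP needs to convert joint; lung; CNS to jointlungCNS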
awk 'BEGIN { FS = OFS = "\t" } FNR==NR{a[$1]=$0;next}($1 in a){print a[$1],$2,$3}' File1 File2
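The one-liner above only joins rows whose first column is identical in both files, whereas the comments ask how an exact key such as CP001321 and a four-letter key such as APBX can both be handled. One possible way to reason about it, sketched below with a hypothetical helper `match_kind`, is to strip the version suffix from the accession and then distinguish an exact match from a prefix match. The assumption that the part after the first "." is a version number comes from the sample data, not from the question.

```python
def match_kind(key, accession):
    """Classify how a file 1 key relates to a file 2 accession."""
    base = accession.split(".", 1)[0]  # drop a trailing version such as ".1"
    if key == base:
        return "exact"
    if base.startswith(key):
        return "prefix"
    return "none"


print(match_kind("CP001321", "CP001321.1"))    # exact
print(match_kind("APBX", "APBX01000055.1"))    # prefix
print(match_kind("APBT", "APBX01000055.1"))    # none
```

Treating both "exact" and "prefix" as hits reproduces the behaviour of the accepted prefix-based answers; treating only "exact" as a hit corresponds to the stricter reading raised in the comments.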