Comparing part of a line with every line of another file in Python

python python-3.x vcf-variant-call-format

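I am trying to compare each line of one file against another file and write all matching lines from the second file to an output file. For example, here is the first file: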

chr8    18      .       T       T       *       *
chr8    29      .       C       T       .       .
chr9    21      .       TA      T       .       .
chr18    22      .       C       T       .       .
chr18    23      .       A       G       .       .

And here is the other file:

chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    ensembl CDS     27     1227    .       +       0       gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";
chr8    ensembl CDS     41      1006    .       -       0       gene_id "ENSCAFG00000029302"; gene_version "1"; transcript_id "ENSCAFT00000048043"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000036901"; protein_version "1";

The output I want is:

chr8    18      .       T       T       *       *
chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    29      .       C       T       .       .   
chr8    ensembl CDS     1       1042    .       -       0       gene_id "ENSCAFG00000031632"; gene_version "1"; transcript_id "ENSCAFT00000048171"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000042624"; protein_version "1";
chr8    ensembl CDS     27     1227    .       +       0       gene_id "ENSCAFG00000032228"; gene_version "1"; transcript_id "ENSCAFT00000037896"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00000033535"; protein_version "2";

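So I want to take each line of the first file and check it against every line of the second file. A line matches if column 1 is the same in both files and the second number in file 1 falls within the range given by columns 4 and 5 of file 2. For each line of the first file that has matches, I want to write that line to a new file, followed by all of its matching lines from file 2. Here is what I tried: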

opt=''
with open('file1.vcf') as vfh:
    with open('file2.gtf') as gfh:
        for line in vfh:
                ct=0
                vll=line.split('\t')
                for gline in gfh:
                    gll=gline.split('\t')
                    if vll[0] == gll[0]:
                        if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):
                            while ct < 1:
                                opt+=line
                                ct+=1
                            opt+=gline
with open('out.txt','w') as fh:
    fh.write(opt)


But I never get the output I want.
I believe your indexing is wrong:
if (int(vll[1]) > int(gll[3])) and (int(vll[1]) < int(gll[4])):

`vll[1]` is 18.
`gll[3]` is 1042, because "ensembl CDS" does not appear to be separated by \t.
Please try using a debugger and verify the indexes.
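Without a debugger, one quick way to check the indexes is to print each field with its position using `repr()`, which shows a tab as `\t` rather than as whitespace. This is only a sketch: `show_fields` is a helper name made up here, and the sample line is a shortened copy of the first GTF line from the question, assumed to be tab-delimited.

```python
def show_fields(line):
    """Print each tab-separated field with its index and return the list."""
    fields = line.rstrip('\n').split('\t')
    for i, field in enumerate(fields):
        # repr() makes stray spaces or tabs inside a field visible.
        print(i, repr(field))
    return fields


# Assumed sample: shortened from the question's GTF data, tab-delimited.
sample = 'chr8\tensembl\tCDS\t1\t1042\t.\t-\t0\tgene_id "ENSCAFG00000031632";\n'
fields = show_fields(sample)
```

If the file really is tab-delimited, the start and end coordinates appear at indexes 3 and 4. If "ensembl CDS" were separated by spaces instead, they would shift to indexes 2 and 3.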
Found the problem: I just had to move the `with open` statement. I also added a check to handle the comment lines in the original file:
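opt = ''
with open('a1.vcf') as vfh:
    for line in vfh:
        # Skip VCF header/comment lines, which start with '#'.
        if not line.startswith('#'):
            ct = 0
            vll = line.split('\t')
            # Reopen the GTF for every VCF line so the inner loop starts from the top.
            with open('cds.gtf') as gfh:
                for gline in gfh:
                    gll = gline.split('\t')
                    if vll[0] == gll[0]:
                        if int(gll[3]) < int(vll[1]) < int(gll[4]):
                            # Write the VCF line once, before its first match.
                            if ct < 1:
                                opt += line
                                ct += 1
                            opt += gline
with open('out.txt', 'w') as fh:
    fh.write(opt)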
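Moving the `with open` helps because a file object is an iterator that is exhausted after one full pass. In the first attempt the inner loop therefore only ran for the first VCF line. Reopening the GTF for every VCF line works, but it rereads the whole file each time. A possible alternative, sketched below, reads the GTF once and groups its lines by chromosome. The names `annotate` and `annotate_files` are made up for this example, and the code assumes both files are tab-delimited with the columns shown in the question. The comparison is kept strict, as in the thread; GTF start and end coordinates are inclusive, so `<=` may be what is actually wanted.

```python
from collections import defaultdict


def annotate(vcf_lines, gtf_lines):
    """Return each matching VCF line followed by the GTF lines that contain it."""
    # Index the GTF features by chromosome once, instead of rereading the file.
    features = defaultdict(list)
    for gline in gtf_lines:
        if gline.startswith('#') or not gline.strip():
            continue
        fields = gline.split('\t')
        features[fields[0]].append((int(fields[3]), int(fields[4]), gline))

    out = []
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue
        chrom, pos = line.split('\t')[:2]
        pos = int(pos)
        # Strict comparison, as in the thread; use <= to include the endpoints.
        hits = [g for start, end, g in features.get(chrom, []) if start < pos < end]
        if hits:
            out.append(line)
            out.extend(hits)
    return out


def annotate_files(vcf_path, gtf_path, out_path):
    """Hypothetical wrapper: read both files and write the combined output."""
    with open(vcf_path) as vfh, open(gtf_path) as gfh:
        result = annotate(vfh, gfh)
    with open(out_path, 'w') as fh:
        fh.writelines(result)
```

With the file names from the thread, this would be called as `annotate_files('a1.vcf', 'cds.gtf', 'out.txt')`.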

I thought so too, but for some reason it really is a \t; it just prints as a space. `>>> line 'chr8\tensembl\tCDS\t400\t1227\t.\t+\t0\tgene_id`