Combining entries from multiple files into one matrix with Python

python matrix

I am trying to combine entries from multiple files in one folder (1.20.1_Indel_allEff.vcf, 1.20.2_Indel_allEff.vcf … 1.200.1_Indel_allEff.vcf) to get a matrix like the following:

Fm  Chromosome  Position  Ref  Alt  Gene  X1.20.1  X1.20.2  X1.20.3
Fm  chrI        100007    AT   A    CAR2  0        0        0
Fm  chrX        3000676   G    T    HYM1  0        0        0.5

where X1.20.1, X1.20.2, X1.20.3 … X1.200.3 are the names of the individual files in the folder, and each column holds that file's frequency values.

I wrote a Python script (F1_comparison.py) for this.

However, I get an error: the script does not recognize some of the files in the folder, even though they all have the same format.

My command and the error I get:

python F1_comparison.py Fer1 > output.csv

Traceback (most recent call last):
    File "Fer1_comparison.py", line 18, in <module>
    f = open(f1)
    IOError: [Errno 2] No such file or directory: '1.30.2_INDEL_allEff.vcf'

Can anyone help me solve this? It would be a great help. Thanks.

You need to join the file name to the directory path:

import sys
from os import path, listdir

pth = sys.argv[1]  # directory that holds the .vcf files
myfiles = listdir(pth)  # bare file names only, without the directory part
for f1 in myfiles:
    # join -> pth/f1; the with block also closes the file for you
    with open(path.join(pth, f1)) as f:
        tpp = f1.split("_", 1)[0].split(".")
        tp = ".".join(tpp[0:3])  # same as tp = tpp[0] + '.' + tpp[1] + '.' + tpp[2]
        for line in f:
            pass  # continue your code ...
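
To see why a bare `open(f1)` fails, note that `os.listdir` returns file names without the directory part, so they only resolve when the script happens to run from inside that directory. A minimal sketch, assuming a temporary directory and a single placeholder file as stand-ins for the real data folder (the helper name `list_full_paths` is made up for illustration):

```python
import os
import tempfile


def list_full_paths(directory):
    """Return the full path of every entry in `directory`, sorted by name."""
    return [os.path.join(directory, name) for name in sorted(os.listdir(directory))]


# Hypothetical stand-in for the data folder, with one placeholder file in it.
with tempfile.TemporaryDirectory() as folder:
    name = "1.30.2_INDEL_allEff.vcf"
    with open(os.path.join(folder, name), "w") as handle:
        handle.write("#placeholder\n")

    bare_names = os.listdir(folder)       # file names only, no directory part
    full_paths = list_full_paths(folder)  # names joined to the folder
    # A bare name is resolved against the current working directory,
    # while the joined path points into the folder itself.
    bare_exists = os.path.exists(bare_names[0])
    full_exists = os.path.exists(full_paths[0])
```

Here `full_exists` is true, while `bare_exists` depends on where the script is started, which is the situation behind the `IOError` in the question.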

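You can make the code more concise and efficient by using slicing, unpacking, and str.format, and by not splitting the same string repeatedly: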

import sys
from collections import defaultdict
from os import path, listdir

snps = defaultdict(lambda: defaultdict(str))
pth = sys.argv[1]  # directory that holds the .vcf files
myfiles = listdir(pth)  # bare file names only, without the directory part

# One list drives both the header and the rows, so the columns cannot drift apart
# (the hand-written header was missing 1.40.2).
tp_list = ['1.20.1', '1.20.2', '1.20.3', '1.30.1', '1.30.2', '1.30.3', '1.40.1', '1.40.2', '1.40.3', '1.50.1', '1.50.2',
           '1.50.3', '1.60.1', '1.60.2', '1.60.3', '1.90.1', '1.90.2', '1.90.3', '1.100.1', '1.100.2', '1.100.3',
           '1.130.1', '1.130.2', '1.130.3', '1.200.1', '1.200.2', '1.200.3']
fixed_cols = ["Fermentor", "Trajectory", "Chromosome", "Position", "Mutation", "Gene", "Effect"]

with open("Fer1_INDELs_clones_filtered.csv", "w") as out:  # file to write all filtered data to
    out.write("\t".join(fixed_cols + tp_list) + "\n")
    for f1 in myfiles:
        with open(path.join(pth, f1)) as f:  # join -> pth/f1
            tpp = f1.split("_", 1)[0].split(".")
            tp = ".".join(tpp[0:3])  # same as tp = tpp[0] + '.' + tpp[1] + '.' + tpp[2]
            for line in f:
                ls = line.split()
                if "#" not in line and len(ls) > 6:
                    # use unpacking and slicing
                    chrom, pos, ref, alt, freq, typ, gene = ls[:7]
                    if len(alt) == 1:
                        # use str.format
                        snps["{}_{}-{}_{}_{}_{}".format(pos, ref, alt, chrom, gene, typ)][tp] = freq
                    else:
                        # use enumerate; split freq once, and build the key with the same
                        # ref-alt layout as above so it splits into the same five fields later
                        freqs = freq.split(",")
                        for ind, k in enumerate(alt.split(",")):
                            snps["{}_{}-{}_{}_{}_{}".format(pos, ref, k, chrom, gene, typ)][tp] = freqs[ind]
    traj = 1
    for pos in sorted(snps):
        # split once and again use unpacking and slicing
        pos1, mut, chrom, gene, typ = pos.split("_")[:5]
        tp_string = ""
        for tp in tp_list:
            if snps[pos][tp]:  # an empty value is falsy, no need to check len
                tp_string += "\t{}".format(snps[pos][tp])
            else:
                tp_string += "\t0/0"

        # tp_string already starts with a tab; F1 and the trajectory number are separate columns
        out.write("F1\t{}\t{}\t{}\t{}\t{}\t{}{}\n".format(traj, chrom, pos1, mut, gene, typ, tp_string))
        traj += 1
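
The multi-allele branch above pairs each comma-separated ALT allele with the frequency at the same index. A small sketch of that pairing in isolation, using `zip` instead of indexing; the helper name `allele_frequencies` and the sample values are made up for illustration, and it assumes both fields use commas as separators:

```python
def allele_frequencies(alt, freq):
    """Pair each comma-separated ALT allele with the frequency at the same position."""
    alleles = alt.split(",")
    freqs = freq.split(",")
    if len(alleles) != len(freqs):
        # Fail loudly instead of silently pairing the wrong values.
        raise ValueError("ALT and frequency fields have different lengths")
    return list(zip(alleles, freqs))


# Made-up sample values: one record with two alternative alleles, one with a single allele.
pairs = allele_frequencies("A,ATT", "0.5,0.25")
single = allele_frequencies("T", "0.5")
```

Because a field without commas splits into a one-element list, the same helper covers the single-allele case as well.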

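Hi Padraic, I checked my command. I ran the code from the same directory, but I still get an error.

@pm2ring: I fixed the indentation. That error was caused by the indentation getting lost when I copied the code here. Can you suggest an alternative? Thanks.

I have updated the code with the corrections.

Hi Padraic, I am still having some problems with the code and cannot work out what is wrong. Could you help?

Sure, what is the problem?

The error is the same one I got at the start. The code does not recognize some of the files even though they are in the folder.

This is what I was looking for, bro. Thanks a lot! I appreciate your time and help.

Just one small problem. If you look, the columns are mixed up in some cases; I mean the order is not maintained throughout. Can you suggest a fix for this?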
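
On the column-order question in the last comment, one possible approach is to derive the column labels from the file names, sort them numerically, and use that single list for both the header and every row. This is only a sketch: the helper names are hypothetical, the sample file names and values are invented, and it assumes every label consists of dot-separated integers.

```python
def timepoint_label(filename):
    """'1.20.1_Indel_allEff.vcf' -> '1.20.1' (same rule as the answer's code)."""
    return ".".join(filename.split("_", 1)[0].split(".")[:3])


def numeric_key(label):
    """Compare the dot-separated parts as integers, so '1.100.1' sorts after '1.90.3'."""
    return tuple(int(part) for part in label.split("."))


def ordered_labels(filenames):
    """Unique time-point labels in numeric order."""
    return sorted({timepoint_label(name) for name in filenames}, key=numeric_key)


def format_row(fixed_fields, values_by_label, labels, missing="0/0"):
    """Build one tab-separated row whose value columns follow `labels` exactly."""
    cells = list(fixed_fields) + [values_by_label.get(label, missing) for label in labels]
    return "\t".join(cells)


# Invented sample input, deliberately out of order.
files = ["1.100.1_Indel_allEff.vcf", "1.20.2_Indel_allEff.vcf", "1.90.3_Indel_allEff.vcf"]
labels = ordered_labels(files)
header = "\t".join(["Chromosome", "Position"] + labels)
row = format_row(["chrX", "3000676"], {"1.90.3": "0.5"}, labels)
```

Since the header and the rows are built from the same `labels` list, a missing or extra file changes both together rather than shifting the values under the wrong heading.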
