Running a loop from a Python script that includes Linux commands and other scripts


python unix


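I am using a simple loop in a Python script to run a program once for each file in a folder. I am still developing the script, so at the moment I have 3 files in my input folder and I expect 3 files as output. For a reason I cannot explain, I only get 1.

I am showing all of my code, but I point out where the problem is.

First come some settings (no need to worry about those), then some code that runs the program (no need to worry about that either). Then comes the part where my code runs the program in a loop. getting_protein.py is a script that looks like this: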

import pandas as pd
import os

#Read the fasta document

def fasta_parser(myfile):
    with open(myfile) as f:

        header = ""
        seq = ""
        for line in f:
            if line[0] == ">":
                if seq != "":
                    yield (header[1:], seq)
                    header = line.strip()
                    seq = ""
                else:
                    header = line.strip()
            else:
                seq += line.strip()
        yield (header[1:], seq)





#Transform the document into a text string called "Sequence"
Sequence =[]
for header, seq in fasta_parser('/Users/monkiky/Desktop/control/output.align/gtdbtk.bac120.user_msa.fasta'):
    print(header,seq[:100])
    Sequence = seq




#The following code saves the length of each protein
LEN = []
import re
with open("/Users/monkiky/Desktop/GTDB/proteins_len.log", "r") as f:
    for line in f:
        # Raw string so that \d is read as a regex digit class
        secuence = re.search(r'LEN:(\d+)', line)
        if secuence:
            LEN.append(secuence.group(1))

# To save the name of the protein

Names = []
with open("/Users/monkiky/Desktop/GTDB/proteins_len.log", "r") as f:
    for i, line in enumerate(f):
        if i > 19:  # skip the first 20 lines of the log
            Names.append(line.split(" ")[3])


Names = Names[:-9]  # The last 9 lines in the document are not genes.
#Names
#['Ribosomal_S9:',
# 'Ribosomal_S8:',
# 'Ribosomal_L10:',
# 'GrpE:',
# 'DUF150:',
# 'PNPase:',
# 'TIGR00006:',
# ...


# To create a df with both lists

bac120 = {'protein_name':Names,'LEN':LEN}
df = pd.DataFrame(bac120)
# Calculate the protein sequences in the concatenated
df['LEN'] = df['LEN'].astype(int)
s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
#print (df)

#protein_name  LEN  Start    End
#0     Ribosomal_S9:  121      0    121
#1     Ribosomal_S8:  129    121    250
#2    Ribosomal_L10:  100    250    350
#3             GrpE:  166    350    516
#4           DUF150:  141    516    657
#..              ...  ...    ...    ...
#115      TIGR03632:  117  40149  40266
#116      TIGR03654:  175  40266  40441
#117      TIGR03723:  314  40441  40755
#118      TIGR03725:  212  40755  40967
#119      TIGR03953:  188  40967  41155


#We added the name of the bacterium with the name of the protein
df['protein_name'] = header + ' ' + df['protein_name']



# Let's create our FASTA file using a dict



mydict = {}

for index,row in df.iterrows():
    mydict[row['protein_name']] =  Sequence[row['Start']:row['End']]


secuencias = [ v for v in mydict.values() ]
nombres = [k for k in mydict]

ofile = open("/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fasta.fasta", "w")

for i in range(len(secuencias)):

    ofile.write(">" + nombres[i] + "\n" +secuencias[i] + "\n")

ofile.close()


# Remove the "-" and change the name of the final file
import os
with open(r'/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fasta.fasta', 'r') as infile, open(r'/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fastaa.fasta', 'w') as outfile:
        data = infile.read()
        data = data.replace("-", "")
        outfile.write(data)
myfile = "/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fastaa.fasta"
path = '/Users/monkiky/Desktop/control/ultimate_output/concatenates/'
os.rename("/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fastaa.fasta", "/Users/monkiky/Desktop/control/ultimate_output/concatenates/" + str(header) + ".fasta")
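
One detail of the script above that may be worth checking: the loop over `fasta_parser(...)` assigns `Sequence = seq` on every pass, so after the loop only the last record (and its `header`) is left, and a single run of the script writes a single output file. The sketch below illustrates that behaviour on made-up data. `fasta_records` is a simplified stand-in for `fasta_parser` that reads from any iterable of lines (an assumption, so the example needs no file on disk), and the three records are hypothetical.

```python
import io

def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical three-record alignment, made up for illustration.
demo = io.StringIO(">genome_A\nMKV-\nLLA\n>genome_B\nGRR\n>genome_C\nTTS\n")

last_only = None
all_records = {}
for header, seq in fasta_records(demo):
    last_only = (header, seq)   # overwritten on every pass, like `Sequence = seq`
    all_records[header] = seq   # keeps every record

print(last_only)
print(sorted(all_records))
```

If the alignment file is expected to hold several genomes, collecting the records in a dict (or writing the output inside the loop) would keep all of them instead of only the last one.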

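I really don't know where the problem is, or why only one file is generated.

When I read the output in the terminal, I can see com1 and com2 running 2 times in the loop. Why? It should be three times, and in the end only one file is generated.

In case it helps, this is what the terminal shows: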

(base) monkikys-Mini:control monkiky$ ./commands 
[2020-03-23 13:17:09] INFO: GTDB-Tk v1.0.2
[2020-03-23 13:17:09] INFO: gtdbtk identify --genome_dir /Users/monkiky/Desktop/GTDB/input --out_dir /Users/monkiky/Desktop/control/output
[2020-03-23 13:17:09] INFO: Using GTDB-Tk reference data version r89: /Users/monkiky/Desktop/GTDB/gtdbtk/release89
[2020-03-23 13:17:09] INFO: Identifying markers in 1 genomes with 1 threads.
[2020-03-23 13:17:09] INFO: Running Prodigal V2.6.3 to identify genes.
==> Finished processing 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:22] INFO: Identifying TIGRFAM protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:28] INFO: Identifying Pfam protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:28] INFO: Annotations done using HMMER 3.3 (Nov 2019)
[2020-03-23 13:17:29] INFO: Done.


##### Here com2 finishes

[2020-03-23 13:17:29] INFO: GTDB-Tk v1.0.2
[2020-03-23 13:17:29] INFO: gtdbtk align --identify_dir /Users/monkiky/Desktop/control/output --skip_trimming --out_dir /Users/monkiky/Desktop/control/output.align
[2020-03-23 13:17:29] INFO: Using GTDB-Tk reference data version r89: /Users/monkiky/Desktop/GTDB/gtdbtk/release89
[2020-03-23 13:17:29] INFO: Aligning markers in 1 genomes with 1 threads.
[2020-03-23 13:17:29] INFO: Processing 1 genomes identified as bacterial.
[2020-03-23 13:17:32] INFO: Read concatenated alignment for 23458 GTDB genomes.
==> Finished aligning 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:36] INFO: Skipping custom filtering and selection of columns.
[2020-03-23 13:17:36] INFO: Creating concatenated alignment for 23459 GTDB and user genomes.
[2020-03-23 13:17:51] INFO: Creating concatenated alignment for 1 user genomes.
[2020-03-23 13:17:51] INFO: Done.


##### Here com2 finishes again

GCA_000010565.1_genomic GRRKNAIARVFAMPGEGRIIINNRPLSEYFGRKTLETIVRQPLDLTGTASRFDIMAKVQGGGISGQAGAIKLGIARALIQADPNLRPVLKKAGFLTRDPR

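Maybe the problem is that I am running a script from inside another script. I don't know. I am new to informatics, so I assumed this was the right way to do it.

Any suggestion about my problem, or about small issues in my code, is very welcome.

By the way, I am also new to the StackOverflow community, so if you spot anything wrong with how I asked the question, please let me know. I am happy to improve and do it properly.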

What do you think

print("rm -r /Users/monkiky/Desktop/control/output")

does?

It removes the folder. I have read that the best way to delete the contents of a folder is to delete the folder and then create a new one.
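
For context on the exchange above: `print(...)` only writes the text of a command, and the folder is removed only if that text is later executed by a shell (the terminal log suggests the printed lines end up in a `./commands` file, but that is an assumption). If the goal is to empty a folder from inside Python, a minimal sketch using only the standard library could look like this. `reset_dir` is a hypothetical helper name, and the demonstration runs in a temporary directory rather than on the real output folder.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete `path` with everything inside it, then recreate it empty."""
    # Counterpart of `rm -r`; ignore_errors=True also tolerates a missing folder.
    shutil.rmtree(path, ignore_errors=True)
    # Counterpart of `mkdir`; exist_ok=True avoids an error if it already exists.
    os.makedirs(path, exist_ok=True)

# Demonstration in a throwaway temporary location.
base = tempfile.mkdtemp()
work = os.path.join(base, "output")
os.makedirs(work)
with open(os.path.join(work, "leftover.txt"), "w") as fh:
    fh.write("old result\n")

reset_dir(work)
remaining = os.listdir(work)
print(remaining)

shutil.rmtree(base)  # clean up the demonstration folder
```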

for myfile in allfiles:
    if myfile.endswith(".fna"):
        print(com1)
        print(com2)
        # com2 generates many files; the next script selects what I need
        # and does some manipulation.
        print("python /Users/monkiky/Desktop/control/getting_protein.py")
        # Remove all files we dont need
        print("rm -r /Users/monkiky/Desktop/control/output")
        print("mkdir /Users/monkiky/Desktop/control/output")
        print("rm -r /Users/monkiky/Desktop/control/output.align")
        print("mkdir /Users/monkiky/Desktop/control/output.align")
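
The loop body above prints the same strings on every pass unless `com1` and `com2` are built from `myfile` (their definitions are not shown here, so this is an assumption). A quick check is to list which files the loop actually matches and which output name each one would produce. The sketch below does that on a hypothetical folder; `plan_runs` and the file names are made up for illustration, and naming each output after its input file is only one possible convention.

```python
import os
import tempfile

def plan_runs(input_dir, suffix=".fna"):
    """Return (input_path, output_name) pairs, one per matching file."""
    runs = []
    for name in sorted(os.listdir(input_dir)):
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            runs.append((os.path.join(input_dir, name), stem + ".fasta"))
    return runs

# Hypothetical input folder with three genomes and one unrelated file.
input_dir = tempfile.mkdtemp()
for name in ("g1.fna", "g2.fna", "g3.fna", "notes.txt"):
    open(os.path.join(input_dir, name), "w").close()

runs = plan_runs(input_dir)
output_names = [out for _, out in runs]
print(len(runs), output_names)
```

If the number of planned runs is 3 but the output names are not all distinct, later runs would overwrite earlier ones and only one file would remain.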
