Loop in a Python script that runs Linux commands and other scripts
I am using a simple loop in a Python script to run a program once for each file in a folder. I am still developing the script, so at the moment my input contains 3 files, and I expect 3 files as output. For a reason I cannot explain, I get only 1.

I show all my code, but I point out where the problem is: there is some setup and some code that runs the programs (neither should matter), and then the loop that runs the programs. getting_protein.py, the script called inside the loop, is as follows:
import pandas as pd
import os
#Read the fasta document
def fasta_parser(myfile):
    with open(myfile) as f:
        header = ""
        seq = ""
        for line in f:
            if line[0] == ">":
                if seq != "":
                    yield (header[1:], seq)
                    header = line.strip()
                    seq = ""
                else:
                    header = line.strip()
            else:
                seq += line.strip()
        yield (header[1:], seq)
#Transform the document into a text string called "Sequence"
Sequence = []
for header, seq in fasta_parser('/Users/monkiky/Desktop/control/output.align/gtdbtk.bac120.user_msa.fasta'):
    print(header, seq[:100])
    Sequence = seq
#The following code saves the length of each protein
f = open("/Users/monkiky/Desktop/GTDB/proteins_len.log", "r")
LEN = []
import re
for line in f:
    # Raw string so that \d is passed to the regex engine unchanged
    secuence = re.search(r'LEN:(\d+)', line)
    if secuence:
        LEN.append(secuence.group(1))
# To save the name of the protein
f = open("/Users/monkiky/Desktop/GTDB/proteins_len.log", "r")
Names = []
for i, line in enumerate(f):
    if i > 19:
        Names.append(line.split(" ")[3])
Names = Names[:-9]  # Last 9 lines in the document are not genes.
#Names
#['Ribosomal_S9:',
# 'Ribosomal_S8:',
# 'Ribosomal_L10:',
# 'GrpE:',
# 'DUF150:',
# 'PNPase:',
# 'TIGR00006:',
# ...
# To create a df with both lists
bac120 = {'protein_name':Names,'LEN':LEN}
df = pd.DataFrame(bac120)
# Calculate the protein sequences in the concatenated
df['LEN'] = df['LEN'].astype(int)
s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
#print (df)
#protein_name LEN Start End
#0 Ribosomal_S9: 121 0 121
#1 Ribosomal_S8: 129 121 250
#2 Ribosomal_L10: 100 250 350
#3 GrpE: 166 350 516
#4 DUF150: 141 516 657
#.. ... ... ... ...
#115 TIGR03632: 117 40149 40266
#116 TIGR03654: 175 40266 40441
#117 TIGR03723: 314 40441 40755
#118 TIGR03725: 212 40755 40967
#119 TIGR03953: 188 40967 41155
#We added the name of the bacterium with the name of the protein
df['protein_name'] = header + ' ' + df['protein_name']
# Let's create our fasta file using a dict
mydict = {}
for index, row in df.iterrows():
    mydict[row['protein_name']] = Sequence[row['Start']:row['End']]
secuencias = [v for v in mydict.values()]
nombres = [k for k in mydict]
ofile = open("/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fasta.fasta", "w")
for i in range(len(secuencias)):
    ofile.write(">" + nombres[i] + "\n" + secuencias[i] + "\n")
ofile.close()
# Remove the "-" and change the name of the final file
import os
with open(r'/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fasta.fasta', 'r') as infile, open(r'/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fastaa.fasta', 'w') as outfile:
    data = infile.read()
    data = data.replace("-", "")
    outfile.write(data)
myfile = "/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fastaa.fasta"
path = '/Users/monkiky/Desktop/control/ultimate_output/concatenates/'
os.rename("/Users/monkiky/Desktop/control/ultimate_output/concatenates/my_fastaa.fasta", "/Users/monkiky/Desktop/control/ultimate_output/concatenates/" + str(header) + ".fasta")
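The Start/End bookkeeping used to cut the concatenated alignment into proteins can be illustrated on toy data without pandas. The following is a minimal sketch, not the original script: the function name `slice_concatenated`, the variable names, and the toy sequence are all hypothetical, and `itertools.accumulate` stands in for the `cumsum()`/`shift()` columns.

```python
from itertools import accumulate


def slice_concatenated(sequence, names, lengths):
    """Split one concatenated sequence into named pieces.

    Each piece covers [start, end), where end is the running total of the
    lengths and start is the previous running total (0 for the first piece).
    """
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return {
        name: sequence[start:end]
        for name, start, end in zip(names, starts, ends)
    }


# Toy data (hypothetical): three "proteins" of length 3, 2 and 4.
toy_sequence = "AAA" + "CC" + "GGGG"
toy_names = ["p1", "p2", "p3"]
toy_lengths = [3, 2, 4]

pieces = slice_concatenated(toy_sequence, toy_names, toy_lengths)
for name, piece in pieces.items():
    print(name, piece)
```

Because dictionary keys are unique, two proteins with the same name would overwrite each other in such a mapping.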
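I really don't know where the problem is. I don't know why only one file is generated.
When I read the process in the terminal, I can see that com1 and com2 run in the loop 2 times. Why? It should be three times, and in the end only one file is generated.
In case it helps, here is what the terminal shows: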
(base) monkikys-Mini:control monkiky$ ./commands
[2020-03-23 13:17:09] INFO: GTDB-Tk v1.0.2
[2020-03-23 13:17:09] INFO: gtdbtk identify --genome_dir /Users/monkiky/Desktop/GTDB/input --out_dir /Users/monkiky/Desktop/control/output
[2020-03-23 13:17:09] INFO: Using GTDB-Tk reference data version r89: /Users/monkiky/Desktop/GTDB/gtdbtk/release89
[2020-03-23 13:17:09] INFO: Identifying markers in 1 genomes with 1 threads.
[2020-03-23 13:17:09] INFO: Running Prodigal V2.6.3 to identify genes.
==> Finished processing 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:22] INFO: Identifying TIGRFAM protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:28] INFO: Identifying Pfam protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:28] INFO: Annotations done using HMMER 3.3 (Nov 2019)
[2020-03-23 13:17:29] INFO: Done.
##### Here com2 finishes
[2020-03-23 13:17:29] INFO: GTDB-Tk v1.0.2
[2020-03-23 13:17:29] INFO: gtdbtk align --identify_dir /Users/monkiky/Desktop/control/output --skip_trimming --out_dir /Users/monkiky/Desktop/control/output.align
[2020-03-23 13:17:29] INFO: Using GTDB-Tk reference data version r89: /Users/monkiky/Desktop/GTDB/gtdbtk/release89
[2020-03-23 13:17:29] INFO: Aligning markers in 1 genomes with 1 threads.
[2020-03-23 13:17:29] INFO: Processing 1 genomes identified as bacterial.
[2020-03-23 13:17:32] INFO: Read concatenated alignment for 23458 GTDB genomes.
==> Finished aligning 1 of 1 (100.0%) genomes.
[2020-03-23 13:17:36] INFO: Skipping custom filtering and selection of columns.
[2020-03-23 13:17:36] INFO: Creating concatenated alignment for 23459 GTDB and user genomes.
[2020-03-23 13:17:51] INFO: Creating concatenated alignment for 1 user genomes.
[2020-03-23 13:17:51] INFO: Done.
##### Here com2 finishes again
GCA_000010565.1_genomic GRRKNAIARVFAMPGEGRIIINNRPLSEYFGRKTLETIVRQPLDLTGTASRFDIMAKVQGGGISGQAGAIKLGIARALIQADPNLRPVLKKAGFLTRDPR
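Maybe the problem is that I am running a script inside another script. I don't know. I am new to informatics, so I assumed this was the right way to do it.
Any advice about my problem, or about smaller issues in my code, is very welcome.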
By the way, I am also new to the StackOverflow community. If you spot any mistakes, for example in how a question should be asked, please let me know. I am happy to improve and do it properly.

What do you think print("rm -r /Users/monkiky/Desktop/control/output") does?

It removes the folder. I have read that the best way to delete the contents of a folder is to delete the folder and then create a new one.
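Note that `print("rm -r ...")` by itself only writes the command text to standard output; the folder is removed only if that text is later executed by a shell. A minimal sketch of doing the same "delete the folder, then create it again" step directly in Python, using the standard-library `shutil` and `os` modules. The helper name `reset_dir` and the temporary directory are hypothetical stand-ins for the real output folder.

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete `path` with everything in it, then recreate it empty."""
    # ignore_errors=True suppresses errors such as the folder not existing yet
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path)


# Demonstration on a throwaway directory instead of the real output folder.
base = tempfile.mkdtemp()
work = os.path.join(base, "output")
os.makedirs(work)
open(os.path.join(work, "leftover.txt"), "w").close()

reset_dir(work)
print(os.listdir(work))  # the folder exists again and is empty
```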
for myfile in allfiles:
    if myfile.endswith(".fna"):
        print(com1)
        print(com2)
        # com2 generates many files; the next script selects what I want
        # and does some manipulation.
        print("python /Users/monkiky/Desktop/control/getting_protein.py")
        # Remove all files we don't need
        print("rm -r /Users/monkiky/Desktop/control/output")
        print("mkdir /Users/monkiky/Desktop/control/output")
        print("rm -r /Users/monkiky/Desktop/control/output.align")
        print("mkdir /Users/monkiky/Desktop/control/output.align")
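One way to check that a loop of this kind visits every input is to build the per-file commands as data first and count them. This is only a sketch under assumptions: the helper names `plan_commands` and `run_all`, the placeholder command `echo`, and the example file names are hypothetical, and it uses `subprocess.run` to execute commands from Python rather than printing them for a shell to run later.

```python
import subprocess


def plan_commands(filenames, suffix=".fna"):
    """Return one command (as an argument list) per file ending in `suffix`."""
    return [
        ["echo", "processing", name]  # placeholder for the real program call
        for name in sorted(filenames)
        if name.endswith(suffix)
    ]


def run_all(commands):
    """Run each command in order; raise CalledProcessError if one fails."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical folder listing: three genomes and one unrelated file.
allfiles = ["a.fna", "b.fna", "c.fna", "notes.txt"]
commands = plan_commands(allfiles)
print(len(commands))  # one command per .fna file
```

Calling `run_all(commands)` would then execute them one after another. Each command is built from the current file name, so three inputs give three distinct commands; in a real script the placeholder would be replaced by the actual program invocations.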