How to run a Python script in parallel for bioinformatics

python parallel-processing bioinformatics biopython

I want to read a FASTA sequence file with Python and convert it into a pandas DataFrame. I am using the following script:

from Bio import SeqIO
import pandas as pd

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # print(desp)
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
        seq_df = pd.DataFrame(seqList)
        print(seq_df.shape)
        seq_df.columns=['strainName']+list(range(1, seq_df.shape[1]))
    return seq_df


if __name__ == "__main__":
    path = 'path/to/the/fasta/file'
    input = path + 'GISAIDspikeprot0119.selection.fasta'
    df = fasta2df(input)

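The file "GISAIDspikeprot0119.selection.fasta" can be found at the following location:

On my Linux workstation the script runs on only one CPU core. Is it possible to use more cores (multiple processes) so that it runs faster? What would the code look like?

Thank you very much.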

Before throwing more CPUs at the problem, you should spend some time checking which parts of your code are slow.
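One way to do that check is sketched below with the standard-library cProfile module. The helper name profile_call and the stand-in function rebuild_every_iteration are made up for illustration; in practice you would pass fasta2df and the path to your FASTA file instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def rebuild_every_iteration(n):
    """Hypothetical stand-in for a slow loop: copies the growing list each pass."""
    rows = []
    table = []
    for i in range(n):
        rows.append(i)
        table = list(rows)  # full rebuild on every iteration
    return table


if __name__ == "__main__":
    _, report = profile_call(rebuild_every_iteration, 2000)
    print(report)
```

The report lists the functions with the highest cumulative time, which is usually enough to see whether the parsing or the table construction dominates.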

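In your case, you are performing the expensive conversion seq_df = pd.DataFrame(seqList) in every loop iteration. This just wastes CPU time, because the result seq_df is overwritten in the next iteration.

Your code took more than 15 minutes on my machine. After moving pd.DataFrame(seqList) and the print statement out of the loop, the time dropped to about 15 seconds: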

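def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
    # Build the DataFrame once, after all records have been collected
    seq_df = pd.DataFrame(seqList)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df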

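In fact, almost all of the remaining time is spent on the pd.DataFrame(seqList) line, about 13 seconds for me. By explicitly setting the dtype to string, we can bring that down to ~7 seconds: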

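def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
    # An explicit dtype spares pandas from inferring a type for every column
    seq_df = pd.DataFrame(seqList, dtype="string")
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df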

With this new performance, I very much doubt that you can improve the speed any further through parallel processing.

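Take a look at the multiprocessing module. Since your code is already wrapped in a function, you can experiment with multiprocessing's map. Also, if you have to apply this code to several files, you can launch several scripts at the same time (in a Linux shell, use & to start a process in the background).

Yes, I can try that, but I am not familiar with it. Could you give some code here as a reference? Can multiprocessing specify the number of cores to use, or does it just use all the cores available on the machine?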
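As a rough reference for the question above: multiprocessing.Pool takes a processes argument that caps the number of worker processes, and if it is left as None the pool uses the number of CPUs the system reports. The sketch below assumes the work is split across several FASTA files. The names count_fasta_records and map_over_files and the file names are hypothetical; the worker is a plain-text stand-in, so fasta2df could be passed in its place.

```python
import os
from multiprocessing import Pool


def count_fasta_records(path):
    """Hypothetical per-file worker: count header lines ('>') in one FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def map_over_files(worker, paths, processes=None):
    """Apply worker to each path using at most `processes` worker processes.

    processes=None lets Pool pick the CPU count; processes=1 runs serially.
    """
    paths = list(paths)
    if processes == 1 or len(paths) < 2:
        # Not worth starting a pool for a single task or a single worker.
        return [worker(p) for p in paths]
    with Pool(processes=processes) as pool:
        # pool.map returns results in the same order as the inputs.
        return pool.map(worker, paths)


if __name__ == "__main__":
    # Placeholder file names, not taken from the original post.
    fasta_files = ["sample_a.fasta", "sample_b.fasta"]
    existing = [p for p in fasta_files if os.path.exists(p)]
    print(map_over_files(count_fasta_records, existing, processes=2))
```

This only pays off when there are several independent files to process. For a single file, the DataFrame construction dominates the runtime, so splitting the parsing across processes is unlikely to help much.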