Python: Formatting Snakemake input files in the shell (Python, Bioinformatics, Pipeline, Snakemake)


I am using a snakemake pipeline to run the GATK MarkDuplicates command, which takes multiple input bam files from different read groups.

rule mark_duplicates:
    input:
        get_dedup_input
    output:
        bam=temp("bams/{patient}.{sample_type}.markdups.bam"),
        md5=temp("bams/{patient}.{sample_type}.markdups.bam.md5"),
        metrics="qc/gatk/{patient}_{sample_type}_dup_metrics.txt"
    conda:
        "../envs/gatk.yml"
    shell:
        """
        gatk MarkDuplicates -I {input} -O {output.bam} -M {output.metrics} \
            --CREATE_MD5_FILE true --ASSUME_SORT_ORDER "queryname"
        """

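get_dedup_input returns a list of input bam files. MarkDuplicates requires each input file to be specified with its own -I flag. If I only had one bam file, I could simply write -I {input}, but with several files this fails because it expands to

-I file1.bam file2.bam

when it needs to be

-I file1.bam -I file2.bam

What is the best way to format the input so that each input file is specified as -I [input file]?

Below are two scenarios that illustrate what the inputs, outputs, and shell command would look like if I ran the command manually. For brevity, I have omitted some non-essential MarkDuplicates flags: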

(Scenarios: one read group; two read groups.)

Thanks for the update.

The best approach is probably to create a param that builds the string we need, like this:

rule mark_duplicates:
    input:
        get_dedup_input
    output:
        bam=temp("bams/{patient}.{sample_type}.markdups.bam"),
        md5=temp("bams/{patient}.{sample_type}.markdups.bam.md5"),
        metrics="qc/gatk/{patient}_{sample_type}_dup_metrics.txt"
    conda:
        "../envs/gatk.yml"
    params:
        input=lambda wildcards, input: " -I ".join(input)
    shell:
        """
        gatk MarkDuplicates -I {params.input} -O {output.bam} -M {output.metrics} \
            --CREATE_MD5_FILE true --ASSUME_SORT_ORDER "queryname"
        """

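If our input is a single bam, params.input will just be patient101.normal.rg1.bam, and the -I in the shell command is prepended to it as usual.

If we have two input BAMs, the lambda function puts -I between them, and the shell command adds the leading -I in front of the first one.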
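The joining behaviour described above can be checked in plain Python, outside Snakemake. This is only a minimal sketch: build_command is a hypothetical helper that imitates how the shell template would expand, and the file names are the illustrative ones from the question.

```python
def join_inputs(bams):
    """Imitate the params lambda: place ' -I ' between the paths."""
    return " -I ".join(bams)


def build_command(bams, out_bam, metrics):
    # Hypothetical stand-in for Snakemake's shell formatting.
    # The leading -I comes from the template, not from join_inputs().
    return f"gatk MarkDuplicates -I {join_inputs(bams)} -O {out_bam} -M {metrics}"


one_group = build_command(
    ["patient101.normal.rg1.bam"],
    "patient101.normal.markdups.bam",
    "metrics.txt",
)
two_groups = build_command(
    ["patient101.normal.rg1.bam", "patient101.normal.rg2.bam"],
    "patient101.normal.markdups.bam",
    "metrics.txt",
)

print(one_group)
print(two_groups)
```

With a single path the join is a no-op, so both cases end up with exactly one -I per input file.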

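Comments:

Could you give us an example of the inputs and outputs you expect from this rule?

Post updated to show an example of the input and output files, and how the command is run!

I'm on a train right now, but I'll post an answer when I arrive (if nobody else has by then :))

Thanks, that's a great idea! After some quick testing, params.input looks like file1.bam -I file2.bam when the string needs to be -I file1.bam -I file2.bam. Changing the lambda to input=lambda wildcards, input: ["-I " + f for f in input] fixes the issue. Could you update your solution with the new lambda so I can mark it as correct? Thanks again for your help!

Sure, but the -I in the shell command should already take care of that, right?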
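The two lambda variants discussed in these comments can be compared side by side. The sketch below is plain Python with hypothetical render_* helpers, not Snakemake itself. It assumes that a list-valued param is rendered space-separated in the shell command, which is how Snakemake normally expands lists.

```python
def render_with_join(bams):
    # Answer's variant: the shell template supplies the first -I.
    params_input = " -I ".join(bams)
    return f"gatk MarkDuplicates -I {params_input}"


def render_with_prefix(bams):
    # Comment's variant: every path carries its own -I,
    # so the shell template must NOT contain an extra -I.
    # Assumption: the list is joined with spaces when rendered.
    params_input = ["-I " + f for f in bams]
    return f"gatk MarkDuplicates {' '.join(params_input)}"


def render_mixed(bams):
    # Mixing the two (prefix lambda plus -I in the template)
    # doubles the first flag.
    params_input = ["-I " + f for f in bams]
    return f"gatk MarkDuplicates -I {' '.join(params_input)}"


bams = ["file1.bam", "file2.bam"]
print(render_with_join(bams))
print(render_with_prefix(bams))
print(render_mixed(bams))
```

Both variants yield the same command as long as the shell template matches the lambda: keep -I {params.input} with the join version, or drop the template's -I with the list-comprehension version.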
Inputs: patient101.normal.rg1.bam, patient101.normal.rg2.bam
Output: patient101.normal.markdups.bam
Shell: 
gatk MarkDuplicates -I patient101.normal.rg1.bam \
-I patient101.normal.rg2.bam \
-O patient101.normal.markdups.bam \
-M metrics.txt
