Snakemake: dynamically deriving targets from input files

I have a large number of input files organized like this:

data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
│   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
│   :
│   └── asdfX_R2.fq.gz
:   
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz

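If you are familiar with bioinformatics: these are of course fastq files from paired-end sequencing runs. I am trying to build a snakemake workflow that reads all of them, but I am already failing at the first rule. This is my attempt: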

configfile: "config.yaml"

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])

rule clip_and_trim_reads:
    input:
        read1=expand("{data}/{set}/{{sample}}_R1.fq.gz", data=config["raw_data"], set=config["sets"]),
        read2=expand("{data}/{set}/{{sample}}_R2.fq.gz", data=config["raw_data"], set=config["sets"])
    output:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])
    threads: 16
    shell:
        """
        someTool -o {output.read1} -p {output.read2} \
        {input.read1} {input.read2}
        """

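I cannot specify clip_and_trim_reads as a target, because Snakemake reports:

Target rules may not contain wildcards.

I tried adding an all rule, but that gives me a different error: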

$ snakemake -np
Building DAG of jobs...
WildcardError in line 3 of /work/project/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'

I also tried using dynamic() for the all rule, which produced:

$ snakemake -np
Dynamic output is deprecated in favor of checkpoints (see docs). It will be removed in Snakemake 6.0.
Building DAG of jobs...
MissingInputException in line 7 of /work/project/ladsie_002/analyses/scripts/2019-08-02_splice_leader_HiC/Snakefile:
Missing input files for rule clip_and_trim_reads:
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
[...]

I have more than 100 different files, so I would very much like to avoid creating a list with all the file names. Is there a way to do this?
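I think you misunderstand how snakemake works. When you run snakemake, you either name the desired output on the command line, or Snakemake generates the input of the first rule in the Snakefile (your rule all). Since you did not specify any output (snakemake -np), Snakemake will try to generate the input of rule all.
The input of your rule all is basically: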
"somepath/clipped_and_trimmed_reads/{sample}_R1.fq.gz"

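Unfortunately, Snakemake cannot work out how to generate output from this, because {sample} is still an unresolved wildcard. We need to tell Snakemake which files to use. We can do this manually: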
rule all:
    input:
        "somepath/clipped_and_trimmed_reads/file1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/asdf1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/zxcv1_R1.fq.gz"

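But this becomes quite cumbersome as we get more files, and as you say in the question, it is not what you want. We need to write a small piece of code that collects all the file names: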
import glob
import re

data=config["raw_data"]
samples = []
locations = {}
for file in glob.glob(data + "/**", recursive=True):
    if '_R1.fq.gz' in file:
        split = re.split('/|_R1', file)
        filename, directory = split[-2], split[-3]
        locations[filename] = directory  # we will need this one later
        samples.append(filename)
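
The loop above silently overwrites an entry in locations when the same sample name appears in two sets. A possible variant, sketched here with pathlib, makes that case an explicit error. The function name find_samples is hypothetical, and the sketch assumes the one-level layout from the question (raw_data/set/sample_R1.fq.gz) rather than arbitrary nesting:

```python
from pathlib import Path


def find_samples(raw_data):
    """Map each sample name to the set directory holding its *_R1.fq.gz file."""
    suffix = "_R1.fq.gz"
    locations = {}
    # Layout assumed from the question: <raw_data>/<set>/<sample>_R1.fq.gz
    for path in sorted(Path(raw_data).glob("*/*" + suffix)):
        sample = path.name[: -len(suffix)]
        if sample in locations:
            # A flat output folder cannot hold two samples with the same name.
            raise ValueError(f"sample name {sample!r} occurs in more than one set")
        locations[sample] = path.parent.name
    return locations
```

In the Snakefile this could be used as locations = find_samples(config["raw_data"]) followed by samples = sorted(locations), which yields the same two objects as the loop above.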

Now we can feed this into our rule all:
rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz", output=config["output"], sample=samples),
        read2=expand("{output}/clipped_and_trimmed_reads/{sample}_R2.fq.gz", output=config["output"], sample=samples)

Note that sample is no longer a wildcard here: we "expand" it into read1 and read2, which produces every possible combination of output and sample.
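
To see why the doubled braces in the question leave {sample} unresolved while the single braces here do not, the behaviour can be imitated in plain Python. This is only a sketch: mini_expand is a hypothetical stand-in written for illustration, not Snakemake's implementation, and "results" is an assumed value for config["output"]:

```python
from itertools import product


def mini_expand(pattern, **values):
    """Hypothetical stand-in for expand(): fill a pattern with every combination of values."""
    keys = list(values)
    # Treat a bare string as a one-element list, so output="results" works.
    choices = [[v] if isinstance(v, str) else list(v) for v in values.values()]
    return [pattern.format(**dict(zip(keys, combo))) for combo in product(*choices)]


# Doubled braces are escaped, so {sample} survives as a literal, unresolved wildcard.
unresolved = mini_expand(
    "{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output="results"
)

# Single braces plus a list of names give one concrete path per sample.
resolved = mini_expand(
    "{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz",
    output="results",
    sample=["file1", "asdf1"],
)
```

The first call returns a single pattern that still contains {sample}, which is what rule all received in the question. The second returns two concrete file names that Snakemake can match against the output of clip_and_trim_reads.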
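However, we are only halfway there. If we call Snakemake like this, it knows exactly which output we want and which rule can generate it (rule clip_and_trim_reads). It still does not know where to find the input files, though. Fortunately, we already have a dictionary that stores this (locations):
rule clip_and_trim_reads:
    input:
        read1=lambda wildcards: expand("{data}/{set}/{sample}_R1.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample),
        read2=lambda wildcards: expand("{data}/{set}/{sample}_R2.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample)
    output:
        ...

Now everything should work! Even better: since all results of rule clip_and_trim_reads are written to a single folder, continuing from here should be easier.
p.s. I have not tested any of the code, so not everything may work on the first try. The message still stands, though.

Thanks, that was really helpful. One last question: is there any way to modify rule all so that the output keeps the folder structure, {output}/clipped_and_trimmed_reads/{set}/{sample}_R1.fq.gz? I tried using a lambda function in the input, but that did not work.

Only input can be a function, so a lambda will never work for output. I do not completely understand what you mean. What happens when your output is simply read1="{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz", read2="{output}/clipped_and_trimmed_reads/{sample}_R2.fq.gz"?

Sorry, I was not clear. I want the output to keep the original folder structure as well, instead of putting all output files in one flat folder. But I have already figured it out, thanks!

Actually, I am struggling with this right now too! I have a list object with file names, fqlist=['/path/file1.fastq', '/path/file2.fastq', '/path/file3.fastq']. I am trying to set that object fqlist as a wildcard (I want the input of the mapping rule (bowtie2) to be input: lambda wildcards: expand('{fqfile}', fqfile=fqlist)[wildcards.fqlist]). But this leads to the error "InputFunctionException. AttributeError: 'Wildcards' object has no attribute 'fqlist'. Wildcards: (followed by an empty line, as if I had none)." Does anyone know how to define it? Thank you!

@msimmer92 it would be better if you started a new question :)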
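
The comment thread does not show how the per-set folder structure was eventually kept, so the following is only one possible approach. It assumes the output of clip_and_trim_reads is changed to contain a {set} wildcard as well. The targets for rule all must then pair each sample with its own set, rather than combine every set with every sample. The helper nested_targets and the example values are hypothetical:

```python
def nested_targets(output, locations, mates=("R1", "R2")):
    """Build one target path per sample and mate, keeping the set subfolder."""
    return [
        f"{output}/clipped_and_trimmed_reads/{set_name}/{sample}_{mate}.fq.gz"
        for sample, set_name in sorted(locations.items())
        for mate in mates
    ]


# Hypothetical example values; in the Snakefile these would come from
# config["output"] and the locations dictionary built earlier.
targets = nested_targets("results", {"file1": "set1", "asdf1": "another_set"})
```

The resulting list could be passed directly as the input of rule all. Snakemake's expand() also accepts zip as its second argument to pair values in this way. With {set} available as a wildcard, the input of clip_and_trim_reads would presumably no longer need the lambda lookup.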