Python Luigi任务没有'；t激发ETL过程的要求_Python_Luigi

Python Luigi任务没有'；t激发ETL过程的要求

python

Python Luigi任务没有'；t激发ETL过程的要求,python,luigi,Python,Luigi,这是我的后续工作，关于循环ETL过程中要遵循的模式今天，我写的机器学习作业是手工完成的。我下载所需的输入文件，学习和预测事物，输出一个.csv文件，然后复制到数据库中然而，由于这将投入生产，我需要自动化所有这一过程。所需的输入文件将每月（并最终更频繁地）从提供者到达S3存储桶作为一个例子，我试图在Luigi中实现这一点，但将S3更改为本地目录，因此事情会更简单。这个节目应该查看输入目录中的新文件找到新文件后，通过extract任务将其解压缩到数据目录通过转换任务进行处理（使用算法功

这是我的后续工作，关于循环ETL过程中要遵循的模式

今天，我写的机器学习作业是手工完成的。我下载所需的输入文件，学习和预测事物，输出一个.csv文件，然后复制到数据库中

然而，由于这将投入生产，我需要自动化所有这一过程。所需的输入文件将每月（并最终更频繁地）从提供者到达S3存储桶

作为一个例子，我试图在Luigi中实现这一点，但将S3更改为本地目录，因此事情会更简单。这个节目应该

查看输入目录中的新文件
找到新文件后，通过
```
extract
```
任务将其解压缩到数据目录
通过
```
转换
```
任务进行处理（使用
```
算法
```
功能）
它在
```
Load
```
任务中加载到PostgreSQL数据库

问题:

当我将一个文件放入输入目录并启动加载任务时，它不会启动提取任务。我做错了什么
是否每次使用此
```
update\u id
```
参数时都会触发加载

当文件存在于

result/

目录中时，您的

Load

任务仅创建

Transform

任务。它不应该在

input

目录中查找新文件吗

为什么代码用4个空格缩进，并且没有显示为格式化代码？因为在第4个项目符号中它被假定为块的一部分。谢谢，这解释了问题，仍然没有激发需求。在更改我正在使用的真实文件夹的名称时，这是一个输入错误

import glob

import luigi
from luigi.contrib import postgres
import pandas as pd


class ReadFile(luigi.ExternalTask):
    # Simply load the new file from input directory 
    filename = luigi.Parameter()
    def output(self):
        return luigi.hdfs.LocalTarget('input/' + self.filename)


class Extract(luigi.Task):
    # Extract from input directory and put in the data directory
    filename = luigi.Parameter()
    def requires(self):
        return ReadFile(self.filename)
    def output(self):
        return luigi.hdfs.LocalTarget('data/' + self.filename)
    def run(self):
        with self.input().open('r') as input_file:
            data = input_file.read()
        with self.output().open('w') as output_file:
            write(output_file, data)


class Transform(luigi.Task):
    # Transform the file from data directory using the transform function
    filename = luigi.Parameter()
    def requires(self):
        return Extract(self.filename)
    def output(self, filename):
        return luigi.hdfs.LocalTarget('results/' + self.filename)
    def run(self):
        with self.input().open('r') as input_file:
            data = input_file.read()
        result = trasnform(data)
        with self.output().open('w') as output_file:
            result.to_csv(output_file)
        mark_as_done(self.filename)


class Load(luigi.Task):
    # Find new files, run the Transform function and load into the PostgreSQL DB
    date = luigi.DateParameter()
    def requires(self):
        return [Transform(filename) for filename in new_files('input/')]
    def output(self):
        return postgres.PostgresTarget(host='db', database='luigi', user='luigi', password='luigi', table='test', update_id=self.date)
    def run(self):
        for input in self.input():
            with input.open('r') as inputfile:
                result = pd.read_csv(inputfile)
            connection = self.output().connect()
            for row in result.itertuples():
                cursor = connection.cursor()
                cursor.execute('INSERT INTO test VALUES (?,?)', row)

# Get connection to the SQLite DB, which will store the files that were already processed
SQLITE_CONNECTION = None
def get_connection():
    if SQLITE_CONNECTION is None:
         SQLITE_CONNECTION = sqlite3.connect('processed.db')
    return SQLITE_CONNECTION


# Mark filename as done in the SQLite DB
def mark_as_done(filename):
    connection = get_connection()
    cursor = connection.cursor()
    cursor.execute('INSERT INTO processed_files VALUES (?)', (filename,))


# Check of the file were already processed
def new_file(filename):
    connection = get_connection()
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM processed_files WHERE file=%s', (filename,))
    return cursor.rowcount == 0


# Yields filenames of files that were not processed yet
def new_files(path):
    for filename in glob.glob(path + '*.csv'):
        if new_file(filename):
            yield filename


# Mock of the transform process
def trasnform(data):
    return pd.DataFrame({'a': [1,2,3], 'b': [1,2,3]})