Python: using Luigi for straightforward batch-processing tasks


python parallel-processing


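I have 500 links to download and want to process them in batches of, for example, 10 items.

What would this pseudocode look like as working code?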

import luigi
import requests


class BatchJobTask(luigi.Task):
    items = luigi.IntParameter()  # batch size, e.g. 10

    def run(self):
        list_urls = []
        with open('urls_chunk', 'r') as urls:
            for line in urls:
                # strip the trailing newline before building the URL
                list_urls.append('http://ggg' + line.strip() + '.org')
        batch_urls = list_urls[0:self.items]  # 10 items here
        for url in batch_urls:
            req = requests.get(url)
            req.content

    def output(self):
        return luigi.LocalTarget("downloaded_filelist.txt")


class BatchWorker(luigi.Task):
    def run(self):
        # Here I should run BatchJobTask from 0 to 10, next 11 - 21 new etc...
        pass

How should this be done?

Here is one way to do something like what you want, with the list of strings stored in a file as separate lines.

import luigi
import requests

BATCH_SIZE = 10


class BatchProcessor(luigi.Task):
    items = luigi.ListParameter()
    max_index = luigi.IntParameter()  # number of lines read up to the end of this batch

    def output(self):
        return luigi.LocalTarget('processed' + str(self.max_index) + '.txt')

    def run(self):
        for item in self.items:
            req = requests.get('http://www.' + item + '.org')
            # do something useful here
            req.content
        # create an empty marker file so the batch counts as complete
        open(self.output().path, 'w').close()


class BatchCreator(luigi.Task):
    file_with_urls = luigi.Parameter()

    def requires(self):
        required_tasks = []
        lines = []
        total_index = 0
        with open(self.file_with_urls) as f:
            for line in f:
                total_index += 1
                lines.append(line.strip())
                if len(lines) == BATCH_SIZE:
                    required_tasks.append(
                        BatchProcessor(items=lines, max_index=total_index))
                    lines = []
        if lines:
            # the remaining lines form a final, smaller batch
            required_tasks.append(
                BatchProcessor(items=lines, max_index=total_index))
        return required_tasks

    def output(self):
        return luigi.LocalTarget(str(self.file_with_urls) + 'processed')

    def run(self):
        # create an empty marker file once every batch has finished
        open(self.output().path, 'w').close()
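
The batching step in requires does not depend on luigi, so it can be pulled out and checked in isolation. The following is a minimal sketch, not part of the original answer: iter_batches and demo_batches are hypothetical helper names, and the entries written to the temporary file are made up for illustration.

```python
import itertools
import os
import tempfile


def iter_batches(path, batch_size):
    """Yield (entries_read_so_far, batch) tuples; each batch has at most batch_size entries."""
    with open(path) as handle:
        # Strip newlines and skip blank lines so each entry is a clean URL component.
        entries = (line.strip() for line in handle if line.strip())
        count = 0
        while True:
            batch = list(itertools.islice(entries, batch_size))
            if not batch:
                break
            count += len(batch)
            yield count, batch


def demo_batches():
    """Write 25 hypothetical entries to a temporary file and batch them by 10."""
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write('\n'.join('site{}'.format(i) for i in range(25)))
        path = tmp.name
    try:
        return list(iter_batches(path, 10))
    finally:
        os.remove(path)


if __name__ == '__main__':
    for entries_read, batch in demo_batches():
        print(entries_read, len(batch))
```

Each tuple maps onto one BatchProcessor task: the batch becomes its list parameter and the running count becomes the integer that names its output file. The final batch is yielded even when it is shorter than batch_size.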

I did it like this:

import luigi
import requests


class GetListTask(luigi.Task):
    def run(self):
        ...

    def output(self):
        # self.outputfile is not defined anywhere in this snippet
        return luigi.LocalTarget(self.outputfile)


class GetJustOneFile(luigi.Task):
    fid = luigi.IntParameter()

    def run(self):
        url = 'http://my-server.com/test' + str(self.fid) + '.txt'
        response = requests.get(url)
        with self.output().open('w') as downloaded_file:
            # response.text is the decoded body; str(response.content)
            # would write the bytes repr (b'...') instead
            downloaded_file.write(response.text)

    def output(self):
        return luigi.LocalTarget("test{}.txt".format(self.fid))


class GetAllFiles(luigi.WrapperTask):
    def requires(self):
        # one download task per file id, 0..898
        return [GetJustOneFile(fid=fileid) for fileid in range(899)]

Is this code bad?
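
The task above schedules one download per id and does no grouping. If grouping by index is wanted, as in the "0 to 10, next 11 to 21" idea from the question, the index ranges can be computed up front. This is only a sketch: batch_ranges is a hypothetical helper name, and the totals used are examples.

```python
def batch_ranges(total, batch_size):
    """Return (start, stop) pairs that cover range(total); stop is exclusive."""
    return [(start, min(start + batch_size, total))
            for start in range(0, total, batch_size)]


if __name__ == '__main__':
    print(batch_ranges(25, 10))        # [(0, 10), (10, 20), (20, 25)]
    print(len(batch_ranges(500, 10)))  # 50
```

Each (start, stop) pair could then be passed as two integer parameters to a task that downloads range(start, stop), so that 500 links become 50 tasks instead of 500.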

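Where is your list of URLs?

I have updated the first post.

I mean, where is the list of URLs stored? In a queue, a database, a file? What you want to do is work out how many items are in that store and then build your chunks from there. I will put an example below, but it is unlikely to be relevant to your problem, since you have not specified that part of the problem.

Well, it doesn't do any batching, but it should work.

How can I feed files into GetAllFiles from GetListTask instead of a predefined list?

That is what I showed in the requires method of my BatchCreator task, assuming you have a file in which each line is a different URN component.

Don't use max as a variable name in Python, because it shadows the built-in function. Also, parameter values have to be read through self (self.max here); the code as originally posted does not run.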
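
The point about self can be reproduced without luigi. In the sketch below, FakeTask is a hypothetical stand-in class, not a luigi task: inside a method, a bare max is never looked up in the class body, so it resolves to the built-in function.

```python
class FakeTask:
    """Plain-Python stand-in for a task class; no luigi required."""

    max = 10  # class attribute that shares its name with the built-in max()

    def name_without_self(self):
        # Method bodies do not see class-level names directly, so bare
        # `max` here is the built-in function, not the attribute above.
        return 'processed' + str(max) + '.txt'

    def name_with_self(self):
        # Going through self finds the attribute value.
        return 'processed' + str(self.max) + '.txt'


if __name__ == '__main__':
    task = FakeTask()
    print(task.name_without_self())  # processed<built-in function max>.txt
    print(task.name_with_self())     # processed10.txt
```

Choosing a name that does not collide with a built-in avoids the confusion entirely, but the self prefix is required either way.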