Google App Engine: MapReduce takes longer than a for loop


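Story

I support a GAE webapp that exports data. Currently it uses a for loop to dump an ndb table to a CSV file. The for loop now takes too long (~25 minutes) and does not always finish before the job needs to move to another machine. I tried to use MapReduce to shorten the job, but my MapReduce job runs for hours without any notable output, errors, logs, etc. The mapper never finishes, and it never even attempts to write to BQ. I'm sure I'm missing something. Any suggestions would be helpful.

Code (some variable names have been changed to protect their identity)

Original for loop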

more = True
cursor = None
formatted_rows = []
query = NDBTable.query()
while more:
    rows, cursor, more = query.fetch_page(page_size=5000, start_cursor=cursor)
    try:
        formatted_rows += [datastore_map(row) for row in rows]
    except Timeout, e:
        error_msg = "{}\nThe Datastore timed out while trying to map the row export for \n{}".format(repr(e), rows)
        logging.error(error_msg)


filename = '/filepath'
gcs_file = gcs.open(filename, 'w', content_type='text/csv')

output = StringIO.StringIO()
output.write('Column headers')
output.write('\n')
for row in formatted_rows:
    output.write(str(row))

if len(output.getvalue()) > 0:
    gcs_file.write(output.getvalue())
    output.close()
    gcs_file.close()
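
One thing worth noting about the loop above: `datastore_map` is a generator function, so the list comprehension collects generator objects rather than CSV strings, and every row is held in memory before anything is written. Below is a minimal sketch of a streaming variant. It assumes a hypothetical `fetch_page` callable with the same return shape as `query.fetch_page` (rows, cursor, more) and any file-like `out` object; `export_pages` and its parameters are illustrative names, not part of the original code.

```python
import io


def export_pages(fetch_page, map_row, out, header, page_size=500):
    """Write the header, then stream each page of mapped rows to `out`.

    fetch_page: hypothetical callable, fetch_page(page_size, cursor) ->
                (rows, next_cursor, more), mirroring ndb's fetch_page.
    map_row:    generator function yielding zero or more text lines per row.
    Returns the number of lines written (excluding the header).
    """
    out.write(header + "\n")
    written = 0
    cursor, more = None, True
    while more:
        rows, cursor, more = fetch_page(page_size, cursor)
        for row in rows:
            # Consume the generator so the yielded strings are written,
            # not the generator object itself.
            for line in map_row(row):
                if line:  # skip the empty strings yielded on errors
                    out.write(line)
                    written += 1
    return written


if __name__ == "__main__":
    # In-memory stand-ins for the datastore query and the mapper.
    data = ["a", "b", "c"]

    def fake_fetch(page_size, cursor):
        start = cursor or 0
        rows = data[start:start + page_size]
        end = start + len(rows)
        return rows, end, end < len(data)

    def fake_map(row):
        yield '"%s"\n' % row

    buf = io.StringIO()
    export_pages(fake_fetch, fake_map, buf, "Column headers", page_size=2)
    print(buf.getvalue())
```

Writing page by page keeps memory bounded by the page size; whether it shortens the total run time depends on where the time is actually spent, which the question does not establish.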

datastore_map

def datastore_map(entity_type):
    try:
        data = entity_type.to_dict()
    except ValueError, e:
        error_msg = "Problem loading entity to dict: {e}".format(e=repr(e))
        logging.error(error_msg)
        yield ''

    try:
        value6 = OtherNDBTable.query(OtherNDBTable.value == data.value)
    except AttributeError, e:
        warn_msg = "Could not get the value for row {row_dict}\n{msg}".format(row_dict=data, msg=repr(e))
        value6 = ""
        logging.warning(warn_msg)

    try:
        result_list = [
            data.get('value1'),
            data.get('value2'),
            data.get('value3'),
            data.get('value4'),
            data.get('value5'),
            value6
        ]
    except Exception, e:
        logging.warning("Other Exception: {} for \n {}".format(repr(e), data))
        yield ''
    result = ','.join(['"%s"' % field for field in result_list])
    yield "%s\n" % result
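
A few things stand out in the mapper above. Execution continues after each `yield ''`, so a failed `to_dict()` goes on to use an undefined `data`. `data` is a dict, so `data.value` always raises `AttributeError`. And `OtherNDBTable.query(...)` only builds a query object; nothing is fetched from it. The following is a minimal sketch of the same control flow with early returns and CSV quoting from the standard `csv` module. `datastore_map_sketch` and `lookup_value6` are hypothetical names: the latter stands in for whatever lookup against `OtherNDBTable` is intended, and the field names are the placeholders from the question. It is written for Python 3; on the Python 2.7 runtime used in the question, `StringIO.StringIO` would take the place of `io.StringIO`.

```python
import csv
import io
import logging

# Placeholder column names taken from the question.
FIELDS = ("value1", "value2", "value3", "value4", "value5")


def datastore_map_sketch(entity, lookup_value6):
    """Yield one CSV line for `entity`, or nothing if it cannot be converted."""
    try:
        data = entity.to_dict()
    except ValueError as e:
        logging.error("Problem loading entity to dict: %r", e)
        return  # stop here: `data` was never assigned

    try:
        # `data` is a dict, so use .get() rather than attribute access.
        value6 = lookup_value6(data.get("value"))
    except Exception as e:  # hypothetical lookup; narrow this in real code
        logging.warning("Could not get the value for row %s\n%r", data, e)
        value6 = ""

    # Let the csv module handle quoting and embedded quote characters.
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([data.get(name) for name in FIELDS] + [value6])
    yield buf.getvalue()
```

Passing the lookup in as a callable keeps the sketch independent of the datastore; in the real mapper it would be a query that is actually executed.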

MapReduce pipeline

class DatastoreMapperPipeline(base_handler.PipelineBase):
    def run(self, entity_type):
        outputs = yield mapreduce_pipeline.MapperPipeline(
            "Datastore Mapper %s" % entity_type,
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=X) #X has been 10, 36, and 500 with no difference
        yield CloudStorageToBigQuery(outputs) # Doesn't get here

    def finalized(self):
        logging.debug("Pipeline {} has finished with outputs {}".format(self.pipeline_id, self.outputs))

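Closing

The App Engine logs show only that the URL which starts the job returned a 200. Nothing else appears in any log. The MapReduce dashboard shows the job as running, with all of its shards running. However, the last work item for each shard is "Unknown", and each shard's time is only a few seconds even though the total run time is hours. If you need any additional information to help answer, let me know.

Thanks in advance for your help.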

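Did you call start on the pipeline in your handler?

Hi, I'm Mario, a representative of Google Cloud Platform. Were you able to resolve this issue, or do you still need help? Thanks.

Actually, I need help. I have a 200 MB file, and I'm also using a combiner to speed up the process on a dual-core 4 GB system. My mapper has been running for over an hour. I can see records being generated, but it is too slow. I have set mapreduce.task.io.sort.mb=300, mapreduce.job.split.metainfo.maxsize=10000000, and mapred.reduce.parallel.copies=2, but to no avail. Please help.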