Python: dill error when triggering a Dataflow job from a Cloud Function

I'm writing a GCP Cloud Function that takes an input job id from a Pub/Sub message, processes the data, and outputs a table to BigQuery.

Here is the code:

from __future__ import absolute_import
import base64
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from scrapinghub import ScrapinghubClient
import os


def processing_data_function(element):
    # do stuff and return desired data
    ...


def create_data_from_id(job_id):
    # take scrapinghub's job id and extract the data through api
    ...

def run(event, context):
    """Triggered from a message on a Cloud Pub/Sub topic.
    Args:
         event (dict): Event payload.
         context (google.cloud.functions.Context): Metadata for the event.
    """
    # Take pubsub message and also Scrapinghub job's input id 
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')  

    argv = ['--project=project-name', 
            '--region=us-central1', 
            '--runner=DataflowRunner', 
            '--temp_location=gs://temp/location/', 
            '--staging_location=gs://staging/location/']
    p = beam.Pipeline(options=PipelineOptions(argv))
    (p
        | 'Read from Scrapinghub' >> beam.Create(create_data_from_id(pubsub_message))
        | 'Trim b string' >> beam.FlatMap(processing_data_function)
        | 'Write Projects to BigQuery' >> beam.io.WriteToBigQuery(
                'table_name',
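                # schema is assumed to be defined elsewhere (not shown in this excerpt)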
                schema=schema,
                # Creates the table in BigQuery if it does not yet exist.
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    p.run()


if __name__ == '__main__':
    run()
Note that the two functions
create_data_from_id
processing_data_function
process data from Scrapinghub (a scraping platform for Scrapy). They are quite long, so I don't want to include them here. They are also irrelevant to the error, because the code works if I run it from Cloud Shell and pass the arguments with
argparse.ArgumentParser()
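
For reference, the Cloud Shell variant presumably looked something like the usual Beam pattern below (a sketch; the --job_id flag name is illustrative, not from the original post):

import argparse
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument('--job_id', help='Scrapinghub job id (illustrative)')
known_args, pipeline_args = parser.parse_known_args()
# Flags like --project and --runner pass through to Beam untouched.
options = PipelineOptions(pipeline_args)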

As for the error: although the code deploys without problems and the Pub/Sub message successfully triggers the function, the Dataflow job fails with this error:

"Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 662, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 664, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 665, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 284, in apache_beam.runners.worker.operations.Operation.start
  File "apache_beam/runners/worker/operations.py", line 290, in apache_beam.runners.worker.operations.Operation.start
  File "apache_beam/runners/worker/operations.py", line 611, in apache_beam.runners.worker.operations.DoOperation.setup
  File "apache_beam/runners/worker/operations.py", line 616, in apache_beam.runners.worker.operations.DoOperation.setup
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 283, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'
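
The key line is the last one: ModuleNotFoundError: No module named 'main'. dill generally pickles a function by reference to the module that defined it, so the unpickling side must be able to import that module. A minimal sketch of the mechanism (illustrative, not from the original post):

import dill

def trim(element):
    # stand-in for a user-defined function used in the pipeline
    return element.strip()

# dill records a reference to trim's defining module rather than its source.
payload = dill.dumps(trim)

# Unpickling re-imports that module. This works locally, but a Dataflow
# worker has no module named 'main' to import, hence the error above.
restored = dill.loads(payload)
print(restored('  hello  '))  # -> hello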
What I've tried

Given that I could run the same pipeline from Cloud Shell, just using the argument parser instead of stating the options directly, I figured the way the options were specified was the problem. So I tried different combinations of options, with or without
--save_main_session
--staging_location
--requirements_file=requirements.txt
--setup_file=setup.py
They all reported more or less the same problem: dill doesn't know which module to pick up. With save_main_session specified, the main session couldn't be pickled. With requirements_file and setup_file specified, the job wasn't even created successfully, so I'll spare you the trouble of looking at those errors. My main problem is that I have no idea where this issue comes from, since I've never used dill before. Why is running the pipeline from the shell so different from running it from a Cloud Function? Does anyone have a clue?
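
Incidentally, these can also be set programmatically instead of as flag strings; a minimal sketch of enabling save_main_session that way, using Beam's SetupOptions view (variable names follow the code above):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(argv)
# Pickle the __main__ session's globals and restore them on the workers.
options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=options)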


Thanks.

You could also try modifying the last part as follows and test whether it works:

if __name__ == "__main__":
    ...
Also, make sure you execute the script from the correct folder, since the issue may be related to how the file is named or where it sits in the directory.

I hope this information helps.

You might be running the app on the cloud with gunicorn (as a standard practice), like:

CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app

I ran into the same issue and found a workaround: start the app without gunicorn:

CMD exec python3 main.py

It's probably because gunicorn skips the main context and starts the main:app object directly. I don't know how to fix it while still using gunicorn.
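
A quick way to see the difference is to log the module name at import time; under python3 main.py the file runs as __main__, whereas gunicorn imports it as plain main (a hypothetical check, not part of the original workaround):

# at the top of main.py
print('loaded as module:', __name__)
# python3 main.py    -> loaded as module: __main__
# gunicorn main:app  -> loaded as module: main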

=== Additional notes ===

I found a way to make it work with gunicorn:

  • Move the function that starts the pipeline into a separate module, such as
    df_pipeline/pipe.py
  • Create a setup.py in the same directory as main.py (shown below)
  • In df_pipeline/pipe.py, set the pipeline option setup_file to ./setup.py
    (a sketch of this file follows after the snippets below)
  • Not sure whether it is needed, but then run it with
    if __name__ == "__main__": run()

From the comments: "The error occurs when I call the deployed Cloud Function on GCP, not when I run the script from the command line, so I don't think it's due to the file's location. The Cloud Function also has the typical structure of an ordinary Cloud Function. Are you still facing this problem?" - "Ah, you're using Cloud Functions, not Cloud Run. But the root cause may be the same."
.
├── df_pipeline
│   ├── __init__.py
│   └── pipe.py
├── Dockerfile
├── main.py
├── requirements.txt
└── setup.py

# in main.py
import df_pipeline as pipe
result = pipe.preprocess(....)

# setup.py
import setuptools

setuptools.setup(
    name='df_pipeline',
    install_requires=[],
    packages=setuptools.find_packages(include=['df_pipeline']),
)
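
For completeness, a minimal sketch of what df_pipeline/pipe.py could look like under this layout; the function name preprocess matches the main.py snippet above, while the flags and bucket paths are carried over from the question and otherwise illustrative:

# df_pipeline/pipe.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


def preprocess(job_id):
    options = PipelineOptions([
        '--project=project-name',
        '--region=us-central1',
        '--runner=DataflowRunner',
        '--temp_location=gs://temp/location/',
        '--staging_location=gs://staging/location/',
    ])
    # Ship this package to the workers so its modules are importable there,
    # instead of relying on pickled references to the entry module 'main'.
    options.view_as(SetupOptions).setup_file = './setup.py'
    with beam.Pipeline(options=options) as p:
        p | 'Create input' >> beam.Create([job_id])  # placeholder for the real steps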