Google Cloud Platform: can't create a Dataflow template because the Scrapinghub client library doesn't accept ValueProviders

Tags: google-cloud-platform, dataflow, scrapinghub, apache-beam

I am trying to create a Dataflow template that can be called from a Cloud Function triggered by a Pub/Sub message. The Pub/Sub message sends a job id from Scrapinghub (a scraping platform) to a Cloud Function, which triggers a Dataflow template whose input is the job id and whose output is the corresponding data written to BigQuery. All the other steps of this design are finished, but I can't create the template, possibly because of an incompatibility between Scrapinghub's client library and Apache Beam.
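For reference, here is a rough sketch of the Cloud Function side of the design (the project, region, template path, table and function name below are my own placeholders; it assumes the standard templates.launch method of the Dataflow REST API via google-api-python-client):

# Sketch of a Pub/Sub-triggered Cloud Function that launches the template.
# All ids, paths and table names below are placeholders.
import base64

from googleapiclient.discovery import build


def launch_template(event, context):
    # The Pub/Sub payload is base64-encoded and assumed to hold the job id.
    job_id = base64.b64decode(event['data']).decode('utf-8')

    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().locations().templates().launch(
        projectId='project-name',
        location='us-central1',
        gcsPath='gs://templates/location/template-name',
        body={
            'jobName': 'scrapinghub-ingest-' + job_id.replace('/', '-'),
            'parameters': {
                'input': job_id,
                'output': 'project-name:dataset.table',
            },
        },
    )
    return request.execute()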

The code:

from __future__ import absolute_import
import argparse
import logging
import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.value_provider import StaticValueProvider
from scrapinghub import ScrapinghubClient


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input')
        parser.add_value_provider_argument('--output', type=str)


class IngestionBQ:
    def __init__(self): pass

    @staticmethod
    def parse_method(item):
        dic = {k: item[k] for k in item if k not in [b'_type', b'_key']}
        new_d = {}
        for key in dic:
            try: 
                new_d.update({key.decode("utf-8"): dic[key].decode("utf-8")})
            except AttributeError:
                new_d.update({key.decode("utf-8"): dic[key]})
        yield new_d          


class ShubConnect():
    def __init__(self, api_key, job_id):
        self.job_id = job_id
        self.client = ScrapinghubClient(api_key)

    def get_data(self):
        data = []
        item = self.client.get_job(self.job_id)
        for i in item.items.iter():
            data.append(i)
        return data


def run(argv=None, save_main_session=True):
    """The main function which creates the pipeline and runs it."""
    data_ingestion = IngestionBQ()
    pipeline_options = PipelineOptions()
    p = beam.Pipeline(options=pipeline_options)
    api_key = os.environ.get('api_key')
    user_options = pipeline_options.view_as(UserOptions)
    (p
        | 'Read Data from Scrapinghub' >> beam.Create(ShubConnect(api_key, user_options.input).get_data())
        | 'Trim b string' >> beam.FlatMap(data_ingestion.parse_method)
        | 'Write Projects to BigQuery' >> beam.io.WriteToBigQuery(
                user_options.output,
                schema=schema,
                # Creates the table in BigQuery if it does not yet exist.
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY)
     )
    p.run()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
I deploy the template from Cloud Shell with the following command:

python main.py \
    --project=project-name \
    --region=us-central1 \
    --runner=DataflowRunner \
    --temp_location gs://temp/location/ \
    --template_location gs://templates/location/
And this error came up:

Traceback (most recent call last):
  File "main.py", line 69, in <module>
    run()
  File "main.py", line 57, in run
    | 'Write Projects to BigQuery' >> beam.io.WriteToBigQuery(
  File "main.py", line 41, in get_data
    item = self.client.get_job(self. job_id)
  File "/home/user/data-flow/venv/lib/python3.7/site-packages/scrapinghub/client/__init__.py", line 99, in get_job
    project_id = parse_job_key(job_key).project_id
  File "/home/user/data-flow/venv/lib/python3.7/site-packages/scrapinghub/client/utils.py", line 60, in parse_job_key
    .format(type(job_key), repr(job_key)))
ValueError: Job key should be a string or a tuple, got <class 'apache_beam.options.value_provider.RuntimeValueProvider'>: <apache_beam.options.value_provider.RuntimeValueProvider object at 0x7f14760a3630>

After reading the documentation, I understand that the error occurs because ValueProvider objects are not supported for non-I/O modules. Reference:


So, to achieve what I need, I could either switch to the Java SDK or come up with a different design. But until non-I/O modules support ValueProvider, this path looks like a dead end.
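For completeness, this is the kind of change a workaround would require (a rough sketch I have not fully verified: it assumes ValueProvider.get() can be called inside a DoFn at execution time, which is when runtime parameters become available, and that deferring the Scrapinghub call there avoids resolving the value while the template is being built):

# Sketch: read from Scrapinghub inside a DoFn so the job id ValueProvider is
# only resolved with .get() at execution time, not during template creation.
import apache_beam as beam


class ReadFromScrapinghub(beam.DoFn):
    def __init__(self, api_key, job_id_provider):
        self.api_key = api_key
        self.job_id_provider = job_id_provider  # ValueProvider for --input

    def process(self, _):
        from scrapinghub import ScrapinghubClient
        client = ScrapinghubClient(self.api_key)
        job = client.get_job(self.job_id_provider.get())  # resolved at runtime
        for item in job.items.iter():
            yield item

# Used in place of the beam.Create(...get_data()) step:
#     | 'Seed' >> beam.Create([None])
#     | 'Read Data from Scrapinghub' >> beam.ParDo(
#           ReadFromScrapinghub(api_key, user_options.input))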

Hello @pa nguyen, I think you are right: in the most common way of composing Dataflow templates, ValueProvider is not accepted by non-I/O modules. However, this is still an open issue even for I/O transforms specifically related to the Python SDK, which would explain more. Could you post an answer summarizing your findings?