Google BigQuery: Beam job creates the BigQuery table, but inserts nothing


I am writing a Beam job that does a simple 1:1 ETL, converting binary protobuf files stored in GCS into BigQuery. The table schema is quite large and is auto-generated from a representative protobuf.

The behavior I am seeing is that the BigQuery table is created successfully, but no records are inserted. I have confirmed that records are produced by the earlier stages, and when I swap in a plain file sink I can confirm the records are written.

Does anyone know why this is happening?

Logs:

WARNING:root:Inferring Schema...
WARNING:root:Could not find default credentials to use: Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Defining Beam Pipeline...
/venv/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py:1145: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  experiments = p.options.view_as(DebugOptions).experiments or []
WARNING:root:Running Beam Pipeline...
WARNING:root:extracted {'counters': [MetricResult(key=MetricKey(step=extract_games, metric=MetricName(namespace=__main__.ExtractGameProtobuf, name=extracted_games), labels={}), committed=8, attempted=8)], 'distributions': [], 'gauges': []} games
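The credentials warning above is worth ruling out first: the job fell back to an anonymous connection, which may be related to the missing inserts. A minimal sketch of configuring Application Default Credentials locally (the key path is a placeholder, not from the asker's setup):

```shell
# Point ADC at a service-account key file (path is a placeholder)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

# Or, for local development, use your own user credentials instead
gcloud auth application-default login
```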
Pipeline source:

def main(args):
    DEFAULT_REPLAY_IDS_PATH = "./replay_ids.txt"

    DEFAULT_BQ_TABLE_OUT = "<PROJECT REDACTED>:<DATASET REDACTED>.games"

    # configure logging
    logging.basicConfig(level=logging.WARNING)

    # set up replay source
    replay_source = ETLReplayRemoteSource.default()

    # TODO: load the example replay and parse schema
    logging.warning("Inferring Schema...")
    sample_replay = replay_source.load_replay(DEFAULT_REPLAY_IDS[0])
    game_schema = ProtobufToBigQuerySchemaGenerator(
        sample_replay.analysis.DESCRIPTOR).schema()
    # print("GAME SCHEMA:\n{}".format(game_schema))  # DEBUG

    # submit beam job that reads replays into bigquery

    def count_ones(word_ones):
        (word, ones) = word_ones
        return (word, sum(ones))

    with beam.Pipeline(options=PipelineOptions()) as p:
        logging.warning("Defining Beam Pipeline...")
        # replay_ids = p | "create_replay_ids" >> beam.Create(DEFAULT_REPLAY_IDS)
        (p | "read_replay_ids" >> beam.io.ReadFromText(DEFAULT_REPLAY_IDS_PATH)
           | "extract_games" >> beam.ParDo(ExtractGameProtobuf())
           | "write_out_bq" >> WriteToBigQuery(
            DEFAULT_BQ_TABLE_OUT,
            schema=game_schema,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)
         )

        logging.warning("Running Beam Pipeline...")
        result = p.run()
        result.wait_until_finish()
        n_extracted = result.metrics().query(
            MetricsFilter().with_name('extracted_games'))
        logging.warning("extracted {} games".format(n_extracted))
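As an aside on the last log line: `result.metrics().query(...)` returns a dict of `MetricResult` lists, which is why the log reads `extracted {'counters': [...]} games` instead of a number. A small sketch of pulling the committed count out of that dict (`MetricResult` here is a namedtuple stand-in for Beam's own class, purely for illustration):

```python
from collections import namedtuple

# Stand-in for apache_beam.metrics.execution.MetricResult (illustration only)
MetricResult = namedtuple("MetricResult", "committed attempted")

def committed_count(metrics_result):
    # metrics_result is the dict returned by result.metrics().query(...),
    # shaped like {'counters': [...], 'distributions': [...], 'gauges': [...]}
    return sum(m.committed for m in metrics_result["counters"])

# Mirrors the log above: one counter with committed=8, attempted=8
print(committed_count({"counters": [MetricResult(8, 8)]}))  # → 8
```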

How are you checking whether the records were written to BigQuery? Are you waiting for the job to finish before checking? Look at the project's jobs in BigQuery: do you see load jobs being started?
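To check the commenter's questions concretely, the `bq` CLI can show whether a load job was ever started and how many rows actually landed. A sketch (project and dataset names are placeholders, matching the redacted table above):

```shell
# List recent jobs for the project -- look for a completed "load" job
bq ls --jobs --max_results=10

# Count the rows actually present in the target table
bq query --use_legacy_sql=false \
    'SELECT COUNT(*) AS n FROM `PROJECT.DATASET.games`'
```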