Google BigQuery: Beam job creates the BigQuery table, but inserts nothing
I am writing a Beam job that does a simple 1:1 ETL, converting binary protobuf files stored in GCS into BigQuery. The table schema is quite large and is auto-generated from a representative protobuf. The behavior I am seeing is that the BigQuery table is created successfully, but no records are inserted. I have confirmed that records are produced by the earlier stages, and when I use an ordinary file sink I can confirm that the records are written. Does anyone know why this is happening? Logs:
WARNING:root:Inferring Schema...
WARNING:root:Could not find default credentials to use: Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Defining Beam Pipeline...
/venv/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py:1145: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  experiments = p.options.view_as(DebugOptions).experiments or []
WARNING:root:Running Beam Pipeline...
WARNING:root:extracted {'counters': [MetricResult(key=MetricKey(step=extract_games, metric=MetricName(namespace=__main__.ExtractGameProtobuf, name=extracted_games), labels={}), committed=8, attempted=8)], 'distributions': [], 'gauges': []} games
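The "Could not find default credentials" warning means the client fell back to an anonymous connection, which cannot start BigQuery load jobs. A minimal sketch of pointing the client at a service-account key before the pipeline is constructed (the key path here is hypothetical; substitute your own):

```python
import os

# Hypothetical path to a service-account JSON key.
# Must be set before any GCP client library reads credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```

Alternatively, export the variable in the shell before launching the job.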
Pipeline source:
def main(args):
    DEFAULT_REPLAY_IDS_PATH = "./replay_ids.txt"
    DEFAULT_BQ_TABLE_OUT = "<PROJECT REDACTED>:<DATASET REDACTED>.games"

    # configure logging
    logging.basicConfig(level=logging.WARNING)

    # set up replay source
    replay_source = ETLReplayRemoteSource.default()

    # TODO: load the example replay and parse schema
    logging.warning("Inferring Schema...")
    sample_replay = replay_source.load_replay(DEFAULT_REPLAY_IDS[0])
    game_schema = ProtobufToBigQuerySchemaGenerator(
        sample_replay.analysis.DESCRIPTOR).schema()
    # print("GAME SCHEMA:\n{}".format(game_schema))  # DEBUG

    # submit beam job that reads replays into bigquery
    def count_ones(word_ones):
        (word, ones) = word_ones
        return (word, sum(ones))

    with beam.Pipeline(options=PipelineOptions()) as p:
        logging.warning("Defining Beam Pipeline...")
        # replay_ids = p | "create_replay_ids" >> beam.Create(DEFAULT_REPLAY_IDS)
        (p | "read_replay_ids" >> beam.io.ReadFromText(DEFAULT_REPLAY_IDS_PATH)
           | "extract_games" >> beam.ParDo(ExtractGameProtobuf())
           | "write_out_bq" >> WriteToBigQuery(
               DEFAULT_BQ_TABLE_OUT,
               schema=game_schema,
               write_disposition=BigQueryDisposition.WRITE_APPEND,
               create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))

        logging.warning("Running Beam Pipeline...")
        result = p.run()
        result.wait_until_finish()
        n_extracted = result.metrics().query(
            MetricsFilter().with_name('extracted_games'))
        logging.warning("extracted {} games".format(n_extracted))
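As the last log line shows, `result.metrics().query(...)` returns a dict of MetricResult lists rather than a plain number, which is why the "extracted ... games" message prints the whole structure. A small sketch of a helper that sums the committed counter values (the namedtuple below is only a stand-in for Beam's MetricResult, used to exercise the helper):

```python
from collections import namedtuple

def committed_count(query_result):
    """Sum committed values across all counters returned by
    PipelineResult.metrics().query(...)."""
    return sum(c.committed for c in query_result.get("counters", []))

# Stand-in for a Beam MetricResult, for illustration only:
FakeMetricResult = namedtuple("FakeMetricResult", ["committed", "attempted"])
sample = {"counters": [FakeMetricResult(committed=8, attempted=8)],
          "distributions": [], "gauges": []}
print(committed_count(sample))  # → 8
```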
How are you checking whether records were written to BigQuery? Are you waiting for the job to finish before checking? If you look at the project's jobs in BigQuery, do you see load jobs being started?
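Following the comment above, one way to verify whether load jobs were actually started is to list recent jobs through the google-cloud-bigquery client. A hedged sketch (`find_recent_bq_load_jobs` is a helper name invented here; it accepts any object exposing a compatible `list_jobs` method):

```python
def find_recent_bq_load_jobs(client, max_results=10):
    """Return (job_id, state, error_result) for recent load jobs.

    `client` is expected to be a google.cloud.bigquery.Client
    (or anything exposing the same list_jobs interface).
    """
    return [(job.job_id, job.state, job.error_result)
            for job in client.list_jobs(max_results=max_results)
            if job.job_type == "load"]

# Usage, assuming credentials are configured:
# from google.cloud import bigquery
# for job_id, state, error in find_recent_bq_load_jobs(bigquery.Client()):
#     print(job_id, state, error)
```

If no load jobs appear at all, the write stage never reached BigQuery; if jobs appear with a non-None `error_result`, the failure reason is recorded there.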