Google BigQuery: Beam job creates the BigQuery table, but inserts nothing
I am writing a Beam job that does a simple 1:1 ETL, converting binary protobuf files stored in GCS into BigQuery. The table schema is quite large and is auto-generated from a representative protobuf. The behavior I am seeing is that the BigQuery table is created successfully, but no records are inserted. I have confirmed that records are produced by the earlier stages, and when I use an ordinary file sink I can confirm that the records are written. Does anyone know why this is happening? Logs:
WARNING:root:Inferring Schema...
WARNING:root:Could not find default credentials to use: Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Defining Beam Pipeline...
/venv/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py:1145: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  experiments = p.options.view_as(DebugOptions).experiments or []
WARNING:root:Running Beam Pipeline...
WARNING:root:extracted {'counters': [MetricResult(key=MetricKey(step=extract_games, metric=MetricName(namespace=__main__.ExtractGameProtobuf, name=extracted_games), labels={}), committed=8, attempted=8)], 'distributions': [], 'gauges': []} games
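The "Could not find default credentials" warning means the client fell back to an anonymous connection, which cannot start BigQuery load jobs. A minimal sketch of pointing the client at a service-account key before the pipeline is constructed (the key path here is hypothetical; substitute your own):

```python
import os

# Hypothetical path to a service-account JSON key.
# Must be set before any GCP client library reads credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```

Alternatively, export the variable in the shell before launching the job.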
Pipeline source:
def main(args):
    DEFAULT_REPLAY_IDS_PATH = "./replay_ids.txt"
    DEFAULT_BQ_TABLE_OUT = "<PROJECT REDACTED>:<DATASET REDACTED>.games"

    # configure logging
    logging.basicConfig(level=logging.WARNING)

    # set up replay source
    replay_source = ETLReplayRemoteSource.default()

    # TODO: load the example replay and parse schema
    logging.warning("Inferring Schema...")
    sample_replay = replay_source.load_replay(DEFAULT_REPLAY_IDS[0])
    game_schema = ProtobufToBigQuerySchemaGenerator(
        sample_replay.analysis.DESCRIPTOR).schema()
    # print("GAME SCHEMA:\n{}".format(game_schema))  # DEBUG

    # submit beam job that reads replays into bigquery
    def count_ones(word_ones):
        (word, ones) = word_ones
        return (word, sum(ones))

    with beam.Pipeline(options=PipelineOptions()) as p:
        logging.warning("Defining Beam Pipeline...")
        # replay_ids = p | "create_replay_ids" >> beam.Create(DEFAULT_REPLAY_IDS)
        (p | "read_replay_ids" >> beam.io.ReadFromText(DEFAULT_REPLAY_IDS_PATH)
           | "extract_games" >> beam.ParDo(ExtractGameProtobuf())
           | "write_out_bq" >> WriteToBigQuery(
               DEFAULT_BQ_TABLE_OUT,
               schema=game_schema,
               write_disposition=BigQueryDisposition.WRITE_APPEND,
               create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))

        logging.warning("Running Beam Pipeline...")
        result = p.run()
        result.wait_until_finish()
        n_extracted = result.metrics().query(
            MetricsFilter().with_name('extracted_games'))
        logging.warning("extracted {} games".format(n_extracted))
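As the last log line shows, `result.metrics().query(...)` returns a dict of MetricResult lists rather than a plain number, which is why the "extracted ... games" message prints the whole structure. A small sketch of a helper that sums the committed counter values (the namedtuple below is only a stand-in for Beam's MetricResult, used to exercise the helper):

```python
from collections import namedtuple

def committed_count(query_result):
    """Sum committed values across all counters returned by
    PipelineResult.metrics().query(...)."""
    return sum(c.committed for c in query_result.get("counters", []))

# Stand-in for a Beam MetricResult, for illustration only:
FakeMetricResult = namedtuple("FakeMetricResult", ["committed", "attempted"])
sample = {"counters": [FakeMetricResult(committed=8, attempted=8)],
          "distributions": [], "gauges": []}
print(committed_count(sample))  # → 8
```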
How are you checking whether records were written to BigQuery? Are you waiting for the job to finish before checking? If you look at the project's jobs in BigQuery, do you see load jobs being started?
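Following the comment above, one way to verify whether load jobs were actually started is to list recent jobs through the google-cloud-bigquery client. A hedged sketch (`find_recent_bq_load_jobs` is a helper name invented here; it accepts any object exposing a compatible `list_jobs` method):

```python
def find_recent_bq_load_jobs(client, max_results=10):
    """Return (job_id, state, error_result) for recent load jobs.

    `client` is expected to be a google.cloud.bigquery.Client
    (or anything exposing the same list_jobs interface).
    """
    return [(job.job_id, job.state, job.error_result)
            for job in client.list_jobs(max_results=max_results)
            if job.job_type == "load"]

# Usage, assuming credentials are configured:
# from google.cloud import bigquery
# for job_id, state, error in find_recent_bq_load_jobs(bigquery.Client()):
#     print(job_id, state, error)
```

If no load jobs appear at all, the write stage never reached BigQuery; if jobs appear with a non-None `error_result`, the failure reason is recorded there.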