BigQuery Storage API multiprocessing segfault

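Long-time reader, first-time poster. I'm using the BigQuery Storage API Python client library, and I'm running into trouble splitting my readers across processes with Python multiprocessing.

The documentation includes a note that says:

"Because this client uses the grpcio library, it is safe to share instances across threads. In multiprocessing scenarios, the best practice is to create client instances after the invocation of os.fork() by multiprocessing.Pool or multiprocessing.Process."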
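To make the quoted note concrete, here is a minimal sketch of one way to guarantee that a client is only ever built inside the process that uses it. The class name `PerProcessClient` and the `factory` argument are hypothetical and not part of the client library; in the script below, the factory would presumably be `BigQueryReadClient`. The sketch only covers clients used inside workers and says nothing about a client the parent process created before forking.

```python
import os


class PerProcessClient:
    """Lazily builds one client per process and rebuilds it after a fork.

    `factory` is any zero-argument callable that returns a client
    (hypothetical helper, not part of google-cloud-bigquery-storage).
    """

    def __init__(self, factory):
        self._factory = factory
        self._client = None
        self._pid = None

    def get(self):
        pid = os.getpid()
        # Build a fresh client if none exists yet, or if the cached one
        # was created in a different process (that is, before a fork).
        if self._client is None or self._pid != pid:
            self._client = self._factory()
            self._pid = pid
        return self._client
```

A worker function would call `holder.get()` at the start of each task instead of reusing a module-level client that was built before the fork.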

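I think I'm doing this correctly... but I must not be.

Here is my code as it currently stands. The goal is to read a BQ table in multiple parallel streams and write the rows from each stream to its own CSV file. Once all the CSV files are created, I run a simple cat command to combine them.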
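For reference, the combine step can also be done in Python. This is a minimal sketch, assuming the part files follow the `<table_name>_<i>.csv` naming used in the script below; the function name `combine_csv_parts` and the `combined_name` parameter are hypothetical. Unlike a shell glob, it walks the parts in numeric order, so part 10 comes after part 9 rather than after part 1.

```python
import shutil
from pathlib import Path


def combine_csv_parts(out_dir, table_name, num_parts, combined_name):
    """Concatenate <table_name>_<i>.csv for i in 0..num_parts-1, in order.

    Only part 0 is expected to carry the header row, so the parts are
    appended byte for byte without any CSV parsing.
    """
    out_dir = Path(out_dir)
    combined = out_dir / combined_name
    with open(combined, "wb") as dst:
        for i in range(num_parts):
            part = out_dir / f"{table_name}_{i}.csv"
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    return combined
```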

As a side note, this code actually works for small BigQuery tables, but it fails with a segfault when I try to download large BQ tables.

import faulthandler
faulthandler.enable()
from google.cloud.bigquery_storage import BigQueryReadClient
from google.cloud.bigquery_storage import types
import multiprocessing as mp
import psutil
import os
import sys
import csv
from datetime import datetime


def extract_table(i):

    client_in = BigQueryReadClient()
    reader_in = client_in.read_rows(session.streams[i].name, timeout=10000)

    rows = reader_in.rows(session)

    csv_file = "/home/user/sas/" + table_name + "_" + str(i) + ".csv"
    print(f"Starting at time {datetime.now()} for file {csv_file}")

    try:
        with open(csv_file, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
            if i == 0:
                writer.writeheader()
            else:
                pass
            for data in rows:
                # print(data)
                writer.writerow(data)
    except IOError:
        print("I/O error")

    print(f"Finished at time {datetime.now()} for file {csv_file}")
    return


if __name__ == '__main__':
    # Get input args
    project_id = sys.argv[1]
    db_name = sys.argv[2]
    table_name = sys.argv[3]

    n = len(sys.argv[4])
    a = sys.argv[4][1:n - 1]
    csv_columns = a.replace("'", '').split(', ')

    output_type = sys.argv[5]  # csv or sas
    bucket_root = sys.argv[6]

    # The read session is created in this project. This project can be
    # different from that which contains the table.
    client = BigQueryReadClient()

    table = "projects/{}/datasets/{}/tables/{}".format(
        project_id, db_name, table_name
    )

    requested_session = types.ReadSession()
    requested_session.table = table
    
    # This API can also deliver data serialized in Apache Arrow format.
    # This example leverages Apache Avro.
    requested_session.data_format = types.DataFormat.AVRO

    # We limit the output columns to a subset of those allowed in the table
    requested_session.read_options.selected_fields = csv_columns
    
    ncpus = psutil.cpu_count(logical=False)

    if ncpus <= 2:
        ncpus_buffer = 2
    else:
        ncpus_buffer = ncpus - 2

    print(f"You have {ncpus} cores according to psutil. Using {ncpus_buffer} cores")

    parent = "projects/{}".format(project_id)
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=ncpus_buffer,
    )

    print(f"There are {len(session.streams)} streams")

    num_streams = int(len(session.streams))

    with mp.Pool(processes=ncpus_buffer) as p:
        result = p.map(extract_table, list(range(0, num_streams)), chunksize=1)

Again, this works for small tables, and a few times I have gotten it to work on very large BQ tables in the 50-100 GB size range. Most of the time, however, large tables fail with the following error:

There are 1000 streams
You have 2 cores according to psutil. Using 2 cores
Starting at time 2020-11-17 17:46:04.645398 for file /home/user/sas/diag_0.csv

Starting at time 2020-11-17 17:46:04.829381 for file /home/user/sas/diag_1.csv

Fatal Python error: Segmentation fault

Thread 0x00007f4293f94700 (most recent call first):
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/site-packages/grpc/_channel.py", line 1235 in channel_spin
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/threading.py", line 870 in run
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007F42BC9740 (most recent call first):
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/csv.py", line 151 in _dict_to_list
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/csv.py", line 154 in writerow
  File "/home/user/sas/bq_extract_2.py", line 39 in extract_table
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/pool.py", line 48 in mapstar
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/pool.py", line 125 in worker
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/process.py", line 108 in run
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/process.py", line 315 in _bootstrap
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/popen_fork.py", line 75 in _launch
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/popen_fork.py", line 19 in __init__
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/context.py", line 277 in _Popen
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/process.py", line 121 in start
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/pool.py", line 326 in _repopulate_pool_static
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/pool.py", line 303 in _repopulate_pool
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/pool.py", line 212 in __init__
  File "/home/user/anaconda3/envs/sas-controller/lib/python3.8/multiprocessing/context.py", line 119 in Pool
  File "/home/user/sas/bq_extract_2.py", line 157 in <module>

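Edit 1: Updated the timeout on .read_rows to 10000 to allow large results to be read from BQ. Also changed max_stream_count to equal the number of cores the Pool will use. This seemed to help quite a bit in my testing, but I still see segfaults in the console output when I run this as a startup script on Google Cloud Compute instances.

Edit 2: The more I look into this, the less it seems possible to use Python multiprocessing effectively with the Google BigQuery Storage API. Given the need to create read sessions after the invocation of os.fork(), I see no way to ensure that the individual processes are assigned the correct number of rows to read. Each session creates its own one-to-many (one session to many streams) relationship with the BQ table it is attached to, and each session appears to divide the table rows across its streams slightly differently.

Take, for example, a table with 30 rows that we want to export with 3 processes, each processing a single stream of rows. Formatting might look weird on mobile.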

                       os.fork()

Process 1              Process 2              Process 3
Session1               Session2               Session3
*Stream1 - 10 rows     Stream1 - 8 rows       Stream1 - 9 rows
Stream2 - 10 rows      *Stream2 - 12 rows     Stream2 - 11 rows
Stream3 - 10 rows      Stream3 - 10 rows      *Stream3 - 10 rows

In this example, we end up with 32 output rows (10 + 12 + 10) because each session does not define its streams in exactly the same way.
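The arithmetic in the diagram can be checked with a small simulation. This sketch uses only the row counts shown above and makes no API calls; the names `sessions` and `rows_read` are illustrative. It also shows the contrasting case, under the assumption that all workers read distinct streams of one shared session (for example by passing the stream names, which are plain strings, to the workers), where the total comes back to 30.

```python
# Row counts per stream, copied from the diagram above.
sessions = {
    "Session1": [10, 10, 10],
    "Session2": [8, 12, 10],
    "Session3": [9, 11, 10],
}


def rows_read(assignments):
    """Sum rows over (session, stream_index) pairs, one pair per process."""
    return sum(sessions[name][index] for name, index in assignments)


# Each process opens its own session and reads "its" stream index.
per_process_sessions = [("Session1", 0), ("Session2", 1), ("Session3", 2)]

# All processes read distinct streams of a single shared session.
shared_session = [("Session1", 0), ("Session1", 1), ("Session1", 2)]

print(rows_read(per_process_sessions))  # 32
print(rows_read(shared_session))        # 30
```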

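I tried using threads (code below) instead of processes, and that worked because gRPC is thread-safe.

import threading

# create read session here

# Then call the target worker function with one thread per worker
for i in range(0, num_streams):
    t = threading.Thread(target=extract_table, args=(i,))
    t.start()

However, the big problem is that using 8 threads takes just as long as using 1 thread, and the aggregate throughput across threads appears to max out at ~5 MB/s no matter how many threads are used.

This is in contrast to using processes, where throughput appears to scale linearly as workers are added (I saw up to ~100 MB/s in some tests)... on the rare occasions when I was able to get it to run without a segfault interrupting things. That appeared to be pure luck.

Using 1 thread:
Total time: ~3:11

Using 8 threads:
Total time: ~3:15

As far as I can tell, there is essentially no speed benefit to using multiple threads.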
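One possible explanation, which is an assumption and not something established in this post, is that decoding rows is CPU-bound Python work, so on a standard CPython build the threads take turns holding the global interpreter lock instead of running in parallel. The sketch below uses a hypothetical stand-in function, `decode_like_work`, rather than the real Avro decoding, and simply compares serial and threaded wall time for the same CPU-bound tasks.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def decode_like_work(n):
    """Stand-in for per-row decoding: a pure-Python CPU-bound loop."""
    total = 0
    for k in range(n):
        total += k * k
    return total


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


tasks = [200_000] * 8

# Same work, first one task after another, then spread over 8 threads.
serial, t_serial = timed(lambda: [decode_like_work(n) for n in tasks])
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded, t_threaded = timed(lambda: list(pool.map(decode_like_work, tasks)))

print(f"serial: {t_serial:.3f}s, 8 threads: {t_threaded:.3f}s")
```

If the two timings come out close on a given machine, that would be consistent with the flat 1-thread versus 8-thread numbers above, though it does not prove the same cause applies to the BigQuery reader.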
If anyone has any thoughts on what I'm missing, please let me know.

with concurrent.futures.ThreadPoolExecutor(max_workers=num_streams) as p:
    result = p.map(extract_table, list(range(0, num_streams)), chunksize=1)