Python: GCS broken pipe error when trying to upload a large file
Tags: python, google-cloud-platform, google-cloud-storage, airflow
I am trying to decompress a .csv.gz file to .csv and then upload it to GCS; the file grows from 500 MB to about 5 GB. I am able to extract the .csv.gz file to a temporary path, but when I try to upload that file to GCS, it fails. I get the following error:
[2019-11-11 13:59:58,180] {models.py:1796} ERROR - [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models.py", line 1664, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/airflow/gcs/dags/operators/s3_to_gcs_transform_operator.py", line 220, in execute
    gcs_hook.upload(dest_gcs_bucket, dest_gcs_object, target_file, gzip=True)
  File "/home/airflow/gcs/dags/hooks/gcs_hook_conn.py", line 208, in upload
    .insert(bucket=bucket, name=object, media_body=media)
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 835, in execute
    method=str(self.method), body=self.body, headers=self.headers)
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 179, in _retry_request
    raise exception
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 162, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/python3.6/lib/python3.6/site-packages/google_auth_httplib2.py", line 198, in request
    uri, method, body=body, headers=request_headers, **kwargs)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 155, in new_request
    redirections, connection_type)
  File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1924, in request
    cachekey
  File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1595, in _request
    conn, request_uri, method, body, headers
  File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1502, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/python3.6/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/python3.6/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/python3.6/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/python3.6/lib/python3.6/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/opt/python3.6/lib/python3.6/http/client.py", line 986, in send
    self.sock.sendall(data)
  File "/opt/python3.6/lib/python3.6/ssl.py", line 975, in sendall
    v = self.send(byte_view[count:])
  File "/opt/python3.6/lib/python3.6/ssl.py", line 944, in send
    return self._sslobj.write(data)
  File "/opt/python3.6/lib/python3.6/ssl.py", line 642, in write
    return self._sslobj.write(data)
BrokenPipeError: [Errno 32] Broken pipe
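As far as I understand, the error may be caused by the following:
Your server process has received a SIGPIPE while writing to a socket. This
usually happens when you write to a socket that has been fully closed on the
other (client) side. It can happen when a client program does not
wait until all the data from the server has been received and simply closes the
socket (using the close function).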
But I don't know whether this is actually the problem, or how to fix it. Can anyone help?

You should try uploading the large file in chunks:
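from google.cloud import storage

# 128 MiB per chunk; chunk_size must be a multiple of 256 KiB
CHUNK_SIZE = 128 * 1024 * 1024

client = storage.Client()
bucket = client.bucket('destination')
# Setting chunk_size makes the client upload the file in pieces of that size
blob = bucket.blob('really-big-blob', chunk_size=CHUNK_SIZE)
blob.upload_from_filename('/path/to/really-big-file')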
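Since the question decompresses a .csv.gz to a temporary path before uploading, here is a minimal sketch of a streaming decompression step that avoids holding the roughly 5 GB result in memory. The helper name `decompress_gzip` and the buffer size are illustrative assumptions, not part of the original operator; the returned path could then be passed to `blob.upload_from_filename`.

```python
import gzip
import shutil


def decompress_gzip(src_path, dest_path, buffer_size=16 * 1024 * 1024):
    """Stream-decompress a .gz file to dest_path in fixed-size buffers.

    decompress_gzip and buffer_size are hypothetical names chosen for this sketch.
    """
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # copyfileobj reads and writes buffer_size bytes at a time,
        # so memory use stays bounded regardless of the file size.
        shutil.copyfileobj(src, dest, buffer_size)
    return dest_path
```
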
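You can also check a similar SO question.

Please update your code once you have it running; I would like to see the changes you made.

Yes, I fixed this by setting resumable=True and specifying chunksize in the MediaFileUpload() call inside gcs_hook.upload().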
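The fix described in the last comment could look roughly like the sketch below. The original hook code was not posted, so this is an assumption about its shape: the function names `build_insert_request` and `run_resumable_upload`, the 64 MiB chunk size, and the retry count are all illustrative. It assumes `service` is a Cloud Storage client built with googleapiclient, as in the traceback.

```python
# Hypothetical chunk size for this sketch; a multiple of 256 KiB.
CHUNK_SIZE = 64 * 1024 * 1024


def build_insert_request(service, bucket, object_name, filename,
                         chunksize=CHUNK_SIZE):
    """Build an objects().insert() request that uploads in resumable chunks."""
    # Imported here so the rest of the sketch has no hard dependency.
    from googleapiclient.http import MediaFileUpload

    media = MediaFileUpload(
        filename,
        mimetype="application/octet-stream",
        chunksize=chunksize,
        resumable=True,  # send the file in pieces instead of one request
    )
    return service.objects().insert(
        bucket=bucket, name=object_name, media_body=media
    )


def run_resumable_upload(request, num_retries=3):
    """Drive a resumable request chunk by chunk until the server responds."""
    response = None
    while response is None:
        # next_chunk() returns (status, response); response stays None
        # until the final chunk has been accepted.
        status, response = request.next_chunk(num_retries=num_retries)
    return response
```

With this shape, the hook's upload method would call `run_resumable_upload(build_insert_request(...))` instead of calling `.execute()` on a request that carries the whole file in a single body.
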