Google Cloud Dataflow: FileNotFoundError in the temporary job location during a shuffle operation


When running a batch job in Google Cloud Dataflow, I hit an error at one specific step in the pipeline. The error states that a particular file no longer exists in the temporary job location I specified for this pipeline.

Below is the most relevant portion of the full stacktrace:

An exception was raised when trying to execute the workitem 2931621256965625980 : Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 490, in __init__
    metadata = self._get_object_metadata(self._get_request)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/retry.py", line 206, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 513, in _get_object_metadata
    return self._client.objects.Get(get_request)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1100, in Get
    download=download)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://www.googleapis.com/storage/v1/b/<CLOUD STORAGE PATH FOR TEMPORARY JOB FILES>%2F<DATAFLOW JOB NAME>.1571774420.011973%2Ftmp-626a66561e20e8b6-00000-of-00003.avro?alt=json>: response: <{'x-guploader-uploadid': 'AEnB2UrVuWRWrrcneEjgvuGSwYR82tBqDdVa727Ylo8tVW6ucnPdeNbE2A8DXf7mDYqKKP42NdJapXZLR1UbCjvJ8n7w2SOVTMGFsrcbywKD1K9yxMWez7k', 'content-type': 'application/json; charset=UTF-8', 'date': 'Tue, 22 Oct 2019 20:43:59 GMT', 'vary': 'Origin, X-Origin', 'cache-control': 'no-cache, no-store, max-age=0, must-revalidate', 'expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'pragma': 'no-cache', 'content-length': '473', 'server': 'UploadServer', 'status': '404'}>, content <{
  "error": {
    "code": 404,
    "message": "No such object: <CLOUD STORAGE PATH FOR TEMPORARY JOB FILES>/<DATAFLOW JOB NAME>.1571774420.011973/tmp-626a66561e20e8b6-00000-of-00003.avro",
    "errors": [
      {
        "message": "No such object: <CLOUD STORAGE PATH FOR TEMPORARY JOB FILES>/<DATAFLOW JOB NAME>.1571774420.011973/tmp-626a66561e20e8b6-00000-of-00003.avro",
        "domain": "global",
        "reason": "notFound"
      }
    ]
  }
}

You need to add stagingLocation or gcpTempLocation to resolve this error.
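As a minimal sketch of what that looks like with the Python SDK (where the equivalent flags are --staging_location and --temp_location), assuming placeholder project and bucket names that you would replace with your own:

```python
# Standard Beam/Dataflow option names; the project id and bucket paths
# below are placeholders, not values from the original question.
pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",                   # placeholder project id
    "--staging_location=gs://my-bucket/staging",  # placeholder bucket path
    "--temp_location=gs://my-bucket/temp",        # placeholder bucket path
]

# These flags would typically be handed to Beam when building the pipeline:
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(pipeline_args)
print(pipeline_args)
```

The same settings can also be passed directly on the command line when launching the pipeline script.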

For more details, see here [1]


1-

After much trial and error, I never really found an answer to this question. The root cause was Dataflow's Shuffle service: it seems that if a particular shuffle step is expensive enough, these kinds of intermittent connectivity issues eventually cause the job to error out.

I eventually worked around the problem by reprocessing the dataset to cut the amount of shuffling required roughly in half. The Shuffle service now runs reliably for me.
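One general way to reduce shuffle volume (an illustration of the idea, not necessarily the exact reprocessing I did) is to pre-aggregate on each worker before the group-by, in the spirit of Beam's CombinePerKey lifting. A plain-Python sketch with hypothetical worker partitions:

```python
from collections import Counter

def local_precombine(records):
    """Pre-aggregate on one worker so only one (key, partial_count)
    pair per distinct key crosses the shuffle, instead of every raw
    element. This mirrors what CombinePerKey lifting does in Beam."""
    return Counter(records)

# Two hypothetical worker partitions (placeholder data).
worker_a = ["user1", "user2", "user1"]
worker_b = ["user2", "user2"]

# Records that actually cross the shuffle boundary after pre-combining:
shuffled = (list(local_precombine(worker_a).items())
            + list(local_precombine(worker_b).items()))

# Final merge on the receiving side of the shuffle.
totals = Counter()
for key, partial in shuffled:
    totals[key] += partial

print(len(shuffled), dict(totals))  # 3 shuffled records instead of 5 raw elements
```

The fewer records a shuffle step has to move, the less exposure there is to the kind of intermittent Shuffle-service failures described above.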


Cloud Dataflow Shuffle is still an experimental feature; I hope this instability goes away as it matures.

Thanks, but I have already specified both. To clarify, this particular error occurs midway through or near the end of the Cloud Shuffle operation; the first half works perfectly fine. After the job fails, I can see leftover files in my
staging_location
temp_location
paths, which indicates Dataflow was initially able to access them correctly. Could you update your question with a reproducible code snippet?