
Google Cloud Platform: Error connecting to BigQuery from Dataproc with Datalab using the BigQuery Spark connector (Error getting access token from metadata server at ...)

Tags: google-cloud-platform, google-bigquery, google-cloud-dataproc

I have a BigQuery table, a Dataproc cluster, and Datalab, and I followed this guide:

The script works fine when I connect to a public dataset. However, when I try to connect to my private dataset, I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:210)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:75)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:82)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:102)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:90)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getBigQueryHelper(AbstractBigQueryInputFormat.java:357)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:108)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
    at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:203)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:587)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: metadata
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:159)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:208)
    ... 35 more
Some additional information:

- I am using Python (PySpark) through Datalab, which was installed via …
- The BigQuery data is located in the US, while the Dataproc cluster is in the EU
- The Dataproc image is the latest 1.2 version
- The Dataproc cluster is configured with Google-scoped API access
Judging from the error message you received, "Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token […] Caused by: java.net.UnknownHostException: metadata", the error itself looks accurate: the node is failing to resolve the metadata server host while fetching credentials.
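As a quick sanity check (my addition, not part of the original answer), you can query that same URL from one of the nodes. On a GCE VM the metadata server requires the Metadata-Flavor header, and a successful response confirms that the metadata hostname resolves and a token can be fetched:

curl -H "Metadata-Flavor: Google" \
    "http://metadata/computeMetadata/v1/instance/service-accounts/default/token"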

To simplify your use-case scenario, I suggest you first narrow down which of the products involved is failing, since the failure could be happening at different steps. To do so, I suggest running your PySpark code directly from the Dataproc cluster that is already running, as follows:

1. Go to the Dataproc > Clusters menu in the GCP Console.
2. Open the cluster you are using, then go to its VM Instances tab.
3. SSH into the master node by clicking the SSH button next to its name.
4. Create a script, words.py, containing the PySpark code you want to run (see the sketch below).
5. Run the script with the spark-submit words.py command.

Once you have done that, check whether you get the same error message. If you do, the problem should be on the Dataproc/BigQuery side; if not, it is most likely in Datalab. My guess is that you will get the same error message either way, since it looks like a credentials issue.
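For step 4, here is a minimal words.py sketch, adapted from the public-dataset example in Google's BigQuery connector documentation; YOUR_PROJECT_ID, YOUR_DATASET, and YOUR_TABLE are placeholders you would replace with the coordinates of your private table. Its take(10) call exercises the same newAPIHadoopRDD path that fails in your stack trace:

import json
import pprint
import pyspark

sc = pyspark.SparkContext()

# The connector stages BigQuery exports in GCS; reuse the bucket and
# project that Dataproc configures on the cluster.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    # Placeholders: point these at your private table.
    'mapred.bq.input.project.id': 'YOUR_PROJECT_ID',
    'mapred.bq.input.dataset.id': 'YOUR_DATASET',
    'mapred.bq.input.table.id': 'YOUR_TABLE',
}

# Each record arrives as (row key, JSON string); this is the call that
# raises the metadata-server IOException in the trace above.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

rows = table_data.map(lambda record: json.loads(record[1]))
pprint.pprint(rows.take(10))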

Once you have identified where the problem lies, and while logged into the cluster's master node, check which service account you are using by running the following command in a terminal:

gcloud auth list
Also, make sure that the GOOGLE_APPLICATION_CREDENTIALS environment variable is empty by running the command below. If it is empty, the VM instance running the node will use GCE's default service account, which should be the same account you got from gcloud auth list, since Dataproc runs on GCE instances. If it is not empty, the credentials file that this environment variable points to will be used instead. Whether to use the default or custom credentials is an implementation choice.

echo $GOOGLE_APPLICATION_CREDENTIALS
Once you know which service account is being used, go to the IAM tab in the Console and check whether that service account has the appropriate permissions.
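To see which roles that account holds, one option (a sketch; YOUR_PROJECT_ID and SA_EMAIL are placeholders for your project ID and the service account's email) is to filter the project's IAM policy from the same terminal:

gcloud projects get-iam-policy YOUR_PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:SA_EMAIL" \
    --format="table(bindings.role)"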

My guess is that the issue is related to the service account being used, possibly with GOOGLE_APPLICATION_CREDENTIALS pointing to the wrong location, so you should first make sure that your authentication configuration is correct. To do that, I would run the code directly from inside the master node, in order to simplify the use case and reduce the number of components involved.
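Following that reasoning, if the echo above shows the variable pointing at an unexpected key file, a quick experiment (my suggestion, assuming the words.py sketch from earlier) is to clear it in the SSH session and re-run the job, so that it falls back to the default GCE service account:

unset GOOGLE_APPLICATION_CREDENTIALS
spark-submit words.py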
