Pyspark java.lang.AbstractMethodError:com/ibm/stocator/fs/common/IStoreClient.setStocatorPath(Lcom/ibm/stocator/fs/common/StocatorPath;)V

Tags: pyspark, data-science-experience, ibm-cloud-storage, stocator

I am trying to access data on IBM COS from Data Science Experience, based on this.

First, I pick version 1.0.8 of stocator:

!pip install --user --upgrade pixiedust
import pixiedust
pixiedust.installPackage("com.ibm.stocator:stocator:1.0.8")
Restart the kernel, then:

access_key = 'xxxx'
secret_key = 'xxxx'
bucket = 'xxxx'
host = 'lon.ibmselect.objstor.com'

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "http://" + host)
hconf.set("fs.s3d.service.access.key", access_key)
hconf.set("fs.s3d.service.secret.key", secret_key)

file = 'mydata_file.tsv.gz'

inputDataset = "s3d://{}.service/{}".format(bucket, file)

lines = sc.textFile(inputDataset, 1)
lines.count()
However, this results in the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.AbstractMethodError: com/ibm/stocator/fs/common/IStoreClient.setStocatorPath(Lcom/ibm/stocator/fs/common/StocatorPath;)V
    at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:104)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:249)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:249)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:249)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:932)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:378)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:931)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:507)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:785)

Note: I got a different error the first time I tried to connect to IBM COS. That attempt is captured here:

Chris, I don't usually use 'http://' in the endpoint, and that works fine for me. Not sure whether that is the problem.

Here is how I access COS objects from a DSX notebook:

endpoint = "s3-api.dal-us-geo.objectstorage.softlayer.net"

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint",endpoint)
hconf.set("fs.s3d.service.access.key",Access_Key_ID)
hconf.set("fs.s3d.service.secret.key",Secret_Access_Key)

inputObject = "s3d://<bucket>.service/<file>"
myRDD = sc.textFile(inputObject,1)

DSX ships a version of stocator on the classpath of the Spark 2.0 and Spark 2.1 kernels. A version installed into your instance is likely to conflict with the pre-installed one.

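If you want to confirm which copy of stocator the kernel is actually loading, here is a minimal sketch of my own (not part of the original answers): it asks the notebook's JVM where the ObjectStoreFileSystem class was loaded from, and if that points at a user-installed jar rather than the one shipped with DSX, the two versions are probably clashing.

# Hypothetical check: locate the jar that provided the stocator filesystem class
cls = sc._jvm.java.lang.Class.forName("com.ibm.stocator.fs.ObjectStoreFileSystem")
jar_location = cls.getProtectionDomain().getCodeSource().getLocation()
print(jar_location.toString())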

There is no need to install stocator; it is already there. As Roland mentioned, a new installation will most likely clash with the pre-installed one and cause conflicts.

Try ibmos2spark:
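
If ibmos2spark is not already available in the notebook, a minimal sketch of pulling it in (assuming the package name ibmos2spark on PyPI) would be:

!pip install --user ibmos2spark
import ibmos2spark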


Let me know if you are still facing issues.

Do not force-install a new Stocator unless you have a very good reason.

I strongly recommend the Spark aaS documentation at:

Please pick the correct COS endpoint from:

If you are working inside IBM Cloud, use the private endpoints. They are faster and cheaper.
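
For illustration, using the hostnames that appear elsewhere in this thread (your region's endpoints will differ), the public and private variants of an endpoint look like this:

# Public endpoint, reachable from anywhere (from the earlier answer)
public_endpoint = "s3-api.dal-us-geo.objectstorage.softlayer.net"
# Private endpoint, reachable only from inside IBM Cloud; faster and cheaper, as noted above
private_endpoint = "s3-api.us-geo.objectstorage.service.networklayer.com"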

Here are some examples of how to access COS data using all the nice helpers. It boils down to:

import ibmos2spark

credentials = {
  'endpoint': 's3-api.us-geo.objectstorage.service.networklayer.com',  #just an example. Your url might be different
  'access_key': 'my access key',
  'secret_key': 'my secret key'
}
bucket_name = 'my bucket name'
object_name = 'mydata_file.tsv.gz'

cos = ibmos2spark.CloudObjectStorage(sc, credentials)
lines = sc.textFile(cos.url(object_name, bucket_name),1)
lines.count()

Note: the first time I ran the script from the question, I had not installed stocator with pixiedust. There was an error, but unfortunately I did not capture it while running against the pre-installed stocator. IIRC, the original error suggested that stocator was not installed, which is why I then went ahead and installed it with pixiedust. The issue has been investigated: the UK-dedicated DSX environment I was using did not have the correct version of stocator installed. I will retry once the dependencies have been fixed.