How can I use numpy arrays with PySpark's MLlib?
I have two numpy matrices, as follows:
Features:
(878049, 6)
<type 'numpy.ndarray'>
Labels:
(878049,)
<type 'numpy.ndarray'>
So my question is: do I need to convert the numpy arrays into an RDD, and if so, into what format should I convert the feature and label matrices so that I can fit them with MLlib's random forest implementation?

Update

Following the answer from @cafeed, I then tried this:
In [24]:
#CV
(trainingData, testData) = data.randomSplit([0.7, 0.3])
In [26]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import numpy as np
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=np.unique(y))
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
However, I get an exception:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-27-ded4b074521b> in <module>()
6 # Empty categoricalFeaturesInfo indicates all features are continuous.
7
----> 8 model = DecisionTree.trainClassifier(trainingData, numClasses=np.unique(y), categoricalFeaturesInfo={},impurity='gini', maxDepth=5, maxBins=32)
9
10 # Evaluate model on test instances and compute test error
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/tree.pyc in trainClassifier(cls, data, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
183 """
184 return cls._train(data, "classification", numClasses, categoricalFeaturesInfo,
--> 185 impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
186
187 @classmethod
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/tree.pyc in _train(cls, data, type, numClasses, features, impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
124 assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
125 model = callMLlibFunc("trainDecisionTreeModel", data, type, numClasses, features,
--> 126 impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
127 return DecisionTreeModel(model)
128
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/common.pyc in callMLlibFunc(name, *args)
128 sc = SparkContext._active_spark_context
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131
132
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
120 def callJavaFunc(sc, func, *args):
121 """ Call Java Function """
--> 122 args = [_py2java(sc, a) for a in args]
123 return _java2py(sc, func(*args))
124
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/common.pyc in _py2java(sc, obj)
86 else:
87 data = bytearray(PickleSerializer().dumps(obj))
---> 88 obj = sc._jvm.SerDe.loads(data)
89 return obj
90
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/sql/utils.pyc in deco(*a, **kw)
34 def deco(*a, **kw):
35 try:
---> 36 return f(*a, **kw)
37 except py4j.protocol.Py4JJavaError as e:
38 s = e.java_exception.toString()
/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
at org.apache.spark.mllib.api.python.SerDe$.loads(PythonMLLibAPI.scala:1462)
at org.apache.spark.mllib.api.python.SerDe.loads(PythonMLLibAPI.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
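My reading of the `PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)` is that a raw numpy object was sent to the JVM: `numClasses=np.unique(y)` passes a `numpy.ndarray`, which MLlib's SerDe cannot unpickle. Scalar arguments should be plain Python types (this is a diagnosis sketch, not from the traceback itself):

```python
import numpy as np

y = np.array([0, 1, 2, 1, 0])

bad = np.unique(y)        # numpy.ndarray -- pickles via multiarray._reconstruct
good = len(np.unique(y))  # plain Python int, safe to hand to the JVM-side SerDe

print(type(bad).__name__)   # ndarray
print(type(good).__name__)  # int
```

So the call would become `numClasses=len(np.unique(y))` rather than `numClasses=np.unique(y)`.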
The docs are clear. You need an RDD:
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>> import numpy as np
>>>
>>> np.random.seed(1)
>>> features = np.random.random((100, 10))
>>> labels = np.random.choice([0, 1], 100)
>>> data = sc.parallelize(zip(labels, features)).map(lambda x: LabeledPoint(x[0], x[1]))
>>> RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, numTrees=2)
TreeEnsembleModel classifier with 2 trees
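Connecting this back to the original Features (878049, 6) and Labels (878049,) arrays: the same pattern applies, zipping each label with its feature row before wrapping the pair in a LabeledPoint. A minimal sketch of just the pairing step, without a SparkContext (the data here is made up):

```python
import numpy as np

np.random.seed(1)
features = np.random.random((5, 6))   # stand-in for the (878049, 6) feature matrix
labels = np.random.choice([0, 1], 5)  # stand-in for the (878049,) label vector

# Each pair mirrors the x that LabeledPoint(x[0], x[1]) receives after
# sc.parallelize(zip(labels, features)) distributes the data.
pairs = [(float(lbl), row.tolist()) for lbl, row in zip(labels, features)]

print(len(pairs))        # 5
print(len(pairs[0][1]))  # 6
```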