How do I use NumPy matrices with PySpark MLlib?


I have two NumPy matrices, shaped as follows:

Features:
    (878049, 6)
    <type 'numpy.ndarray'>

Labels:
    (878049,)
    <type 'numpy.ndarray'>
So my question is: do I need to convert the NumPy arrays to an RDD, and if so, into what format should I convert the Features and Labels matrices so that they can be fitted with MLlib's Random Forest implementation?

Update: Based on @cafeed's answer, I tried the following:

In [24]:

# CV
(trainingData, testData) = data.randomSplit([0.7, 0.3])

In [26]:

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import numpy as np

# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=np.unique(y))

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
However, I got the following exception:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-27-ded4b074521b> in <module>()
      6 # Empty categoricalFeaturesInfo indicates all features are continuous.
      7 
----> 8 model = DecisionTree.trainClassifier(trainingData, numClasses=np.unique(y), categoricalFeaturesInfo={},impurity='gini', maxDepth=5, maxBins=32)
      9 
     10 # Evaluate model on test instances and compute test error

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/tree.pyc in trainClassifier(cls, data, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
    183         """
    184         return cls._train(data, "classification", numClasses, categoricalFeaturesInfo,
--> 185                           impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
    186 
    187     @classmethod

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/tree.pyc in _train(cls, data, type, numClasses, features, impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
    124         assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
    125         model = callMLlibFunc("trainDecisionTreeModel", data, type, numClasses, features,
--> 126                               impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
    127         return DecisionTreeModel(model)
    128 

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/common.pyc in callMLlibFunc(name, *args)
    128     sc = SparkContext._active_spark_context
    129     api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130     return callJavaFunc(sc, api, *args)
    131 
    132 

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
    120 def callJavaFunc(sc, func, *args):
    121     """ Call Java Function """
--> 122     args = [_py2java(sc, a) for a in args]
    123     return _java2py(sc, func(*args))
    124 

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/mllib/common.pyc in _py2java(sc, obj)
     86     else:
     87         data = bytearray(PickleSerializer().dumps(obj))
---> 88         obj = sc._jvm.SerDe.loads(data)
     89     return obj
     90 

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     34     def deco(*a, **kw):
     35         try:
---> 36             return f(*a, **kw)
     37         except py4j.protocol.Py4JJavaError as e:
     38             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
    at org.apache.spark.mllib.api.python.SerDe$.loads(PythonMLLibAPI.scala:1462)
    at org.apache.spark.mllib.api.python.SerDe.loads(PythonMLLibAPI.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

The documentation is clear. You need an RDD:

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>> import numpy as np
>>>
>>> np.random.seed(1)
>>> features = np.random.random((100, 10))
>>> labels = np.random.choice([0, 1], 100)
>>> data = sc.parallelize(zip(labels, features)).map(lambda x: LabeledPoint(x[0], x[1])) 
>>> RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, numTrees=2)
TreeEnsembleModel classifier with 2 trees
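
Applied to the matrices from the question, a minimal sketch could look like the one below. It assumes the original features (878049, 6) and labels (878049,) arrays and an active SparkContext sc. Note that numClasses is passed as a plain Python int rather than the array returned by np.unique(y): NumPy objects cannot be unpickled by the JVM-side SerDe, which is the likely cause of the PickleException in the traceback above.

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>> import numpy as np
>>>
>>> # features: (878049, 6) ndarray, labels: (878049,) ndarray from the question
>>> data = sc.parallelize(zip(labels, features)).map(lambda x: LabeledPoint(x[0], x[1]))
>>>
>>> # numClasses must be a plain int, not the ndarray returned by np.unique
>>> numClasses = len(np.unique(labels))
>>> (trainingData, testData) = data.randomSplit([0.7, 0.3])
>>> model = RandomForest.trainClassifier(trainingData, numClasses=numClasses,
...                                      categoricalFeaturesInfo={}, numTrees=10)

The same LabeledPoint RDD also works with DecisionTree.trainClassifier from the update above; only the numClasses argument needs to be a plain int.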
