
How to perform feature selection and reduction on LIBSVM files in Spark using Python?


I have several LIBSVM files with which I have to implement clustering in Spark using Python. The files are space-delimited: the first column holds the label [1 or -1] and all remaining columns are features in the [1:2.566] format. There are a lot of such columns, on which I want to perform feature selection [preferably with the ChiSquareTest model] and then a feature reduction step using PCA or SVD. However, I could not find a decent Python tutorial for implementing these steps in Spark.

I found a sample script online that implements the ChiSqTest in Python. I used the same logic to implement the model, but I was unable to get it working. Under the hypothesis-testing section of that link, the code parallelizes an RDD[LabeledPoint] before passing it to the ChiSqTest model. I tried the same logic in a different way and got different errors:

data = MLUtils.loadLibSVMFile(sc, "PATH/FILENAME.txt")
label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
obs = sc.parallelize(LabeledPoint(label,features))
This gave me an error stating TypeError: float() argument must be a string or a number.

Then I normalized the data with Normalizer() and tried the same thing, and got the same error. So I wrote a function that returns a LabeledPoint:

def parsepoint(line):
    values = [float(x) for x in line.split(' ')]
    return sc.parallelize(LabeledPoint(values[0],values[1:]))
parsedData = data.map(lambda x: parsepoint(x))
obs = sc.parallelize(parsedData)
This gave me an error stating that a PipelinedRDD is not suitable here. I tried several other approaches and everything ended in errors. Could someone tell me where I am going wrong? Also, I could not find a sample Python script for the feature reduction step with PCA or SVD. Any input on this would be very helpful.

Stack trace:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-1-8d0164c0957d> in <module>()
  10 sct = SparkContext()
  11 data = MLUtils.loadLibSVMFile(sct, "PATH")
  ---> 12 print data.take(1)
  13 #label = data.map(lambda x: x.label)
  14 #features = data.map(lambda x: x.features)

  SPARK_HOME\rdd.pyc in take(self, num)
  1263 
  1264 p = range(partsScanned, min(partsScanned + numPartsToTry,   totalParts))
 -> 1265 res = self.context.runJob(self, takeUpToNumLeft, p, True)
  1266 
  1267 items += res

  SPARK_HOME\context.pyc in runJob(self, rdd, partitionFunc, partitions, allowLocal)
   879         mappedRDD = rdd.mapPartitions(partitionFunc)
   880         port = self._jvm.PythonRDD.runJob(self._jsc.sc(),   mappedRDD._jrdd, partitions,
   --> 881 allowLocal)
   882         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
   883 
      SPARK\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
   536 answer = self.gateway_client.send_command(command)
   537 return_value = get_return_value(answer, self.gateway_client,
   --> 538 self.target_id, self.name)
   539 
   540  for temp_arg in temp_args:
    SPARK\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
   298 raise Py4JJavaError(
   299 'An error occurred while calling {0}{1}{2}.\n'.
   --> 300  format(target_id, '.', name), value)
   301   else:
   302   raise Py4JError(

  Py4JJavaError: An error occurred while calling       z:org.apache.spark.api.python.PythonRDD.runJob.
  : org.apache.spark.SparkException: Job aborted due to stage failure: Task   0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0   (TID 2, localhost): java.net.SocketException: Connection reset by peer: socket      write error
   at java.net.SocketOutputStream.socketWrite0(Native Method)
   at java.net.SocketOutputStream.socketWrite(Unknown Source)
   at java.net.SocketOutputStream.write(Unknown Source)
   at java.io.BufferedOutputStream.write(Unknown Source)
   at java.io.DataOutputStream.write(Unknown Source)
   at java.io.FilterOutputStream.write(Unknown Source)
   at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
   at   org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
   at  org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at   org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
   at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)

  Driver stacktrace:
   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
  at   scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

MLUtils.loadLibSVMFile returns an RDD[LabeledPoint], so you can pass its output directly to Statistics.chiSqTest:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.stat import Statistics

# loadLibSVMFile already produces an RDD[LabeledPoint]
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

# chiSqTest on an RDD[LabeledPoint] returns one ChiSqTestResult per feature
chiSqResults = Statistics.chiSqTest(data)

print chiSqResults[-1]  # result for the last feature
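
If you then want to use the chi-squared statistic to actually drop features, a minimal sketch using pyspark.mllib.feature.ChiSqSelector (available in more recent Spark releases; numTopFeatures=50 below is only an illustrative value, and the chi-squared test assumes discrete feature values):

from pyspark.mllib.feature import ChiSqSelector

# keep the 50 features most strongly associated with the label (illustrative value)
selector = ChiSqSelector(numTopFeatures=50)
model = selector.fit(data)

# transform() works on feature vectors, so strip the labels first
reducedFeatures = model.transform(data.map(lambda lp: lp.features))
print reducedFeatures.take(1)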
If you build the LabeledPoints yourself from raw text lines, return the LabeledPoint directly instead of wrapping it in sc.parallelize:

from pyspark.mllib.regression import LabeledPoint

def parsepoint(line):
    # split one space-delimited line into label and feature values
    values = line.split(" ")
    return LabeledPoint(values[0], values[1:])

# note: this assumes data is an RDD of raw text lines (e.g. from sc.textFile),
# not the RDD[LabeledPoint] produced by loadLibSVMFile above
parsedData = map(parsepoint, data.take(1))
firstFeatures = parsedData[0].features
firstLabel = parsedData[0].label
print firstFeatures
print firstLabel
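
The question also asks about PCA/SVD-based reduction. A minimal sketch using pyspark.mllib.feature.PCA (available in newer Spark releases; k=10 below is only an illustrative target dimensionality):

from pyspark.mllib.feature import PCA

# PCA operates on an RDD of feature vectors, so drop the labels first
features = data.map(lambda lp: lp.features)

# project onto the top 10 principal components (illustrative value)
pcaModel = PCA(10).fit(features)
projected = pcaModel.transform(features)
print projected.take(1)

For SVD, as far as I know, the distributed RowMatrix API exposes computeSVD on the Scala side, and its Python wrapper only became available in later Spark versions.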