
Apache Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?


I'm running a Bernoulli Naive Bayes using this code:

import org.apache.spark.mllib.classification.NaiveBayes

val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is how to get the membership probability of class 0 (or class 1) and how to compute the AUC. I would like to get a result similar to what I get with LogisticRegressionWithSGD or SVMWithSGD, where I use this code:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val numIterations = 100

val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()

Unfortunately, that code does not work for NaiveBayes.

Concerning the probabilities for Bernoulli NaiveBayes, here is an example:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Build dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1", "1,1 1 0"))
// Transform dummy data into LabeledPoint
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
// Class labels
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10).foreach(println)
// For one specific vector, take the first vector of parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC:

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
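
Note that labelAndPreds above pairs hard 0/1 predictions with the labels, so the ROC is built from already-thresholded outputs. If you would rather compute the AUC from the continuous scores, a minimal sketch (my own variation, not part of the original answer; it assumes a binary problem whose classes are 0.0 and 1.0) feeds P(y=1|x) into BinaryClassificationMetrics:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// model.labels is not guaranteed to be ordered, so look up the index of class 1.0
val idxOfOne = model.labels.indexOf(1.0)

// Pair P(y=1|x) with the true label for every test point
val scoreAndLabels = test.map { lp =>
  (model.predictProbabilities(lp.features)(idxOfOne), lp.label)
}

val probMetrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"AUC from probabilities: ${probMetrics.areaUnderROC()}")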
Regarding the question asked in chat:

val results = parsedData.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  val bestClass = probs.argmax
  (labels(bestClass), probs(bestClass))
}
results.take(10).foreach(println)
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: works with Spark 1.5+.

Edit: (for PySpark users)

It seems that some people are having trouble getting the probabilities with pyspark and mllib. That's normal: spark-mllib does not provide that function for pyspark.

Thus, you'll need to use the DataFrame-based API:

from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes

df = spark.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])

nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)

model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction                            |probability                             |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0  |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0       |
# |[0.0,1.0]|0.0  |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0       |
# |[1.0,0.0]|1.0  |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0       |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
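
For instance, a minimal sketch using BinaryClassificationEvaluator (here it scores the same toy df the model was trained on, only because it is the sole data in this example; in practice, evaluate on a held-out test set):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# areaUnderROC is computed from the rawPrediction column by default
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction",
    labelCol="label",
    metricName="areaUnderROC")

predictions = model.transform(df)  # use a held-out test set in practice
print(evaluator.evaluate(predictions))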


For more information about Naive Bayes in spark-ml, please refer to the official documentation.

OK, this is a two-in-one question. So which version of Spark are you using, and you want the probabilities of what, exactly? — Spark 1.5.0. I want p(Y=0|X); with that I can compute the AUC, right? Yes, it's a binary classification, and I'm using spark.mllib. — Thanks a lot! I modified it a bit and now I can get (label, P(y=0|x)) with: val results = test.map { lp => val probs: Vector = model.predictProbabilities(lp.features); val MyList = List.range(0, probs.size - 1, 2); (for (i …
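
Since that last comment is cut off, here is a minimal sketch of the (label, P(y=0|x)) pairing it describes (my own reconstruction, assuming a binary problem whose classes are 0.0 and 1.0):

// Pair each true label with P(y=0|x)
val idxOfZero = model.labels.indexOf(0.0)
val labelAndProbZero = test.map { lp =>
  (lp.label, model.predictProbabilities(lp.features)(idxOfZero))
}
labelAndProbZero.take(10).foreach(println)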