Scala MLlib classification example stalls at Stage 1


EDIT:

I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing HashingTF to numFeatures = 9, then 13, and then creating a separate one for each. The program then stalls at "count at DataValidators.scala:38", same as before.

Completed Jobs (4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData))

1) Why are the features counted per line, when in the code they are split by spaces (" ")? (See the sketch after this list.)

2) Two things I notice differ between my code and Gabriel's: a) I don't have anything involving a logger, but that shouldn't be a problem...
b) My files are on HDFS (hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt); again, that shouldn't be a problem, but I'm not sure whether I'm missing something

3) How long should I let it run? I have let it run for at least 10 minutes with: local[2]
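
For question 1, here is a minimal sketch (my own illustration, not part of the original post) of why the counts track lines rather than words: textFile yields one RDD element per line, map turns each line into exactly one feature vector, and count() counts those vectors, not the words inside them.

// Minimal sketch: count() follows lines (emails), not words.
// Assumes an existing SparkContext `sc`, as in the full example further down.
import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 100)
val spam = sc.textFile("spam.txt")   // one RDD element per line, i.e. per email
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
// split(" ") only decides which words feed into each email's single vector;
// the RDD still has one element per email, so count() equals the line count.
println(spamFeatures.count())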

At this point I'm guessing there may be some kind of problem with my Spark/MLlib setup. Is there a simpler program I could run to check whether MLlib has a setup issue? I have been able to run other Spark Streaming/SQL jobs.

Thanks

[Reposted from the Spark community]

Hi all,

I'm trying to run this MLlib example from Learning Spark:

Things I'm doing differently:

1) Instead of spam.txt and normal.txt, I'm using text files of about 200 words each... nothing big, just plain text with periods, commas, etc.

3) I used numFeatures = 200, 1000, and 10,000 (see the sketch below for what this parameter controls)
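
For context, this is roughly what numFeatures controls (a minimal sketch of my own, not from the book): HashingTF hashes every word to an index in [0, numFeatures), so the value fixes the length of each feature vector, and values that are too small cause hash collisions between different words.

import org.apache.spark.mllib.feature.HashingTF

// numFeatures fixes the vector length; each word hashes into [0, numFeatures).
val tf = new HashingTF(numFeatures = 1000)
val vec = tf.transform("Get cheap stuff right now".split(" "))
println(vec.size)   // 1000, regardless of how many words the line contains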

Error: I keep getting stuck when trying to run the model (per the web UI details below):

val model = new LogisticRegressionWithSGD().run(trainingData)

It freezes on something like this:

[Stage 1:=============> (1 + 0) / 4]

Some details from the web UI:

org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

I don't know what I'm doing wrong... any help is much appreciated, thanks!

Thanks for your question. I wasn't aware of these examples, so I downloaded and tested them. What I see is that the git repository contains files with a lot of HTML code in them; it works, but ends up adding 100 features, which is probably why you can't get consistent results, since your own files contain far fewer features. I tested this without the HTML code, removing it from spam.txt and ham.txt so they look like this:

ham.txt=

Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!       
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you  
the package.  I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to  
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in  
advance for your help!  I tried running ...
Thanks Tom for your email.  I need to refer you to Alice for this one.  I    
haven't yet figured out that part either ...
Good job yesterday!  I was attending your talk, and really enjoyed it.  I   
want to try out GraphX ...
Summit demo got whoops from audience!  Had to let you know. --Joe
spam.txt=

 Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to 
 send you money via wire transfer so please ...
 Get Viagra real cheap!  Send money right away to ...
 Oh my gosh you can be really strong too with these drugs found in the     
 rainforest. Get them cheap right now ...
 YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to    
 this email with your password and SSN ...
 THIS IS NOT A SCAM!  Send money and get access to awesome stuff really   
 cheap and never have to ...
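
As an aside, a programmatic alternative to hand-editing the files (a sketch of my own, not what was actually done here) would be to strip the tags when loading:

// Hedged sketch: strip simple HTML tags at load time instead of editing the
// files by hand. A regex is a rough heuristic, not a real HTML parser.
val spam = sc.textFile("spam.txt").map(_.replaceAll("<[^>]*>", " "))
val ham  = sc.textFile("ham.txt").map(_.replaceAll("<[^>]*>", " "))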
Then use the modified MLlib.scala below, and make sure log4j is referenced in your project so the output is redirected to a file instead of the console. You basically need to run it twice: on the first run, look at the output for the printed # of features in spam and ham; then you can set the correct # of features (instead of 100), which is what I did:

package com.oreilly.learningsparkexamples.scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger

object MLlib {

  private val logger = Logger.getLogger("MLlib")

  def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory","1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)
    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")
    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println ("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println ("features in ham " + ham.count())
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ...";
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")
    sc.stop()
  }
}
When I run it, the output I get is:

features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money    
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the    
other ... : 0.0
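
If you want to avoid the two-run workflow, a hedged alternative (my own sketch, not the approach described above) is to size numFeatures from the data in a single pass by counting distinct tokens:

// Hedged sketch: derive numFeatures from the number of distinct tokens rather
// than running once just to read the printed counts. Assumes `sc`, spam.txt
// and ham.txt as in the example above.
val spam = sc.textFile("spam.txt")
val ham = sc.textFile("ham.txt")
val vocabSize = (spam ++ ham).flatMap(_.split(" ")).distinct().count().toInt
val tf = new HashingTF(numFeatures = vocabSize)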

I had the same problem with Spark 1.5.2 on my local cluster; my program stopped at "count at DataValidators.scala:40". It was resolved by running Spark as "spark-submit --master local".

I hit a similar issue with Spark 1.5.2 on a local cluster, also stopping at "count at DataValidators.scala:40". I was caching my training features; removing the cache (simply not calling the cache function) resolved the problem, though I'm not sure of the actual cause.
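
For concreteness, the workaround in the second answer amounts to this one-line change to the example above (a sketch only; caching is normally recommended for iterative algorithms, so treat this as a diagnostic step):

    val trainingData = positiveExamples ++ negativeExamples
    // trainingData.cache()  // workaround: skip caching; on the affected setup
    //                       // this avoided the stall, root cause unknown
    val model = new LogisticRegressionWithSGD().run(trainingData)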

Thanks Gabriel for the detailed explanation. A few follow-up questions, since I still can't run the program... would you mind checking the edits in my main question, since it's hard to ask in comments? Much appreciated.

Could you explain the logic behind that? From what I understand, MLlib's iterative algorithms are supposed to run better when the data is cached.