
Apache Spark: Spark Word2Vec example using the text8 file


I'm trying to run this example from apache.spark.org (code below & the whole tutorial is here: ) using the text8 file they reference on the site ( ):

Whenever I try to fit the model I keep getting Java heap errors. I get the same result in Python. I also increased the Java memory size via the Java options.

The file is only 100MB, so I suspect my memory settings are off, but I'm not sure that's the root cause.

Has anyone else tried this example on a laptop?

I can't put the file on our company servers because we're not supposed to import external data, so I'm stuck working on my personal laptop. If you have any suggestions, I'd love to hear them. Thanks.

First of all, I'm new to Spark, so others may well have faster or better solutions.
I ran into the same difficulties running this example code.
I managed to make it work, mainly by:

  • Running my own Spark cluster on my machine, using the start scripts in the /sbin/ directory of the Spark installation. To do so you have to configure the conf/spark-env.sh file as needed; Spark must not use the 127.0.0.1 IP.
  • Compiling and packaging the Scala code as a jar (sbt package), then providing it to the cluster (see addJar(...) in the Scala code). It also seems possible to provide the compiled code to Spark via the classpath / extra classpath, but I haven't tried that yet. (A build.sbt sketch follows the Scala code below.)
  • Setting the executor memory and driver memory (see the Scala code).
  • spark-env.sh:

    export SPARK_MASTER_IP=192.168.1.53
    export SPARK_MASTER_PORT=7077
    export SPARK_MASTER_WEBUI_PORT=8080
    
    export SPARK_DAEMON_MEMORY=1G
    # Worker : 1 per server
    # Number of worker instances to run on each machine (default: 1).
    # You can make this more than 1 if you have very large machines and would like multiple Spark worker processes.
    # If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, 
    # or else each worker will try to use all the cores.
    export SPARK_WORKER_INSTANCES=2
    # Total number of cores to allow Spark applications to use on the machine (default: all available cores).
    export SPARK_WORKER_CORES=7
    
    #Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g 
    # (default: total memory minus 1 GB); 
    # note that each application's individual memory is configured using its spark.executor.memory property.
    export SPARK_WORKER_MEMORY=8G
    export SPARK_WORKER_DIR=/tmp
    
    # Executor : 1 per application run on the server
    # export SPARK_EXECUTOR_INSTANCES=4
    # export SPARK_EXECUTOR_MEMORY=4G
    
    export SPARK_SCALA_VERSION="2.10"
    
    Scala file to run the example:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf
    import org.apache.log4j.Logger
    import org.apache.log4j.Level
    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    
    object SparkDemo {
    
      def log[A](key:String)(job : =>A) = {
        val start = System.currentTimeMillis
        val output = job
        println("===> %s in %s seconds"
          .format(key, (System.currentTimeMillis - start) / 1000.0))
        output
      }
    
      def main(args: Array[String]):Unit ={
    
        val modelName ="w2vModel"
    
        val sc = new SparkContext(
          new SparkConf()
          .setAppName("SparkDemo")
          .set("spark.executor.memory", "8G")
          .set("spark.driver.maxResultSize", "16G")
          .setMaster("spark://192.168.1.53:7077") // ip of the spark master.
          // .setMaster("local[2]") // does not work... workers loose contact with the master after 120s
        )
    
        // take a look into target folder if you are unsure how the jar is named
        // onliner to compile / run : sbt package && sbt run
        sc.addJar("./target/scala-2.10/sparkling_2.10-0.1.jar")
    
        val input = sc.textFile("./text8").map(line => line.split(" ").toSeq)
    
        val word2vec = new Word2Vec()
    
        val model = log("compute model") { word2vec.fit(input) }
        log ("save model") { model.save(sc, modelName) }
    
        val synonyms = model.findSynonyms("china", 40)
        for((synonym, cosineSimilarity) <- synonyms) {
          println(s"$synonym $cosineSimilarity")
        }
    
        val model2 = log("reload model") { Word2VecModel.load(sc, modelName) }
      }
    }
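
The code above is packaged with sbt package and shipped to the cluster via sc.addJar, but the answer doesn't show the build definition. A minimal build.sbt that would produce a jar named target/scala-2.10/sparkling_2.10-0.1.jar could look like the sketch below; the project name and the Spark version are assumptions, not part of the original answer (1.4.1 is the version mentioned later in the thread, adjust to your install):

    // Hypothetical build.sbt: only the jar name (sparkling_2.10-0.1.jar) and Scala 2.10
    // come from the answer above; everything else is an assumed example.
    name := "sparkling"

    version := "0.1"

    scalaVersion := "2.10.6"

    // Spark is left in the default (compile) scope so that `sbt run`, mentioned in the
    // code comments, also works; mark these "provided" if you only ever use sbt package.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.4.1",
      "org.apache.spark" %% "spark-mllib" % "1.4.1"
    )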
    
sc.textFile splits only on newlines, and text8 contains no newlines, so you are creating a one-line RDD. The .map(line => line.split(" ").toSeq) then just turns it into another one-line RDD, this time of type RDD[Seq[String]].
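
A quick way to see this in the spark-shell (a sketch; it assumes the same ./text8 path and the sc provided by the shell):

    // text8 contains no newline characters, so textFile yields a single record.
    val raw = sc.textFile("./text8")
    println(raw.count())                    // prints 1: the whole corpus is one "line"
    println(raw.first().split(" ").length)  // the number of words crammed into that single record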


Word2Vec works best with one sentence per row of the RDD (which should also avoid the Java heap errors). Unfortunately text8 has had the periods stripped out, so you can't just split on them; but you can find the original version along with the perl script used to produce it, and editing that script so it doesn't remove periods is not difficult.
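
Alternatively, instead of regenerating the corpus, you can break the single line into fixed-size chunks of words before fitting. A minimal sketch of that workaround (the 1000-word chunk size and the local ./text8 read are arbitrary assumptions, and sc is an existing SparkContext):

    // Chunk text8 into small "sentences" so Word2Vec sees many short rows
    // instead of one enormous row.
    import org.apache.spark.mllib.feature.Word2Vec

    // text8 is one huge space-separated line; read it on the driver and regroup the words.
    val words = scala.io.Source.fromFile("./text8").mkString.trim.split(" ")
    val sentences = words.grouped(1000).map(_.toSeq).toSeq

    val input = sc.parallelize(sentences)

    val model = new Word2Vec().fit(input)
    model.findSynonyms("china", 10).foreach { case (w, sim) => println(s"$w $sim") }

Keeping each row to a bounded number of words gives the small-row shape the answer above recommends and avoids materializing a single 100MB record in one task.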

The problem with this file is that it sits on a single line, which means you are trying to stuff the whole thing into one data field. Doesn't that defeat tokenization? .map(line => line.split(" ").toSeq) doesn't tokenize anything meaningful; maybe splitting on a period would be more expressive. I ran into the same thing: I tried splitting this file into lines, but got the same error. I'm using Spark 1.4.1.