
(String, String) in Scala Apache Spark


I have a file where every line contains (Stringx, Stringy). I want to find the occurrences of Stringy across the entire dataset. The code I have managed so far is:

val file = sc.textFile("s3n://bucket/test.txt") // RDD[ String ]
val splitRdd = file.map(line => line.split("\t"))
    // RDD[ Array[ String ] ]
val yourRdd = splitRdd.flatMap(arr => {
      val title = arr(0)
      val text = arr(1)
      val words = text.split(" ")
      words.map(word => (word, title))
    })
    // RDD[ ( String, String ) ]

scala> val s = yourRdd.map(word => ((word, scala.math.log(N/(file.filter(_.split("\t")(1).contains(word.split(",")(1))).count)))))
<console>:31: error: value split is not a member of (String, String)
       val s = yourRdd.map(word => ((word, scala.math.log(N/(file.filter(_.split("\t")(1).contains(word.split(",")(1))).count)))))

It is looking for a value split on word (which would have to be a def split member). However, word is not a String; it is a (String, String), and tuples have no split method. I believe you meant to do word._1.split(",")(1), and the command becomes:

val s = yourRdd.map(word => (word, scala.math.log(N / file.filter(_.split("\t")(1).contains(word._1.split(",")(1))).count)))
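
As a side note (my addition, not part of the original answer), the two fields of the tuple can also be bound by name with a pattern match instead of the _1/_2 accessors, which often reads more clearly:

// A minimal sketch over the same RDD[(String, String)]: destructure the
// (word, title) pair instead of going through _1 / _2.
val wordsOnly  = yourRdd.map { case (word, title) => word  }   // RDD[String]
val titlesOnly = yourRdd.map { case (word, title) => title }   // RDD[String]
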
EDIT:

With maasg's clear answer to the real underlying problem, I found that I need to count the unique instances of a word within each title. I would upvote maasg's answer, but I don't have enough rep yet :(


As mentioned in the comments, using one RDD "nested" inside a closure over another RDD is not possible. This requires a change of strategy. Assuming each title is unique, and trying to keep the same lines as the original question, this could be an alternative that removes the need for the nested RDD computation:

val file = sc.textFile("s3n://bucket/test.txt") // RDD[ String ]
val wordByTitle = file.flatMap{line => 
    val split = line.split("\t")
    val title = split(0)
    val words = split(1).split(" ")
    words.map(w=> (w,title))
}

// we want the count of documents in which a word appears, 
// this is equivalent to counting distinct (word, title) combinations.
// note that replacing the title by a hash would save considerable memory
val uniqueWordPerTitle = wordByTitle.distinct()

// now we can calculate the word frequency across documents
val tdf = uniqueWordPerTitle.map{case (w, title) => (w,1)}.reduceByKey(_ + _)

// and the inverse document frequency per word.
val idf = tdf.map{case (word,freq) => (word, scala.math.log(N/freq))}
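
N is not defined in the snippet above; it is the total number of documents. Assuming each input line is one document, a minimal sketch for obtaining it from the data itself (my addition; the val N would need to be defined before the idf line above) could be:

// N = total number of documents; with one document per input line the line
// count is enough. Using a Double also avoids integer division in N/freq.
val N = file.count().toDouble

// quick look at a few of the resulting (word, idf) pairs
idf.take(10).foreach(println)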

The immediate problem is that the element type of yourRdd is (String, String), so inside yourRdd.map(word => ...) the type of word is (String, String). Beyond that error, the approach still won't work, because you are trying to use the file RDD inside the map, and that is not supported. What are you actually trying to do? This looks like TF-IDF, right?

@maasg Yes! You are right! But I want to do it without using MLlib.

@maasg: Awesome. How did you come to that conclusion?

"RDD transformations and actions can only be invoked by the driver" - as I mentioned, this approach cannot work - you need a different strategy. Consider counting all the words from the text first. That gives you an RDD[(word, count)] which you can then join with the RDD[(title, word)] on the word.

But (String, String) does not take parameters, so that won't work. file is an RDD, and RDDs cannot be used inside a closure, because no serialization is defined for RDDs.

Your code is correct, but there seems to be a problem with my file.filter(..) code, because it appears to run in an infinite loop. I am trying to count the number of lines in the file that contain the word. Could you help me with that?

Updated my post, see above.

Sorry, I naively assumed the question was only about the error. Will update.
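
One of the comments above describes this strategy in words: count the words first to get an RDD[(word, count)] and then join it with the RDD of (word, title) pairs. In terms of the variables from the answer above, a rough sketch of that join (my addition, not code from the thread) could be:

// wordByTitle is already keyed by word, and tdf holds the per-word document
// count, so the two RDDs can be joined directly on the word.
val wordTitleWithCount = wordByTitle.join(tdf)   // RDD[(String, (String, Int))]
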
import org.apache.spark.rdd.RDD

val sc = SparkApplicationContext.coreCtx
val N = 20
var rdd: RDD[String] = sc.parallelize(Seq("t1\thi how how you,you", "t1\tcat dog,cat,mouse how you,you"))
val splitRdd: RDD[Array[String]] = rdd.map(line => line.split("\t"))

// Unique words per title (per line), then reduced by word into a count
val wordCountRdd = splitRdd.flatMap(arr =>
  arr(1).split(" |,").distinct // Including a comma because you seem to split on it later on too, but I don't think you actually need to
    .map(word => (word, 1))
).reduceByKey{case (cumm, one) => cumm + one}

val s: RDD[(String, Double)] = wordCountRdd.map{ case (word, freq) => (word, scala.math.log(N / freq)) }
s.collect().map(x => println(x._1 + ", " + x._2))
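
One small detail worth flagging (my observation, not part of the answer): N and freq are both Int, so N / freq is integer division, e.g. 20 / 3 evaluates to 6 before the log is taken. If a fractional ratio is intended, widening one operand avoids that:

// Same computation, but with the N/freq ratio taken in Double arithmetic.
val sExact: RDD[(String, Double)] =
  wordCountRdd.map { case (word, freq) => (word, scala.math.log(N.toDouble / freq)) }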