(String, String) in Scala / Apache Spark
I have a file in which each line contains (Stringx, Stringy).
I want to find the number of occurrences of Stringy across the whole dataset.
The code I have managed so far is:
val file = sc.textFile("s3n://bucket/test.txt") // RDD[ String ]
val splitRdd = file.map(line => line.split("\t"))
// RDD[ Array[ String ] ]
val yourRdd = splitRdd.flatMap(arr => {
val title = arr(0)
val text = arr(1)
val words = text.split(" ")
words.map(word => (word, title))
})
// RDD[ ( String, String ) ]
scala> val s = yourRdd.map(word => ((word, scala.math.log(N/(file.filter(_.split("\t")(1).contains(word.split(",")(1))).count)))))
<console>:31: error: value split is not a member of (String, String)
val s = yourRdd.map(word => ((word, scala.math.log(N/(file.filter(_.split("\t")(1).contains(word.split(",")(1))).count)))))
It is looking for a value "split" on word (which would have to be a "def split" member). However, word is not a String; it is a (String, String), and tuples have no split method. I believe you meant to do
word._1.split(",")(0)
and the command becomes:
val s = yourRdd.map(word => (word, scala.math.log(N / file.filter(_.split("\t")(1).contains(word._1.split(",")(1))).count)))
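To see why the original call failed, here is a minimal plain-Scala illustration (no Spark; the sample pair below is made up): a Tuple2 has no split method, but the String inside it does, reached via ._1.

```scala
// `word` stands in for one element of yourRdd: a (word, title) pair.
val word: (String, String) = ("dog,cat", "title1")

// word.split(",") would not compile: Tuple2 has no `split` member.
// The first String inside the tuple does, reached via `._1`:
val firstToken = word._1.split(",")(0)  // "dog"
val secondToken = word._1.split(",")(1) // "cat"
```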
Edit:
Thanks to maasg's clear answer to the real underlying problem, I found that I need to count the unique instances of a word per title. I would upvote maasg's answer, but I don't have enough reputation yet :(
As mentioned in the comments, "nesting" an RDD inside a closure over another RDD is not possible. It requires a change of strategy. Assuming that each title is unique, and trying to stay close to the lines of the original question, this could be an alternative that removes the need for the nested RDD computation:
val file = sc.textFile("s3n://bucket/test.txt") // RDD[ String ]
val wordByTitle = file.flatMap{line =>
val split = line.split("\t")
val title = split(0)
val words = split(1).split(" ")
words.map(w=> (w,title))
}
// we want the count of documents in which a word appears,
// this is equivalent to counting distinct (word, title) combinations.
// note that replacing the title by a hash would save considerable memory
val uniqueWordPerTitle = wordByTitle.distinct()
// now we can calculate the word frequency across documents
val tdf = uniqueWordPerTitle.map{case (w, title) => (w,1)}.reduceByKey(_ + _)
// and the inverse document frequency per word.
val idf = tdf.map{case (word,freq) => (word, scala.math.log(N/freq))}
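The snippet above needs a Spark context to run. As a sanity check, the same distinct-then-count logic can be sketched with plain Scala collections (no Spark; the sample lines and N below are made up, and groupBy stands in for reduceByKey):

```scala
// Two sample "lines" in the title<TAB>text format assumed by the answer.
val lines = Seq("t1\thi how how you", "t2\thow are you")
val N = lines.size.toDouble // total number of documents

// (word, title) pairs, one per word occurrence
val wordByTitle = lines.flatMap { line =>
  val split = line.split("\t")
  val title = split(0)
  split(1).split(" ").map(w => (w, title))
}

// distinct (word, title) pairs, then count per word:
// the number of documents each word appears in
val df = wordByTitle.distinct.groupBy(_._1).map { case (w, pairs) => (w, pairs.size) }

// inverse document frequency per word
val idf = df.map { case (w, freq) => (w, math.log(N / freq)) }
```

Here "how" appears in both documents, so its document frequency is 2 and its idf is log(2/2) = 0.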
The immediate problem is that the element type of yourRdd is (String, String), so in yourRdd.map(word => ...) the type of word is (String, String). Beyond that syntax error, the approach will not work, because you are trying to map over the file RDD inside another RDD's closure, which is not supported. What are you actually trying to do? This looks like TF-IDF, right? – maasg
@maasg Yes! You are right! But I want to do this without using MLlib.
@maasg: Awesome. How did you arrive at that conclusion?
"RDD transformations and actions can only be invoked by the driver" – as I mentioned, this approach will not work; you need a different strategy. Consider first counting all the words from the text. That gives you an RDD[word, count], which you can join with the RDD[title, word]. – maasg
But (String, String) does not take parameters, so word(0) will not work. file is an RDD, and RDDs cannot be used inside a closure, because no serialization is defined for RDDs.
Your code is correct, but there seems to be a problem with my file.filter(..) code, since it appears to run in an infinite loop. I am trying to count the number of lines in the file that contain the word. Can you help me with this? I updated my post, see above.
Sorry, I naively assumed the question was about the error. Will update. – maasg
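The strategy suggested in the comments (count all words first, then join the counts back to the (title, word) pairs) can be sketched with plain Scala collections (no Spark; the sample data is made up, groupBy stands in for reduceByKey, and the map lookup stands in for an RDD join):

```scala
// Sample (title, word) pairs, one per word occurrence.
val pairs = Seq(("t1", "hi"), ("t1", "how"), ("t2", "how"))

// First compute a global count per word...
val counts = pairs.groupBy(_._2).map { case (w, ps) => (w, ps.size) }

// ...then join the count back onto each (title, word) pair,
// all on the driver side, with no RDD nested inside a closure.
val joined = pairs.map { case (title, w) => (title, w, counts(w)) }
```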
val sc = SparkApplicationContext.coreCtx
val N = 20
var rdd: RDD[String] = sc.parallelize(Seq("t1\thi how how you,you", "t1\tcat dog,cat,mouse how you,you"))
val splitRdd: RDD[Array[String]] = rdd.map(line => line.split("\t"))
// Unique words per title, then reduced by word into a count
val wordCountRdd = splitRdd.flatMap(arr =>
arr(1).split(" |,").distinct // Including the comma because you seem to split on it later too, but I don't think you actually need to
.map(word => (word, 1))
).reduceByKey{case (cumm, one) => cumm + one}
val s: RDD[(String, Double)] = wordCountRdd.map{ case (word, freq) => (word, scala.math.log(N / freq)) }
s.collect().map(x => println(x._1 + ", " + x._2))
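As a quick check of the per-line split(" |,").distinct logic used above, the same computation can be run on plain Scala collections (no Spark; same sample data as the answer):

```scala
// Same sample lines as the answer above, in title<TAB>text format.
val docs = Seq("t1\thi how how you,you", "t1\tcat dog,cat,mouse how you,you")

// Split each text on space or comma, dedupe within the line,
// then count in how many lines each word appears.
val wordCounts = docs
  .map(_.split("\t"))
  .flatMap(arr => arr(1).split(" |,").distinct.map(w => (w, 1)))
  .groupBy(_._1)
  .map { case (w, ones) => (w, ones.size) }
```

Note that String.split takes a regex, so " |," matches a single space or a single comma.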