(String, String) in Scala / Apache Spark
I have a file in which each line contains (Stringx, Stringy).
I want to find the number of occurrences of Stringy across the whole dataset.
The code I have managed so far is:
val file = sc.textFile("s3n://bucket/test.txt") // RDD[ String ]
val splitRdd = file.map(line => line.split("\t"))
// RDD[ Array[ String ] ]
val yourRdd = splitRdd.flatMap(arr => {
val title = arr(0)
val text = arr(1)
val words = text.split(" ")
words.map(word => (word, title))
})
// RDD[ ( String, String ) ]
scala> val s = yourRdd.map(word => ((word, scala.math.log(N/(file.filter(_.split("\t")(1).contains(word.split(",")(1))).count)))))
<console>:31: error: value split is not a member of (String, String)
val s = yourRdd.map(word => ((word, scala.math.log(N/(file.filter(_.split("\t")(1).contains(word.split(",")(1))).count)))))
It is looking for a value "split" on word (which would have to be a "def split" member). However, word is not a String; it is a (String, String), and tuples have no split method. I believe you meant to do
word._1.split(",")(0)
and the command becomes:
val s = yourRdd.map(word => (word, scala.math.log(N / file.filter(_.split("\t")(1).contains(word._1.split(",")(1))).count)))
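To see why the original call failed, here is a minimal plain-Scala illustration (no Spark; the sample pair below is made up): a Tuple2 has no split method, but the String inside it does, reached via ._1.

```scala
// `word` stands in for one element of yourRdd: a (word, title) pair.
val word: (String, String) = ("dog,cat", "title1")

// word.split(",") would not compile: Tuple2 has no `split` member.
// The first String inside the tuple does, reached via `._1`:
val firstToken = word._1.split(",")(0)  // "dog"
val secondToken = word._1.split(",")(1) // "cat"
```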
Edit:
Thanks to maasg's clear answer to the real underlying problem, I found that I need to count the unique instances of a word per title. I would upvote maasg's answer, but I don't have enough reputation yet :(
As mentioned in the comments, "nesting" an RDD inside a closure over another RDD is not possible. It requires a change of strategy. Assuming that each title is unique, and trying to stay close to the lines of the original question, this could be an alternative that removes the need for the nested RDD computation:
val file = sc.textFile("s3n://bucket/test.txt") // RDD[ String ]
val wordByTitle = file.flatMap{line =>
val split = line.split("\t")
val title = split(0)
val words = split(1).split(" ")
words.map(w=> (w,title))
}
// we want the count of documents in which a word appears,
// this is equivalent to counting distinct (word, title) combinations.
// note that replacing the title by a hash would save considerable memory
val uniqueWordPerTitle = wordByTitle.distinct()
// now we can calculate the word frequency across documents
val tdf = uniqueWordPerTitle.map{case (w, title) => (w,1)}.reduceByKey(_ + _)
// and the inverse document frequency per word.
val idf = tdf.map{case (word,freq) => (word, scala.math.log(N/freq))}
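The snippet above needs a Spark context to run. As a sanity check, the same distinct-then-count logic can be sketched with plain Scala collections (no Spark; the sample lines and N below are made up, and groupBy stands in for reduceByKey):

```scala
// Two sample "lines" in the title<TAB>text format assumed by the answer.
val lines = Seq("t1\thi how how you", "t2\thow are you")
val N = lines.size.toDouble // total number of documents

// (word, title) pairs, one per word occurrence
val wordByTitle = lines.flatMap { line =>
  val split = line.split("\t")
  val title = split(0)
  split(1).split(" ").map(w => (w, title))
}

// distinct (word, title) pairs, then count per word:
// the number of documents each word appears in
val df = wordByTitle.distinct.groupBy(_._1).map { case (w, pairs) => (w, pairs.size) }

// inverse document frequency per word
val idf = df.map { case (w, freq) => (w, math.log(N / freq)) }
```

Here "how" appears in both documents, so its document frequency is 2 and its idf is log(2/2) = 0.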
The immediate problem is that the element type of yourRdd is (String, String), so in yourRdd.map(word => ...) the type of word is (String, String). Beyond that syntax error, the approach will not work, because you are trying to map over the file RDD inside another RDD's closure, which is not supported. What are you actually trying to do? This looks like TF-IDF, right? – maasg
@maasg Yes! You are right! But I want to do this without using MLlib.
@maasg: Awesome. How did you arrive at that conclusion?
"RDD transformations and actions can only be invoked by the driver" – as I mentioned, this approach will not work; you need a different strategy. Consider first counting all the words from the text. That gives you an RDD[word, count], which you can join with the RDD[title, word]. – maasg
But (String, String) does not take parameters, so word(0) will not work. file is an RDD, and RDDs cannot be used inside a closure, because no serialization is defined for RDDs.
Your code is correct, but there seems to be a problem with my file.filter(..) code, since it appears to run in an infinite loop. I am trying to count the number of lines in the file that contain the word. Can you help me with this? I updated my post, see above.
Sorry, I naively assumed the question was about the error. Will update. – maasg
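The strategy suggested in the comments (count all words first, then join the counts back to the (title, word) pairs) can be sketched with plain Scala collections (no Spark; the sample data is made up, groupBy stands in for reduceByKey, and the map lookup stands in for an RDD join):

```scala
// Sample (title, word) pairs, one per word occurrence.
val pairs = Seq(("t1", "hi"), ("t1", "how"), ("t2", "how"))

// First compute a global count per word...
val counts = pairs.groupBy(_._2).map { case (w, ps) => (w, ps.size) }

// ...then join the count back onto each (title, word) pair,
// all on the driver side, with no RDD nested inside a closure.
val joined = pairs.map { case (title, w) => (title, w, counts(w)) }
```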
val sc = SparkApplicationContext.coreCtx
val N = 20
var rdd: RDD[String] = sc.parallelize(Seq("t1\thi how how you,you", "t1\tcat dog,cat,mouse how you,you"))
val splitRdd: RDD[Array[String]] = rdd.map(line => line.split("\t"))
// Unique words per title, then reduced by word into a count
val wordCountRdd = splitRdd.flatMap(arr =>
arr(1).split(" |,").distinct // Including the comma because you seem to split on it later too, but I don't think you actually need to
.map(word => (word, 1))
).reduceByKey{case (cumm, one) => cumm + one}
val s: RDD[(String, Double)] = wordCountRdd.map{ case (word, freq) => (word, scala.math.log(N / freq)) }
s.collect().map(x => println(x._1 + ", " + x._2))
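As a quick check of the per-line split(" |,").distinct logic used above, the same computation can be run on plain Scala collections (no Spark; same sample data as the answer):

```scala
// Same sample lines as the answer above, in title<TAB>text format.
val docs = Seq("t1\thi how how you,you", "t1\tcat dog,cat,mouse how you,you")

// Split each text on space or comma, dedupe within the line,
// then count in how many lines each word appears.
val wordCounts = docs
  .map(_.split("\t"))
  .flatMap(arr => arr(1).split(" |,").distinct.map(w => (w, 1)))
  .groupBy(_._1)
  .map { case (w, ones) => (w, ones.size) }
```

Note that String.split takes a regex, so " |," matches a single space or a single comma.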