Apache Spark: filtering broadcast variables out of an RDD

Tags: apache-spark

I am learning about broadcast variables and trying to use them to filter words out of an RDD, but I have not been able to get it to work.

Here is my sample data:

content.txt

Hello this is Rogers.com
This is Bell.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce

remove.txt

Hello, is, this, the

Script

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))

scala> val remove = sc.textFile("FilterCount/Remove.txt")
scala> val removeRDD = remove.flatMap(x => x.split(",")).map(w => w.trim)

scala> val bRemove = sc.broadcast(removeRDD.collect().toList)

scala> val filtered = contentRDD.filter{case (word) => !bRemove.value.contains(word)}

scala> filtered.foreach(print)

Output:

Hello this is Rogers.com This is Bell.com Apache Spark Training This is Spark Learning Session Spark is faster than MapReduce


As shown above, the filtered output still contains the words from the broadcast variable. How can I remove them?
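The failure can be reproduced without Spark by standing in plain Scala lists for the RDDs. The token list below is an assumption about what `flatMap(x => x.split(","))` produces from content.txt:

```scala
// Simulating contentRDD after flatMap(_.split(",")): space-separated lines
// stay as single tokens, and the comma-separated line keeps leading spaces.
val stopWords = List("Hello", "is", "this", "the")
val tokens = List("Hello this is Rogers.com", "Hello", " is", " this", " the")

// The original filter: drop tokens that appear in the stop-word list.
val filtered = tokens.filter(w => !stopWords.contains(w))
// Only the exact token "Hello" is removed; the whole lines and the
// space-prefixed tokens never equal the trimmed stop words.
```

This is why whole sentences and words like " is" survive the filter: `contains` compares full tokens, and the tokens were never split into individual words.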

This is because you are splitting the file on ",", but your content file is separated by spaces.

Replace it with:

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(" "))
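The difference between the two delimiters can be checked in a plain Scala REPL, without Spark; the sample lines below are taken from content.txt and remove.txt above:

```scala
val lines = List("Hello this is Rogers.com", "Hello, is, this, the")

// Splitting on "," leaves the space-separated line intact as one token:
val byComma = lines.flatMap(_.split(","))

// Splitting on " " actually breaks the content line into words:
val bySpace = lines.flatMap(_.split(" "))
```

With the space delimiter, `contentRDD` holds individual words, so the `contains` check against the broadcast list can finally match.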
Use this to make the filter ignore case:

val filtered = contentRDD.filter { case (word) =>
  !bRemove.value.map(_.toLowerCase).contains(word.toLowerCase)
}
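Put together with the space delimiter, the case-insensitive filter can be sketched on plain lists (the broadcast is omitted here, since it only ships the list to the executors):

```scala
// Stand-ins for contentRDD and bRemove.value from the answer above.
val content = List("Hello this is Rogers.com", "This is Bell.com")
val stopWords = List("Hello", "is", "this", "the")

val filtered = content
  .flatMap(_.split(" "))
  .filter(word => !stopWords.map(_.toLowerCase).contains(word.toLowerCase))
// Both "This" and "this" are dropped, leaving only the domain names.
```

Lower-casing both sides is what catches "This" in "This is Bell.com", which an exact match against "this" would miss.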

Hope this helps.

Are you sure your check, "Hello".contains("Hello this is Rogers.com"), is correct?

Thanks! This is a classic mistake; I spent hours on it and could not spot it because I was looking in the wrong place. +1 for showing how to ignore case.