Apache Spark: filtering broadcast variables out of an RDD

Tags: apache-spark

I am learning about broadcast variables and trying to use them to filter words out of an RDD, but I have not been able to get it to work.

Here is my sample data:

content.txt

Hello this is Rogers.com
This is Bell.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce

remove.txt

Hello, is, this, the

Script

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))

scala> val remove = sc.textFile("FilterCount/Remove.txt")
scala> val removeRDD = remove.flatMap(x => x.split(",")).map(w => w.trim)

scala> val bRemove = sc.broadcast(removeRDD.collect().toList)

scala> val filtered = contentRDD.filter{case (word) => !bRemove.value.contains(word)}

scala> filtered.foreach(print)

Output:

Hello this is Rogers.com This is Bell.com Apache Spark Training This is Spark Learning Session Spark is faster than MapReduce


As shown above, the filtered output still contains the words from the broadcast variable. How can I remove them?
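The failure can be reproduced without Spark by standing in plain Scala lists for the RDDs. The token list below is an assumption about what `flatMap(x => x.split(","))` produces from content.txt:

```scala
// Simulating contentRDD after flatMap(_.split(",")): space-separated lines
// stay as single tokens, and the comma-separated line keeps leading spaces.
val stopWords = List("Hello", "is", "this", "the")
val tokens = List("Hello this is Rogers.com", "Hello", " is", " this", " the")

// The original filter: drop tokens that appear in the stop-word list.
val filtered = tokens.filter(w => !stopWords.contains(w))
// Only the exact token "Hello" is removed; the whole lines and the
// space-prefixed tokens never equal the trimmed stop words.
```

This is why whole sentences and words like " is" survive the filter: `contains` compares full tokens, and the tokens were never split into individual words.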

This is because you are splitting the file on ",", but your content file is separated by spaces.

Replace it with:

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(" "))
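The difference between the two delimiters can be checked in a plain Scala REPL, without Spark; the sample lines below are taken from content.txt and remove.txt above:

```scala
val lines = List("Hello this is Rogers.com", "Hello, is, this, the")

// Splitting on "," leaves the space-separated line intact as one token:
val byComma = lines.flatMap(_.split(","))

// Splitting on " " actually breaks the content line into words:
val bySpace = lines.flatMap(_.split(" "))
```

With the space delimiter, `contentRDD` holds individual words, so the `contains` check against the broadcast list can finally match.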
Use this to make the filter ignore case:

val filtered = contentRDD.filter { case (word) =>
  !bRemove.value.map(_.toLowerCase).contains(word.toLowerCase)
}
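Put together with the space delimiter, the case-insensitive filter can be sketched on plain lists (the broadcast is omitted here, since it only ships the list to the executors):

```scala
// Stand-ins for contentRDD and bRemove.value from the answer above.
val content = List("Hello this is Rogers.com", "This is Bell.com")
val stopWords = List("Hello", "is", "this", "the")

val filtered = content
  .flatMap(_.split(" "))
  .filter(word => !stopWords.map(_.toLowerCase).contains(word.toLowerCase))
// Both "This" and "this" are dropped, leaving only the domain names.
```

Lower-casing both sides is what catches "This" in "This is Bell.com", which an exact match against "this" would miss.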

Hope this helps.

Are you sure your check, "Hello".contains("Hello this is Rogers.com"), is correct?

Thanks! This is a classic mistake; I spent hours on it and could not spot it because I was looking in the wrong place. +1 for showing how to ignore case.