Twitter popular hashtags using Scala and Apache Spark


scala twitter apache-spark


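I am trying to get popular Twitter hashtags using Apache Spark and Scala. I can print the hashtags, but when I start counting them with the reduce function, I get the following error:

    network.ConnectionManager: Selector thread was interrupted!

I have added my code here. Please help me solve this issue.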

    import java.io._
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import StreamingContext._
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.twitter._

    object TwitterPopularTags {
      def main(args: Array[String]) {
        val (master, filters) = (args(0), args.slice(5, args.length))

        // Twitter Authentication credentials
        System.setProperty("twitter4j.oauth.consumerKey", "****")
        System.setProperty("twitter4j.oauth.consumerSecret", "****")
        System.setProperty("twitter4j.oauth.accessToken", "****")
        System.setProperty("twitter4j.oauth.accessTokenSecret", "****")

        val ssc = new StreamingContext(master, "TwitterPopularTags", Seconds(10),
          System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
        val tweets = TwitterUtils.createStream(ssc, None)
        val statuses = tweets.map(status => status.getText())
        val words = statuses.flatMap(status => status.split(" "))
        val hashTags = words.filter(word => word.startsWith("#"))
        val counts = hashTags.map(tag => (tag, 1))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60 * 5), Seconds(10))
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

    [error] (run-main) java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use StreamingContext.checkpoint() or SparkContext.checkpoint() to set the checkpoint directory.
    java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use StreamingContext.checkpoint() or SparkContext.checkpoint() to set the checkpoint directory.
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.streaming.dstream.DStream.validate(DStream.scala:181)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:227)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:227)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.streaming.dstream.DStream.validate(DStream.scala:227)
        at org.apache.spark.streaming.DStreamGraph$$anonfun$start$3.apply(DStreamGraph.scala:47)
        at org.apache.spark.streaming.DStreamGraph$$anonfun$start$3.apply(DStreamGraph.scala:47)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.streaming.DStreamGraph.start(DStreamGraph.scala:47)
        at org.apache.spark.streaming.scheduler.JobGenerator.startFirstTime(JobGenerator.scala:114)
        at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:75)
        at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
        at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:410)
        at TwitterPopularTags$.main(TwitterPopularTags.scala:77)
        at TwitterPopularTags.main(TwitterPopularTags.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
    [trace] Stack trace suppressed: run last compile:run for the full output.
    14/11/07 20:07:43 INFO dstream.NetworkReceiver$BlockGenerator: Block pushing thread interrupted
    14/11/07 20:07:43 INFO network.ConnectionManager: Selector thread was interrupted!
    java.lang.RuntimeException: Nonzero exit code: 1
        at scala.sys.package$.error(package.scala:27)
    [trace] Stack trace suppressed: run last compile:run for the full output.
    [error] (compile:run) Nonzero exit code: 1
    [error] Total time: 41 s, completed Nov 7, 2014 8:07:43 PM

This is the error I get when I try to run the code above.

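You are using reduceByKeyAndWindow with an inverse reduce function, which forces you to enable checkpointing in Spark. You can look up how to do this; it is a one-line change.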
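
As a rough illustration of that one-line change, the sketch below wraps the windowed count from the question in a helper that sets the checkpoint directory first. The object name `CheckpointedCounts`, the method `windowedCounts`, and the `checkpointDir` parameter are hypothetical names introduced here, not part of the question or of Spark; the sketch assumes spark-streaming is on the classpath.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Needed for the pair-DStream operations on older Spark releases such as the
// one in the question; newer releases resolve them without this import.
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper, named here only for illustration.
object CheckpointedCounts {
  def windowedCounts(ssc: StreamingContext,
                     hashTags: DStream[String],
                     checkpointDir: String): DStream[(String, Int)] = {
    // The one-line fix: set a checkpoint directory before ssc.start().
    ssc.checkpoint(checkpointDir)

    // Same windowed count as in the question: 5-minute window, 10-second slide,
    // with `_ - _` as the inverse reduce function.
    hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60 * 5), Seconds(10))
  }
}
```

In the question's own code, the equivalent change would be a call such as `ssc.checkpoint("checkpoint")` placed before `ssc.start()`, where the directory name is an assumption and should point to a location Spark can write to.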
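
At first glance, one thing I noticed is that your slide interval is smaller than the batch size of the underlying stream, which is not supported. Can you post the full exception you get?

Hi, I have attached the error log and changed the slide interval, so it now equals the batch size. I am new to Spark, and my goal is to count all the tweets appearing on Twitter every 10 seconds. Please help me proceed this way.

To use the optimized reduceByKeyAndWindow, we need to set a checkpoint directory so that Spark can keep track of some extra state information. You can either call ssc.checkpoint and set a checkpoint directory, or use the naive reduceByKeyAndWindow (omitting the `_ - _` part).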
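
To make the difference between the two variants concrete, here is a small plain-Scala sketch with no Spark involved. It models each batch as a single count and compares recomputing the whole window on every slide with carrying the previous total forward. The object `WindowReduceSketch` and its methods are illustrative names, not Spark APIs, and Spark's real implementation works on keyed RDDs rather than a vector of integers.

```scala
object WindowReduceSketch {
  // Naive variant: on every slide, reduce all batches currently in the window.
  // No state is carried between slides.
  def naive(batches: Vector[Int], window: Int): Vector[Int] =
    batches.indices.map { i =>
      batches.slice(math.max(0, i - window + 1), i + 1).sum
    }.toVector

  // Incremental variant: keep the previous window total, add the batch that
  // enters the window (like `_ + _`) and subtract the one that leaves it
  // (like `_ - _`). The running total is extra state between slides.
  def incremental(batches: Vector[Int], window: Int): Vector[Int] = {
    var previous = 0
    batches.indices.map { i =>
      val leaving = if (i >= window) batches(i - window) else 0
      previous = previous + batches(i) - leaving
      previous
    }.toVector
  }

  def main(args: Array[String]): Unit = {
    val perBatchCounts = Vector(3, 1, 4, 1, 5, 9)
    println(naive(perBatchCounts, 3))       // Vector(3, 4, 8, 6, 10, 15)
    println(incremental(perBatchCounts, 3)) // Vector(3, 4, 8, 6, 10, 15)
  }
}
```

Both variants produce the same totals. The incremental one does less work per slide but depends on the previous result, which is the kind of extra state the comment above refers to when it says a checkpoint directory is required.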