Scala Spark Twitter Streaming


scala twitter apache-spark


I am new to Spark and Scala. I wrote a program that uses Spark Streaming to fetch hashtags or tweets from Twitter. My code is:

    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkTwitterHelloWorldExample")
    val jssc = new StreamingContext(conf, new Duration(1000))
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    val twitterStream = TwitterUtils.createStream(jssc, None, Array("#Spark"))

    // Without filter: Output text of all tweets
    val statuses = twitterStream.map { status => status.getText() }
    val hashTags = statuses.filter(word => word.startsWith("#Spark"))
    val tagCounts = hashTags.window(Seconds(100), Seconds(10)).countByValue()
    hashTags.count().print()
    tagCounts.count().print()
    jssc.start()

This code always prints 0 and I don't know why. If anyone knows, could you help me? Thanks.

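I think that, as written, this code only finds tweets whose status text starts with #Spark. Beyond that, I suggest lowercasing the text so that you match #spark, #Spark, #SPARK, and so on. Could you try this: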

    // After toLowerCase the needle must be lowercase too, otherwise nothing matches
    val hashTags = statuses.filter(text => text.toLowerCase.contains("#spark"))
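
To see why the difference between startsWith and a case-insensitive contains matters, here is a minimal sketch that runs on plain strings, with no Spark or Twitter connection needed. The object name HashtagFilter, the helper mentionsTag, and the sample tweets are made up for illustration and are not part of the original code.

```scala
object HashtagFilter {
  // True if `text` contains `tag` anywhere, ignoring case.
  // Both sides are lowercased so "#Spark", "#spark" and "#SPARK" all match.
  def mentionsTag(text: String, tag: String): Boolean =
    text.toLowerCase.contains(tag.toLowerCase)

  def main(args: Array[String]): Unit = {
    // Hypothetical tweet texts, for illustration only
    val samples = Seq(
      "#Spark is fast",
      "Trying out #spark streaming today",
      "Nothing relevant here"
    )
    // The original filter: only tweets that begin with exactly "#Spark"
    println(samples.count(_.startsWith("#Spark")))
    // The case-insensitive filter: the tag may appear anywhere in the text
    println(samples.count(mentionsTag(_, "#Spark")))
  }
}
```

In the streaming job the same predicate could presumably be passed to the stream's filter, for example statuses.filter(text => HashtagFilter.mentionsTag(text, "#Spark")).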

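Another option is to first extract all the hashtags from each status and then continue from that list of hashtags. You can find an example of this in the Spark examples: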

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
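
The hashtag-first approach can be sketched on plain collections as follows. This is only an illustration of the idea (split each text into tokens, keep those starting with #, then count), not a copy of the linked example; the names HashtagCounts, extractHashtags, and countTags are assumptions introduced here.

```scala
object HashtagCounts {
  // Split the text on whitespace and keep the tokens that start with '#'.
  // Tokens are lowercased so "#Spark" and "#spark" count as the same tag.
  // Trailing punctuation (e.g. "#spark!") is not stripped in this sketch.
  def extractHashtags(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(_.startsWith("#")).map(_.toLowerCase)

  // Count how often each hashtag occurs across all texts.
  def countTags(texts: Seq[String]): Map[String, Int] =
    texts
      .flatMap(extractHashtags)
      .groupBy(identity)
      .map { case (tag, hits) => tag -> hits.size }
}
```

On a stream of status texts, the analogous steps would be a flatMap over extractHashtags followed by a windowed count, which is roughly what the question's countByValue call already does.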

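Thanks for your answer. Now when I print hashTags I only get the batch time, i.e. Time: 1450611281000 ms. Can you tell me how to get the statuses? With the example you linked I get 0 again, and I don't know what the problem is.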