Output contents of a DStream in Scala Apache Spark


The Spark code below doesn't seem to do anything with the file example.txt:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new org.apache.spark.SparkConf()
  .setMaster("local")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g")

val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("C:\\example.txt")

dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
Some of the output generated when trying to use dataFile.print():

15/03/12 12:23:53 INFO JobScheduler: Started JobScheduler
15/03/12 12:23:54 INFO FileInputDStream: Finding new files took 105 ms
15/03/12 12:23:54 INFO FileInputDStream: New files at time 1426163034000 ms:

15/03/12 12:23:54 INFO JobScheduler: Added jobs for time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Starting job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
-------------------------------------------
Time: 1426163034000 ms
-------------------------------------------

15/03/12 12:23:54 INFO JobScheduler: Finished job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Total delay: 0.157 s for time 1426163034000 ms (execution: 0.006 s)
15/03/12 12:23:54 INFO FileInputDStream: Cleared 0 old files that were older than 1426162974000 ms: 
15/03/12 12:23:54 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:55 INFO FileInputDStream: Finding new files took 2 ms
15/03/12 12:23:55 INFO FileInputDStream: New files at time 1426163035000 ms:

15/03/12 12:23:55 INFO JobScheduler: Added jobs for time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Starting job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
-------------------------------------------
Time: 1426163035000 ms
-------------------------------------------

15/03/12 12:23:55 INFO JobScheduler: Finished job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Total delay: 0.011 s for time 1426163035000 ms (execution: 0.001 s)
15/03/12 12:23:55 INFO MappedRDD: Removing RDD 1 from persistence list
15/03/12 12:23:55 INFO BlockManager: Removing RDD 1
15/03/12 12:23:55 INFO FileInputDStream: Cleared 0 old files that were older than 1426162975000 ms: 
15/03/12 12:23:55 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:56 INFO FileInputDStream: Finding new files took 3 ms
15/03/12 12:23:56 INFO FileInputDStream: New files at time 1426163036000 ms:

example.txt is in the format:

gdaeicjdcg,194,155,98,107
jhbcfbdigg,73,20,122,172
ahdjfgccgd,28,47,40,178
afeidjjcef,105,164,37,53
afeiccfdeg,29,197,128,85
aegddbbcii,58,126,89,28
fjfdbfaeid,80,89,180,82

As the documentation for print states:

/**
 * Print the first ten elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */

Does this mean the stream generated zero RDDs? In Apache Spark, to see the contents of an RDD you would use the RDD's collect function. Is there something similar for streams? In short, how do I print the contents of a stream to the console?
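(For illustration, this is the kind of per-RDD approach I have in mind; an untested sketch that applies foreachRDD and collect to the dataFile stream above:)

// Untested sketch: apply the usual RDD approach to each micro-batch RDD,
// collecting it on the driver and printing its lines.
dataFile.foreachRDD { rdd =>
  rdd.collect().foreach(println)
}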

Update:

Updated the code based on @0x0FFF's comment. There don't seem to be any examples of reading from the local file system. Isn't this as common as with Spark Core, where there are explicit examples of reading data from a file?
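For comparison, this is the kind of explicit Spark Core example I mean: a plain batch read of a local file rather than a stream (sketch only; it reuses the conf above for a SparkContext).

// Spark Core (batch) read of the same local file, for comparison with the streaming code below.
val sc = new org.apache.spark.SparkContext(conf)
val batchLines = sc.textFile("C:\\example.txt")
batchLines.take(10).foreach(println)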

The updated code:

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g");

val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("file:///c:/data/")

dataFile.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
But the output is the same. When I add new files to the c:\\data dir (in the same format as the existing data file), they are not processed. I assume dataFile.print should print the first 10 lines to the console.
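(For reference, the sort of check that would exercise this: programmatically dropping a brand-new file into the watched directory after the stream has started. The file name here is made up.)

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write a fresh file into the watched directory after ssc.start();
// textFileStream only notices files that appear after the stream starts.
val row = "gdaeicjdcg,194,155,98,107\n"
Files.write(Paths.get("c:/data/new-batch.txt"), row.getBytes(StandardCharsets.UTF_8))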

Update 2:


Perhaps this is related to the fact that I'm running this code in a Windows environment?

You are misusing textFileStream. Here is its description from the Spark documentation:

Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat)

So, first of all, you should pass it a directory, and secondly, that directory should be reachable from the node running the receiver, so it is better to use HDFS for this purpose. Then, when you put a new file into that directory, it will be processed: print() will print its first 10 lines.
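A rough sketch of that intended usage (the directory URI here is just an example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Watch a directory (not a single file); only files that appear in it
// after the stream has started are picked up.
val conf = new SparkConf().setMaster("local[2]").setAppName("filter")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.textFileStream("hdfs:///tmp/streaming-input/")
lines.print()
ssc.start()
ssc.awaitTermination()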

Update:

My code:

[alex@sparkdemo tmp]$ pyspark --master local[2]
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
s15/03/12 06:37:49 WARN Utils: Your hostname, sparkdemo resolves to a loopback address: 127.0.0.1; using 192.168.208.133 instead (on interface eth0)
15/03/12 06:37:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Python version 2.6.6 (r266:84292, Nov 22 2013 12:16:22)
SparkContext available as sc.
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 30)
>>> dataFile = ssc.textFileStream('file:///tmp')
>>> dataFile.pprint()
>>> ssc.start()
>>> ssc.awaitTermination()
-------------------------------------------
Time: 2015-03-12 06:40:30
-------------------------------------------

-------------------------------------------
Time: 2015-03-12 06:41:00
-------------------------------------------

-------------------------------------------
Time: 2015-03-12 06:41:30
-------------------------------------------
1 2 3
4 5 6
7 8 9

-------------------------------------------
Time: 2015-03-12 06:42:00
-------------------------------------------

Here is a custom receiver I wrote that listens for data in a specified directory:

package receivers

import java.io.File
import org.apache.spark.{ SparkConf, Logging }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(dir: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("File Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  def recursiveListFiles(f: File): Array[File] = {
    val these = f.listFiles
    these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
  }

  private def receive() {

    for (f <- recursiveListFiles(new File(dir))) {

      val source = scala.io.Source.fromFile(f)
      val lines = source.getLines
      store(lines)
      source.close()
      logInfo("Stopped receiving")
      restart("Trying to connect again")

    }
  }
}
More information is available at: &
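A minimal sketch of how this receiver might be wired in (the directory path and app name are just placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import receivers.CustomReceiver

val conf = new SparkConf().setMaster("local[2]").setAppName("custom-receiver")
val ssc = new StreamingContext(conf, Seconds(1))
// receiverStream plugs a user-defined Receiver into the streaming job.
val lines = ssc.receiverStream(new CustomReceiver("/tmp/streaming-in"))
lines.print()
ssc.start()
ssc.awaitTermination()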

I may have found your problem. You should have the following in your logs:

WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
The issue is that to run a Spark Streaming application you need at least 2 cores. So the solution should be to simply replace:

val conf = new org.apache.spark.SparkConf()
 .setMaster("local")
with:

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[*]")

Or at least more than one core.
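Equivalently, as I read the warning above, pinning exactly two local threads should also be enough:

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")  // n > 1, so the receiver and the processing jobs each get a thread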

So it's not possible to test Spark Streaming with data read from the local file system?

You can, if you run the stream locally and allow it to use at least 2 cores, i.e. local[2], and also specify that you are using the local file system by putting file:// in the directory name. I did exactly the same thing with pyspark and a local file on CentOS, and it worked, so this issue might be Windows-related.

Thanks for posting a helpful answer, but I will leave this question open, because either Spark Streaming does not support Windows or I am not reading files on Windows correctly. I have not found documentation that states this explicitly, but I also have not found documentation that walks through a Spark Streaming example on Windows.

Good, maybe someone knows the Windows details. I think you should try the Spark mailing list, user@spark.apache.org