Apache Spark: textFileStream does not read files
I am trying to get Spark Streaming to work, but it does not read any of the files I put into the directory.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    if __name__ == "__main__":
        sc = SparkContext("local[*]", "StreamTest")
        ssc = StreamingContext(sc, 1)
        ssc.checkpoint("checkpoint")

        files = ssc.textFileStream("file:///ApacheSpark/MLlib_testing/Streaming/data")
        words = files.flatMap(lambda line: line.split(" "))
        pairs = words.map(lambda word: (word, 1))
        wordCounts = pairs.reduceByKey(lambda x, y: x + y)

        print("Oled siin ??")
        wordCounts.pprint()

        ssc.start()
        ssc.awaitTermination()
Everything starts up fine, but no files are ever read from the folder. The print statement executes exactly once, when the application starts. What am I doing wrong?

I am on Windows 10, using Spark 1.6.2 (I could not get Spark 2.0.0 to run).
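One thing worth knowing about textFileStream: it only picks up files that appear in the monitored directory after the stream has started and whose modification time falls inside the current batch window; pre-existing files and files still being written are skipped. A common workaround is to create the file as a hidden temp file and rename it into place, so the stream only ever sees a complete file with a fresh timestamp. A minimal sketch (the helper name `drop_into_watched_dir` is mine, not part of any API):

```python
import os
import tempfile

def drop_into_watched_dir(text, watched_dir, name):
    """Write `text` to a hidden temp file inside the watched directory,
    then rename it into place so the stream only ever sees a complete
    file with a fresh modification time."""
    if not os.path.isdir(watched_dir):
        os.makedirs(watched_dir)
    # Dot-prefixed names are ignored by Spark's default file filter,
    # so the half-written temp file stays invisible to the stream.
    fd, tmp_path = tempfile.mkstemp(dir=watched_dir, prefix=".")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    dest = os.path.join(watched_dir, name)
    os.rename(tmp_path, dest)  # same directory, so the rename is atomic
    return dest
```

Copying a file into the directory with Explorer or `copy` can leave the stream a partially written file or a stale modification time; the rename-into-place pattern avoids both.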
Edit 1

Adding some console log output from one batch interval:
16/09/07 11:36:57 INFO JobScheduler: Added jobs for time 1473237417000 ms
16/09/07 11:36:57 INFO JobGenerator: Checkpointing graph for time 1473237417000 ms
16/09/07 11:36:57 INFO DStreamGraph: Updating checkpoint data for time 1473237417000 ms
16/09/07 11:36:57 INFO DStreamGraph: Updated checkpoint data for time 1473237417000 ms
16/09/07 11:36:57 INFO CheckpointWriter: Submitted checkpoint of time 1473237417000 ms writer queue
16/09/07 11:36:57 INFO CheckpointWriter: Saving checkpoint for time 1473237417000 ms to file 'file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473237417000'
16/09/07 11:36:57 INFO CheckpointWriter: Deleting file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473233874000.bk
16/09/07 11:36:57 INFO CheckpointWriter: Checkpoint for time 1473237417000 ms saved to file 'file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473237417000', took 6071 bytes and 72 ms
16/09/07 11:36:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
16/09/07 11:36:57 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 82 bytes
16/09/07 11:36:57 INFO DAGScheduler: Got job 1 (runJob at PythonRDD.scala:393) with 3 output partitions
16/09/07 11:36:57 INFO DAGScheduler: Final stage: ResultStage 3 (runJob at PythonRDD.scala:393)
16/09/07 11:36:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/09/07 11:36:57 INFO DAGScheduler: Missing parents: List()
16/09/07 11:36:57 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43), which has no missing parents
16/09/07 11:36:57 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.1 KB, free 15.8 KB)
16/09/07 11:36:57 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 19.3 KB)
16/09/07 11:36:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:59483 (size: 3.5 KB, free: 511.1 MB)
16/09/07 11:36:57 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/09/07 11:36:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43)
16/09/07 11:36:57 INFO TaskSchedulerImpl: Adding task set 3.0 with 3 tasks
16/09/07 11:36:57 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 2, localhost, partition 2,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 3, localhost, partition 3,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO Executor: Running task 0.0 in stage 3.0 (TID 1)
16/09/07 11:36:57 INFO Executor: Running task 1.0 in stage 3.0 (TID 2)
16/09/07 11:36:57 INFO Executor: Running task 2.0 in stage 3.0 (TID 3)
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 23 ms
16/09/07 11:36:58 INFO FileInputDStream: Finding new files took 3 ms
16/09/07 11:36:58 INFO FileInputDStream: New files at time 1473237418000 ms:
Comments:

Any log output from the executors? – LiMuBei

@LiMuBei I added console log output for one batch interval above. It is always the same whether or not I add files; nothing changes.

Which folder are you putting the files in? The log suggests it automatically prefixes the path with the user directory. Just making sure it is not something really simple. Also, what do the executor logs tell you? You can get them through the Spark web UI. – LiMuBei

The folder is in the same location as the streaming.py file. The logs do not seem to show any errors; everything just sits and waits.

@LiMuBei I wanted the path to be just "data", without the full path, but when I put in something like file:data I get errors. I think the problem is that it is looking in the wrong place. But how can I tell it that the data folder is in the same location as the .py file?
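To answer the last comment: a relative path like "data" is resolved against the process working directory, not the script location, which is why the log shows the user directory being prepended. One way to build a stable `file:///` URI for a folder that sits next to the script (a sketch, assuming a local filesystem; `watched_dir_uri` is a hypothetical helper):

```python
import os

def watched_dir_uri(script_path, folder="data"):
    """Build a file:/// URI for a folder next to the given script,
    using forward slashes so the URI is also valid on Windows."""
    base = os.path.dirname(os.path.abspath(script_path))
    path = os.path.join(base, folder)
    # Normalize Windows backslashes and drop the leading slash so the
    # result always has exactly three slashes after "file:".
    return "file:///" + path.replace("\\", "/").lstrip("/")
```

In the streaming script this would be called as `ssc.textFileStream(watched_dir_uri(__file__))`, so the watched directory follows the script wherever it lives.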