Apache Spark: textFileStream not reading files

I am trying to get Spark Streaming working, but it does not read any files that I put into the directory.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext("local[*]", "StreamTest")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint("checkpoint")

    files = ssc.textFileStream("file:///ApacheSpark/MLlib_testing/Streaming/data")

    words = files.flatMap(lambda line: line.split(" "))
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x,y: x+y)
    print "Oled siin ??"

    wordCounts.pprint()

    ssc.start()
    ssc.awaitTermination()
Everything starts up fine, but nothing is ever read from the folder. The print statement executes once, when the application starts. What am I doing wrong?
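
A likely culprit: textFileStream only reports files whose modification time falls after the stream has started, so anything already sitting in the directory when ssc.start() runs is ignored, and new files must appear in the directory atomically (written elsewhere, then moved or renamed in) so Spark never sees a half-written file. A minimal sketch of feeding the stream this way, reusing the data directory from the question (the file contents and temp location are made up for illustration):

import os
import shutil
import tempfile

def drop_into_watched_dir(text, watched_dir):
    # Write the file outside the monitored directory first...
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    # ...then move it in. On the same filesystem this is a rename, so the
    # stream sees a complete file with a fresh modification time.
    shutil.move(tmp_path, os.path.join(watched_dir, os.path.basename(tmp_path)))

drop_into_watched_dir("hello spark streaming",
                      "/ApacheSpark/MLlib_testing/Streaming/data")

With the stream already running, each call should show up in the next batch's word counts.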

I am on Windows 10 and using Spark 1.6.2. I could not get Spark 2.0.0 to run.

Edit 1: I am adding some console log output:

16/09/07 11:36:57 INFO JobScheduler: Added jobs for time 1473237417000 ms
16/09/07 11:36:57 INFO JobGenerator: Checkpointing graph for time 1473237417000 ms
16/09/07 11:36:57 INFO DStreamGraph: Updating checkpoint data for time 1473237417000 ms
16/09/07 11:36:57 INFO DStreamGraph: Updated checkpoint data for time 1473237417000 ms
16/09/07 11:36:57 INFO CheckpointWriter: Submitted checkpoint of time 1473237417000 ms writer queue
16/09/07 11:36:57 INFO CheckpointWriter: Saving checkpoint for time 1473237417000 ms to file 'file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473237417000'
16/09/07 11:36:57 INFO CheckpointWriter: Deleting file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473233874000.bk
16/09/07 11:36:57 INFO CheckpointWriter: Checkpoint for time 1473237417000 ms saved to file 'file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473237417000', took 6071 bytes and 72 ms
16/09/07 11:36:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
16/09/07 11:36:57 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 82 bytes
16/09/07 11:36:57 INFO DAGScheduler: Got job 1 (runJob at PythonRDD.scala:393) with 3 output partitions
16/09/07 11:36:57 INFO DAGScheduler: Final stage: ResultStage 3 (runJob at PythonRDD.scala:393)
16/09/07 11:36:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/09/07 11:36:57 INFO DAGScheduler: Missing parents: List()
16/09/07 11:36:57 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43), which has no missing parents
16/09/07 11:36:57 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.1 KB, free 15.8 KB)
16/09/07 11:36:57 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 19.3 KB)
16/09/07 11:36:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:59483 (size: 3.5 KB, free: 511.1 MB)
16/09/07 11:36:57 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/09/07 11:36:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43)
16/09/07 11:36:57 INFO TaskSchedulerImpl: Adding task set 3.0 with 3 tasks
16/09/07 11:36:57 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 2, localhost, partition 2,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 3, localhost, partition 3,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO Executor: Running task 0.0 in stage 3.0 (TID 1)
16/09/07 11:36:57 INFO Executor: Running task 1.0 in stage 3.0 (TID 2)
16/09/07 11:36:57 INFO Executor: Running task 2.0 in stage 3.0 (TID 3)
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 23 ms
16/09/07 11:36:58 INFO FileInputDStream: Finding new files took 3 ms
16/09/07 11:36:58 INFO FileInputDStream: New files at time 1473237418000 ms:
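
The last two lines are the telling ones: the directory scan itself succeeds every batch, but the "New files" list is empty, so FileInputDStream sees the folder and simply considers nothing in it new. Since file selection is based on modification time, a quick way to check what the scan would find is to compare each file's mtime against the current time. A rough diagnostic sketch, run separately from the streaming job (the path is the one from the question's code):

import os
import time

watched = "/ApacheSpark/MLlib_testing/Streaming/data"
now = time.time()
for name in os.listdir(watched):
    mtime = os.path.getmtime(os.path.join(watched, name))
    # Only files modified after the stream started are picked up.
    print("%s: modified %.0f seconds ago" % (name, now - mtime))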

Any log output from the executors?

@LiMuBei Added console log output for one iteration on the folder. It is always the same whether or not I add files; nothing changes.

Which folder do you put the files in? The log suggests it automatically prefixes the path with the user directory. Just making sure it isn't something really simple. Also, what do the executor logs tell you? You can get them through the Spark web UI.

The folder is in the same location as the streaming.py file. The logs don't seem to show any errors; everything just keeps waiting.

@LiMuBei I wanted the folder to be just "data", without the full path. But when I put in file:data or something similar, I got errors. I think the problem is that it is looking in the wrong place. But how can I tell it that the data folder is in the same location as the .py file?
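
On that last question, one way to point textFileStream at a data folder sitting next to the script, without hard-coding the full path, is to build the URI from the script's own location. A sketch of what the textFileStream line from the question could become, assuming the script is run as a file rather than from a shell (ssc is the StreamingContext from the question's code):

import os

# Resolve the data directory relative to this script, then turn it into a
# file: URI; forward slashes are accepted on Windows as well.
script_dir = os.path.dirname(os.path.abspath(__file__))
data_dir = os.path.join(script_dir, "data").replace("\\", "/")
files = ssc.textFileStream("file:///" + data_dir)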