Amazon S3: Reading a SequenceFile from S3 with PySpark on EMR results in RACK_LOCAL locality

Tags: amazon-s3, apache-spark, sequencefile

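I'm trying to use PySpark on EMR to analyze some data stored as SequenceFiles on S3, but I'm running into performance problems because of data locality. Here is a very simple example that performs poorly: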

seqRDD = sc.sequenceFile("s3n://<access>:<secret>@<bucket>/<table>/day=2015-07-04/hour=*/*")
seqRDD.count()
Some of the logs emitted while running the count show only 2 IPs:


15/07/17 23:55:28 INFO scheduler.DAGScheduler: Submitting 1354 missing tasks from Stage 1 (PythonRDD[3] at count at :1)
15/07/17 23:55:28 INFO cluster.YarnScheduler: Adding task set 1.0 with 1354 tasks
15/07/17 23:55:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1418 bytes)
15/07/17 23:55:28 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ip-172-31-36-179.ec2.internal, RACK_LOCAL, 1420 bytes)
15/07/17 23:55:28 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-36-179.ec2.internal:39998 (size: 3.7 KB, free: 535.0 MB)
15/07/17 23:55:28 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-41-210.ec2.internal:36847 (size: 3.7 KB, free: 535.0 MB)
15/07/17 23:55:29 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-41-210.ec2.internal:36847 (size: 18.8 KB, free: 535.0 MB)
15/07/17 23:55:31 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1421 bytes)
15/07/17 23:55:31 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 3501 ms on ip-172-31-41-210.ec2.internal (1/1354)
15/07/17 23:55:31 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1420 bytes)
15/07/17 23:55:31 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 99 ms on ip-172-31-41-210.ec2.internal (2/1354)
15/07/17 23:55:33 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ip-172-31-36-179.ec2.internal, RACK_LOCAL, 1420 bytes)
15/07/17 23:55:33 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 5190 ms on ip-172-31-36-179.ec2.internal (3/1354)
15/07/17 23:55:36 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.0 (TID 6, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1420 bytes)
15/07/17 23:55:36 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 1.0 (TID 4) in 4471 ms on ip-172-31-41-210.ec2.internal (4/1354)
15/07/17 23:55:37 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 1.0 (TID 7, ip-172-31-36-179.ec2.internal, RACK_LOCAL, 1420 bytes)
15/07/17 23:55:37 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0 (TID 5) in 3676 ms on ip-172-31-36-179.ec2.internal (5/1354)
15/07/17 23:55:40 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 1.0 (TID 8, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1420 bytes)
15/07/17 23:55:40 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 1.0 (TID 6) in 3895 ms on ip-172-31-41-210.ec2.internal (6/1354)
15/07/17 23:55:40 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 1.0 (TID 9, ip-172-31-36-179.ec2.internal, RACK_LOCAL, 1420 bytes)
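
To check how tasks in a log like the one above are spread across hosts, a small script can tally the "Starting task" lines by host and locality level. This is only a sketch: the function name `tally_task_starts` and the regular expression are illustrative assumptions based on the line format shown here, not part of Spark.

```python
import re
from collections import Counter

# Assumed pattern for TaskSetManager "Starting task" lines, derived from
# the log format shown above (not an official Spark format definition).
START_RE = re.compile(
    r"Starting task \S+ in stage \S+ "
    r"\(TID \d+, (?P<host>[^,]+), (?P<locality>[A-Z_]+), \d+ bytes\)"
)


def tally_task_starts(log_lines):
    """Count started tasks per host and per locality level."""
    hosts = Counter()
    localities = Counter()
    for line in log_lines:
        match = START_RE.search(line)
        if match:  # lines such as "Finished task ..." are skipped
            hosts[match.group("host")] += 1
            localities[match.group("locality")] += 1
    return hosts, localities


# A few lines copied from the log above.
sample = [
    "15/07/17 23:55:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1418 bytes)",
    "15/07/17 23:55:28 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ip-172-31-36-179.ec2.internal, RACK_LOCAL, 1420 bytes)",
    "15/07/17 23:55:31 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ip-172-31-41-210.ec2.internal, RACK_LOCAL, 1421 bytes)",
    "15/07/17 23:55:31 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 3501 ms on ip-172-31-41-210.ec2.internal (1/1354)",
]

hosts, localities = tally_task_starts(sample)
print(dict(hosts))
print(dict(localities))
```

Seeing only two distinct hosts for a stage with 1354 tasks suggests that only two executors are running, whatever the size of the cluster.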

Also worth noting: I ran pyspark with --master yarn-client.

Did you set the number of executors? The default for spark-submit is 2.

Oh, I was under the impression that setting up Spark on EMR takes care of configuring the right number of executors... I'll try setting it manually and see what happens. Thanks for the suggestion!

This happened because I didn't specify the -x option when installing Spark on EMR; without that option, EMR doesn't configure anything for Spark. Strangely, the option isn't even mentioned in their official documentation! After specifying it, everything works perfectly.
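
If the cluster was not configured automatically, the executor count can also be set by hand when launching pyspark on YARN. The sketch below only assembles the command line; the helper name `build_pyspark_command` and the values 10, 4, and "4g" are hypothetical and would need to match the actual cluster. The `yarn-client` master string is the Spark 1.x form used in this question; later Spark versions use `--master yarn --deploy-mode client` instead.

```python
def build_pyspark_command(num_executors, executor_cores, executor_memory,
                          master="yarn-client"):
    """Assemble a pyspark launch command with explicit executor settings.

    The sizing values are supplied by the caller; nothing here is derived
    from the cluster itself.
    """
    return [
        "pyspark",
        "--master", master,
        "--num-executors", str(num_executors),    # YARN-only option
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
    ]


# Hypothetical sizing for illustration only.
cmd = build_pyspark_command(num_executors=10, executor_cores=4,
                            executor_memory="4g")
print(" ".join(cmd))
```

Setting these values explicitly overrides the default of 2 executors mentioned above, so more hosts can take part in reading from S3.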
