Unable to submit PySpark code through the ExecuteSparkInteractive processor in Apache NiFi

Tags: pyspark, apache-nifi, livy

I am new to Python and the Apache ecosystem. I am trying to submit PySpark code through the ExecuteSparkInteractive processor in Apache NiFi. I do not have detailed knowledge of any of the components used here; I am only doing Google searching and hit-and-trial.

This way I have successfully configured and started Spark, NiFi and Livy in EMR. I can submit PySpark code through Livy in an interactive session.

However, when I configure ExecuteSparkInteractive to submit PySpark code through Livy, nothing happens. The Livy session manager shows nothing, and there are no visible errors in the ExecuteSparkInteractive processor.
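For context, submitting through Livy's REST API directly works for me. The following is a minimal sketch of that exchange (the host name is a placeholder and Livy's default port 8998 is assumed):

import json
import time

import requests

LIVY_URL = "http://<livy-host>:8998"  # placeholder host; 8998 is Livy's default port

# create an interactive PySpark session
session = requests.post(LIVY_URL + "/sessions", json={"kind": "pyspark"}).json()
session_url = "{}/sessions/{}".format(LIVY_URL, session["id"])

# wait until the session reaches "idle" before submitting code
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# submit a statement and poll until its output is available
stmt = requests.post(session_url + "/statements", json={"code": "print(1 + 1)"}).json()
stmt_url = "{}/statements/{}".format(session_url, stmt["id"])
result = requests.get(stmt_url).json()
while result["state"] != "available":
    time.sleep(2)
    result = requests.get(stmt_url).json()
print(json.dumps(result["output"], indent=2))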

This is my configuration for the LivySessionController:

This is the sample code I am submitting under the ExecuteSparkInteractive properties:

import random
from pyspark import SparkConf, SparkContext
# create SparkContext using local mode
conf = SparkConf().setMaster("local").setAppName("SimpleETL")
sc = SparkContext.getOrCreate(conf)

NUM_SAMPLES = 100000

def sample(p):
  # draw a random point in the unit square; count hits inside the quarter circle
  x, y = random.random(), random.random()
  return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)

print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
Interestingly, when I enable the LivySessionController in NiFi, the Livy UI shows two new sessions: the first one created appears in the "idle" state, while the second one (the one with the larger session id) keeps showing the "starting" state even after multiple refreshes. Let's call them session ids 1 and 2 respectively. Interestingly, session id 2 changes state from "starting" to "shutting down" and then to "dead". As soon as it is dead, a new session (session id 3) is created with the state "starting", which subsequently becomes "idle". Below are log excerpts from these 3 sessions:

#Livy 1st session:
18/07/18 06:33:58 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
18/07/18 06:33:58 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-84-145.ec2.internal:4040
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Shutting down all executors
18/07/18 06:33:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/18 06:33:58 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Stopped
18/07/18 06:33:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/18 06:33:59 INFO MemoryStore: MemoryStore cleared
18/07/18 06:33:59 INFO BlockManager: BlockManager stopped
18/07/18 06:33:59 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/18 06:33:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/18 06:33:59 INFO SparkContext: Successfully stopped SparkContext

#Livy 2nd session:
18/07/18 06:34:30 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

#Livy 3rd session:
18/07/18 06:36:15 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
A few things here -

Livy Session Controller: make sure you see 2 sessions per node when the controller service is enabled, and both sessions must be in the running state on the Spark UI (but not executing anything until the Python code runs from NiFi). If you see unusual behavior, focus on fixing that first. Possible action: add a StandardSSLContextService controller, set up the keystore and truststore, and use the same in the LivySessionController (under the property: SSL Context Service). You can also verify the sessions from Livy's side, as sketched below.
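If you want to confirm what the controller's pooled sessions look like from Livy's side, you can query Livy's REST API directly. This is a minimal sketch; the host name is a placeholder and Livy's default port 8998 is assumed:

import requests

LIVY_URL = "http://<livy-host>:8998"  # placeholder host; 8998 is Livy's default port

# list every session Livy knows about; with the controller enabled you
# should see its pooled sessions here, all in "idle" state
for s in requests.get(LIVY_URL + "/sessions").json()["sessions"]:
    print("session {}: kind={} state={}".format(s["id"], s["kind"], s["state"]))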

In the Python code: I think you do not have to import SparkConf or SparkContext, nor create conf and sc. You only need to import SparkSession, like this:

from pyspark.sql import SparkSession

You can simply use spark (it is available as the Spark session variable by default), e.g. spark.sql("...sql statement...") or spark.sparkContext for sc.
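Putting that together, the Pi example from the question can be trimmed down for a Livy interactive session. This is a minimal sketch; it assumes Livy's pyspark session has already injected the usual spark and sc variables, so no SparkContext is created by hand:

import random

NUM_SAMPLES = 100000

def sample(_):
  # draw a random point in the unit square; count hits inside the quarter circle
  x, y = random.random(), random.random()
  return 1 if x*x + y*y < 1 else 0

# no SparkConf/SparkContext setup: the Livy pyspark session already provides sc
count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))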

The last thing you mentioned is that "the Livy session manager shows nothing, and there are no visible errors in the ExecuteSparkInteractive processor." For this, you can add some dummy processor like UpdateAttribute after the ExecuteSparkInteractive processor and keep it in disabled mode. Also, you have to direct the output of the ExecuteSparkInteractive processor to the UpdateAttribute in all 3 states (success, failure, wait). This way you will be able to see the outcome after the PySpark code runs within NiFi. Refer to the image below for a sample.
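If the processor output still shows nothing, you can also pull the statement results straight from Livy to see whether your code ran and what it printed or raised. This is a minimal sketch; the host and session id are placeholders:

import requests

LIVY_URL = "http://<livy-host>:8998"  # placeholder host
SESSION_ID = 1                        # placeholder: the session id NiFi is using

# every snippet submitted to a session becomes a "statement"; its output
# field carries the stdout or the error traceback that NiFi does not surface
url = "{}/sessions/{}/statements".format(LIVY_URL, SESSION_ID)
for st in requests.get(url).json()["statements"]:
    print("statement {}: state={}".format(st["id"], st["state"]))
    print(st.get("output"))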

I hope this helps you resolve your issue.

Please upvote the answer if you like it.


Do you see anything in the nifi-app.log file when you start the processor?

I have added log excerpts from NiFi and Livy.
#After starting the processor
2018-07-18 06:38:11,768 INFO [NiFi Web Server-112] o.a.n.c.s.StandardProcessScheduler Starting ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:38:11,770 INFO [Monitor Processore Lifecycle Thread-1] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run with 1 threads
2018-07-18 06:38:11,883 INFO [Flow Service Tasks Thread-1] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false
2018-07-18 06:38:57,106 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@12830e23 checkpointed with 0 Records and 0 Swap Files in 7 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 2 millis), max Transaction ID -1

#After stopping the processor
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.c.s.StandardProcessScheduler Stopping ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.controller.StandardProcessorNode Stopping processor: class org.apache.nifi.processors.livy.ExecuteSparkInteractive
2018-07-18 06:39:09,838 INFO [Timer-Driven Process Thread-9] o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run
2018-07-18 06:39:09,917 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false