Apache Spark: registering a Hive custom UDF with Spark (Spark SQL) 2.0.0
I am working with Spark 2.0.0, where my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in the SQL context for one of my queries. On my cluster I use it in Hive queries simply by defining it as a temporary function: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite simple. I tried registering it with the sparkSession in the same way, but got an error:
sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
Error:
CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.delayedEndpoint$com$mediamath$spark$attribution$sparkjob$SparkVideoCidJoin$1(SparkVideoCidJoin.scala:75)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$delayedInit$body.apply(SparkVideoCidJoin.scala:22)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.main(SparkVideoCidJoin.scala:22)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin.main(SparkVideoCidJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
You can register a UDF directly on the
SparkSession
, e.g. sparkSession.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)
. See the detailed documentation for Spark 2.0.
sparkSession.udf.register(...)
allows you to register a Java or Scala UDF (a function of type Long => Long, say), but not a Hive GenericUDF, which handles LongWritable instead of Long and can take a variable number of arguments.
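As a sketch, the plain Scala/Java registration path looks like this (assuming a SparkSession named spark; the function itself is purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("udf-register-example")
  .master("local[*]")
  .getOrCreate()

// Register an ordinary Scala function as a UDF. This works for plain
// Scala/Java functions only, not for Hive GenericUDF implementations.
spark.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)

// The registered name is then usable from SQL.
spark.sql("SELECT myUDF(1, 'row-') AS label").show()
```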
To register a Hive UDF, your first approach was correct:
sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
However, you must first enable Hive support:
SparkSession.builder().enableHiveSupport()
and make sure the spark-hive dependency is on your classpath.
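Putting both steps together, a minimal sketch might look like this (the spark-hive artifact version is an assumption; match it to your Spark distribution):

```scala
// build.sbt: ensure the spark-hive module is on the classpath
// (version is illustrative; use the one matching your cluster)
// libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-udf-example")
  .enableHiveSupport()  // required for CREATE TEMPORARY FUNCTION to work
  .getOrCreate()

// With Hive support enabled, Hive UDFs can be registered via SQL.
spark.sql("CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'")
```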
Explanation:
Your error message
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead
comes from the class SessionCatalog. When you call enableHiveSupport(), Spark replaces the SessionCatalog with a HiveSessionCatalog, which does implement the makeFunctionBuilder method.
Finally:
The UDF 'com.facebook.hive.udf.UDFNumberRows' you want to use was written at a time when Hive had no window functions.
I recommend using window functions instead; you can look them up in the Hive and Spark SQL documentation.
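For instance, a row-numbering UDF like UDFNumberRows can usually be replaced by the built-in row_number() window function. A hypothetical sketch (the table and column names are assumptions, not from the question):

```scala
// Hypothetical replacement for UDFNumberRows using Spark SQL's
// built-in window functions: number rows per user_id by event_time.
spark.sql("""
  SELECT user_id,
         event_time,
         row_number() OVER (PARTITION BY user_id ORDER BY event_time) AS rn
  FROM events
""").show()
```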
The problem you are facing is that Spark is not loading the jar library on its classpath. In our team we load external libraries with the --jars option:
/usr/bin/spark-submit --jars external_library.jar our_program.py --our_params
You can check in the Spark History Server's Environment tab whether the external library is being loaded (look for spark.yarn.secondary.jars).
Then you will be able to register your UDF as you described, once you enable Hive support as FurryMachine said:
sparkSession.sql("""
CREATE TEMPORARY FUNCTION myFunc AS
'com.facebook.hive.udf.UDFNumberRows'
""")
You can find more information in the spark-submit help:
hadoop:~/projects/neocortex/src$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
Sorry, this is not an answer to the question; it should just have been a comment. @T.Gawęda Thanks for pointing out that my answer wasn't clear enough. I took the time to rewrite it more clearly.