Apache Spark: registering a Hive custom UDF with Spark (Spark SQL) 2.0.0


I am working with Spark 2.0.0, and my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in the SQL context in one of my queries. In my cluster, with Hive queries, I use it simply as a temporary function by defining: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite straightforward.

I tried registering it with the sparkSession as below, but got an error:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
Error:

CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
    at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.delayedEndpoint$com$mediamath$spark$attribution$sparkjob$SparkVideoCidJoin$1(SparkVideoCidJoin.scala:75)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$delayedInit$body.apply(SparkVideoCidJoin.scala:22)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.main(SparkVideoCidJoin.scala:22)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin.main(SparkVideoCidJoin.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)

You can register a UDF directly with the SparkSession, e.g.
sparkSession.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)
See the detailed documentation for Spark 2.0.
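For reference, a minimal sketch of this native-UDF route (the function name, table, and column names below are placeholders, not from the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("scala-udf-example")
  .getOrCreate()

// Register an ordinary Scala function as a SQL function.
spark.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)

// The registered function can then be used from SQL.
spark.sql("SELECT myUDF(id, name) AS labeled FROM my_table").show()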

sparkSession.udf.register(...) 
allows you to register Java or Scala UDFs (functions of type Long => Long, for example), but not Hive GenericUDFs, which handle LongWritable instead of Long and can take a variable number of arguments.

To register a Hive UDF, your first approach was correct:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
However, you must enable Hive support first:

SparkSession.builder().enableHiveSupport()
and make sure the "spark-hive" dependency is present on your classpath.
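Putting the two together, a minimal sketch (assuming the jar containing com.facebook.hive.udf.UDFNumberRows is on the classpath and spark-hive is declared as a dependency, e.g. "org.apache.spark" %% "spark-hive" % "2.0.0" in sbt):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("hive-udf-example")
  .enableHiveSupport()  // required so CREATE TEMPORARY FUNCTION is handled by the Hive catalog
  .getOrCreate()

sparkSession.sql("CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'")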

Explanation:

Your error message

java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead
comes from the class SessionCatalog.
By calling SparkSession.builder().enableHiveSupport(), Spark will replace the SessionCatalog with a HiveSessionCatalog, which implements this method.

Finally:

The UDF you want to use, 'com.facebook.hive.udf.UDFNumberRows', was written at a time when Hive did not have window functions.
I suggest you use window functions instead; see the Hive and Spark documentation on windowing functions.
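As an illustration only, here is a row-numbering query written with Spark SQL's built-in ROW_NUMBER window function instead of the old UDF (the table and column names events, user_id, event_time are hypothetical):

sparkSession.sql("""
  SELECT user_id,
         event_time,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS row_seq
  FROM events
""").show()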

The problem you are facing is that Spark is not loading the jar library onto its classpath.

In our team, we load external libraries with the --jars option:

/usr/bin/spark-submit  --jars external_library.jar our_program.py --our_params 
You can check whether the external library is actually being loaded in the Environment tab of the Spark History Server (spark.yarn.secondary.jars).
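As an alternative sketch, if you prefer setting it in code rather than on the command line, the same jar can be passed through the spark.jars configuration when building the session (the path below is a placeholder):

val sparkSession = SparkSession.builder()
  .config("spark.jars", "/path/to/external_library.jar")  // comma-separated list of jars for driver and executors
  .enableHiveSupport()
  .getOrCreate()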

Then you will be able to register your UDF as you described, once you enable Hive support as FurryMachine said:

sparkSession.sql("""
    CREATE TEMPORARY FUNCTION myFunc AS  
    'com.facebook.hive.udf.UDFNumberRows'
""")
You can find more information in the spark-submit help:

hadoop:~/projects/neocortex/src$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]   
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.

Sorry, this is not an answer to the question and should have been just a comment. @T.Gawęda Thanks for pointing out that my answer wasn't clear enough; I took the time to rewrite it more clearly.