Apache Spark: registering a Hive custom UDF with Spark (Spark SQL) 2.0.0


I am working with Spark 2.0.0, and my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in the SQL context in one of my queries. In my cluster, with Hive queries, I use it simply as a temporary function by defining: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite straightforward.

I tried registering it with the sparkSession as below, but got an error:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
Error:

CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
    at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.delayedEndpoint$com$mediamath$spark$attribution$sparkjob$SparkVideoCidJoin$1(SparkVideoCidJoin.scala:75)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$delayedInit$body.apply(SparkVideoCidJoin.scala:22)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.main(SparkVideoCidJoin.scala:22)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin.main(SparkVideoCidJoin.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)

You can register a UDF directly with the SparkSession, e.g.
sparkSession.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)
See the detailed documentation for Spark 2.0.
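For reference, a minimal sketch of this native-UDF route (the function name, table, and column names below are placeholders, not from the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("scala-udf-example")
  .getOrCreate()

// Register an ordinary Scala function as a SQL function.
spark.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)

// The registered function can then be used from SQL.
spark.sql("SELECT myUDF(id, name) AS labeled FROM my_table").show()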

sparkSession.udf.register(...) 
allows you to register Java or Scala UDFs (functions of type Long => Long, for example), but not Hive GenericUDFs, which handle LongWritable instead of Long and can take a variable number of arguments.

To register a Hive UDF, your first approach was correct:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
However, you must enable Hive support first:

SparkSession.builder().enableHiveSupport()
and make sure the "spark-hive" dependency is present on your classpath.
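Putting the two together, a minimal sketch (assuming the jar containing com.facebook.hive.udf.UDFNumberRows is on the classpath and spark-hive is declared as a dependency, e.g. "org.apache.spark" %% "spark-hive" % "2.0.0" in sbt):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("hive-udf-example")
  .enableHiveSupport()  // required so CREATE TEMPORARY FUNCTION is handled by the Hive catalog
  .getOrCreate()

sparkSession.sql("CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'")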

Explanation:

Your error message

java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead
comes from the class SessionCatalog.
By calling SparkSession.builder().enableHiveSupport(), Spark will replace the SessionCatalog with a HiveSessionCatalog, which implements this method.

Finally:

The UDF you want to use, 'com.facebook.hive.udf.UDFNumberRows', was written at a time when Hive did not have window functions.
I suggest you use window functions instead; see the Hive and Spark documentation on windowing functions.
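As an illustration only, here is a row-numbering query written with Spark SQL's built-in ROW_NUMBER window function instead of the old UDF (the table and column names events, user_id, event_time are hypothetical):

sparkSession.sql("""
  SELECT user_id,
         event_time,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS row_seq
  FROM events
""").show()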

The problem you are facing is that Spark is not loading the jar library onto its classpath.

In our team, we load external libraries with the --jars option:

/usr/bin/spark-submit  --jars external_library.jar our_program.py --our_params 
You can check whether the external library is actually being loaded in the Environment tab of the Spark History Server (spark.yarn.secondary.jars).
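As an alternative sketch, if you prefer setting it in code rather than on the command line, the same jar can be passed through the spark.jars configuration when building the session (the path below is a placeholder):

val sparkSession = SparkSession.builder()
  .config("spark.jars", "/path/to/external_library.jar")  // comma-separated list of jars for driver and executors
  .enableHiveSupport()
  .getOrCreate()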

Then you will be able to register your UDF as you described, once you enable Hive support as FurryMachine said:

sparkSession.sql("""
    CREATE TEMPORARY FUNCTION myFunc AS  
    'com.facebook.hive.udf.UDFNumberRows'
""")
You can find more information in the spark-submit help:

hadoop:~/projects/neocortex/src$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]   
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.

Sorry, this is not an answer to the question and should have been just a comment. @T.Gawęda Thanks for pointing out that my answer wasn't clear enough; I took the time to rewrite it more clearly.