
Apache Spark: using a broadcast DataFrame inside a PySpark UDF


Is it possible to use a broadcast DataFrame inside a UDF in a PySpark SQL application?

My code broadcasts the collected DataFrame and calls it from inside a UDF, as follows:

fact_ent_df_data = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    fact_ent_df_count = fact_ent_df_data.select(
        fact_ent_df_br.TheDate.between(col1, col2),
        fact_ent_df_br.Ent.isin('col3')).count()
    return fact_ent_df_count

sparkSession.udf.register("generate_lookup_code", generate_lookup_code)
sparkSession.sql('select sample4, generate_lookup_code(sample1, sample2, sample3) as count_hol from table_t')
When I use the broadcast df_bc directly, I get a "local variable referenced before assignment" error. Thanks for any help. The error I get is:

Traceback (most recent call last):
  File "C:/Users/Vignesh/PycharmProjects/gettingstarted/aramex_transit/spark_driver.py", line 46, in <module>
    sparkSession.udf.register("generate_lookup_code" , generate_lookup_code )
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 323, in register
    self.sparkSession._jsparkSession.udf().registerPython(name, register_udf._judf)
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 148, in _judf
    self._judf_placeholder = self._create_judf()
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 157, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 33, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\rdd.py", line 2391, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\serializers.py", line 575, in dumps
    return cloudpickle.dumps(obj, 2)
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 918, in dumps
    cp.dump(obj)
  File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 249, in dump
    raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o24.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
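The traceback shows cloudpickle failing while serializing the registered function: the UDF body references DataFrame objects, and py4j-backed objects such as DataFrames cannot be pickled into the UDF's closure. Once the broadcast variable holds plain Python data (which `collect()` already returns), the lookup needs no DataFrame API at all. Below is a minimal sketch of that counting logic with hypothetical tuple data and string dates; the column names `TheDate` and `Ent` come from the question, everything else is illustrative:

```python
# Plain-Python version of the lookup: count rows whose TheDate falls in
# [start, end] and whose Ent equals ent. This is what the UDF body can do
# once the broadcast variable holds ordinary tuples instead of a DataFrame.
def count_matching(rows, start, end, ent):
    return sum(1 for the_date, row_ent in rows
               if start <= the_date <= end and row_ent == ent)

# Hypothetical data standing in for fact_ent_df.collect():
rows = [("2019-01-02", "A"), ("2019-01-05", "B"), ("2019-01-07", "A")]
print(count_matching(rows, "2019-01-01", "2019-01-06", "A"))  # 1
```

In Spark, `rows` would be the `.value` of the broadcast variable, and `count_matching` the body of the registered UDF.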

Think of a Spark broadcast variable as holding a simple Python data type, such as a list; the question is then how to pass that variable into the UDF function. Here is an example. Suppose we have a list of ages l and a DataFrame with columns name and age, and we want to check whether each person's age is in the ages list:

from pyspark.sql.functions import udf, col

l = [13, 21, 34]                  # ages list
d = [('Alice', 10), ('bob', 21)]  # data frame rows

rdd = sc.parallelize(l)
b_rdd = sc.broadcast(rdd.collect())  # define broadcast variable
df = spark.createDataFrame(d, ["name", "age"])

def check_age(age, age_list):
    if age in age_list:
        return "true"
    return "false"

def udf_check_age(age_list):
    return udf(lambda x: check_age(x, age_list))

df.withColumn("is_age_in_list", udf_check_age(b_rdd.value)(col("age"))).show()
Output:

+-----+---+--------------+
| name|age|is_age_in_list|
+-----+---+--------------+
|Alice| 10|         false|
|  bob| 21|          true|
+-----+---+--------------+
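Because the broadcast value is an ordinary Python list, the wrapper pattern above can be exercised without Spark at all. A quick plain-Python sketch of the same closure (names taken from the answer above):

```python
# Pure-Python version of the udf_check_age wrapper: the factory closes over
# the (broadcast) age list; the inner function receives one column value.
def make_check_age(age_list):
    return lambda age: "true" if age in age_list else "false"

check = make_check_age([13, 21, 34])
print(check(10))  # false
print(check(21))  # true
```

The point of the factory is that `age_list` is bound once, on the driver, while the returned function only ever receives per-row column values on the executors.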


Just to contribute a simple example based on Soheil's answer:

from pyspark.sql.functions import udf, col

def check_age(_age):
    return _age > 18

dict_source = {"alice": 10, "bob": 21}
broadcast_dict = sc.broadcast(dict_source)  # define broadcast variable
rdd = sc.parallelize(list(dict_source.keys()))
result = rdd.map(
    lambda _name: check_age(broadcast_dict.value.get(_name))  # note `.value` on the broadcast var here
)
print(result.collect())


Thanks. I want to perform a multi-column filter on a broadcast DataFrame from inside the UDF. Each row's values need to be used to filter and get a count, which is returned after the UDF executes. In your example, can you explain what happens in udf_check_age(b_rdd.value)(col("age"))? Is col("age") passed as the age list? I need to pass values from 3 columns, in which case it would have to be df.withColumn('count_age', udf_check_age(b_rdd.value)(col('age'), col('age2'))). Can you suggest?

So, you want to pass the values of 3 columns, together with a broadcast variable, to a function and get one output? For example:
def func(col1, col2, col3, b_rdd)
?

The error is mainly because a DataFrame object is used inside the UDF. pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o24.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist

Add the code to the question; it is confusing in the comments!

My problem is that I don't have a list, I have a DataFrame. When I pass the DataFrame, I run into pickling issues.
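Following up on the func(col1, col2, col3, b_rdd) idea from the comments: bind the broadcast value when building the function, so the function itself only takes the three column values. A plain-Python sketch (data and names are hypothetical); in Spark the inner function would be wrapped with udf(...) and invoked as lookup_udf(col('a'), col('b'), col('c')):

```python
# Sketch of the three-column lookup discussed in the comments: the broadcast
# value is bound at definition time; the returned function takes only the
# per-row column values. Data and names are illustrative.
def make_lookup(broadcast_rows):
    def lookup(col1, col2, col3):
        # Count broadcast rows whose date lies in [col1, col2] and whose
        # entity matches col3.
        return sum(1 for d, ent in broadcast_rows
                   if col1 <= d <= col2 and ent == col3)
    return lookup

rows = [("2019-01-02", "X"), ("2019-01-04", "X"), ("2019-01-09", "Y")]
lookup = make_lookup(rows)
print(lookup("2019-01-01", "2019-01-05", "X"))  # 2
```

The broadcast variable is never an argument of the UDF; only plain column values cross the driver/executor boundary, which avoids the pickling error above.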