Python: how to fix "AttributeError: 'RDD' object has no attribute '_get_object_id'" when using a UDF?


I have the following code:

from pyspark.sql.functions import lit
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

def aa(a, b):
    if (a == 1):
        return 3
    else:
        return 6

example_dataframe = sqlContext.createDataFrame([(1, 1), (2, 2)], ['a', 'b'])
example_dataframe.show()
af = UserDefinedFunction(lambda (line_a, line_b): aa(line_a, line_b), StringType())
a = af(example_dataframe.rdd)
print(a)
example_dataframe.withColumn('c',lit(a))
example_dataframe.show()
I want to generate a new column based on a condition over other attributes. I know the condition can be specified with a `withColumn` clause, but I wanted to try doing it with a UDF.
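Before wiring the function into Spark, the row-wise logic the UDF is meant to wrap can be checked in plain Python (using `aa` exactly as defined above):

```python
# Row-wise logic the UDF wraps, checked in plain Python first.
def aa(a, b):
    if a == 1:
        return 3
    else:
        return 6

# Applying it per row of the sample data [(1, 1), (2, 2)] gives the
# values expected in the new column.
rows = [(1, 1), (2, 2)]
result = [aa(a, b) for a, b in rows]
print(result)  # [3, 6]
```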

I get the following error:

Traceback (most recent call last):
File "/var/folders/vs/lk870p4x449gmqrtyz9hdry40000gn/T/zeppelin_pyspark-2901893392381883952.py", line 349, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/var/folders/vs/lk870p4x449gmqrtyz9hdry40000gn/T/zeppelin_pyspark-2901893392381883952.py", line 337, in <module>
exec(code)
File "<stdin>", line 9, in <module>
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/functions.py", line 1848, in __call__
jc = self._judf.apply(_to_seq(sc, cols, _to_java_column))
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/column.py", line 59, in _to_seq
cols = [converter(c) for c in cols]
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/column.py", line 47, in _to_java_column
jcol = _create_column_from_name(col)
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/column.py", line 40, in _create_column_from_name
return sc._jvm.functions.col(name)
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
args_command, temp_args = self._build_args(*args)
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1094, in _build_args
[get_command_part(arg, self.pool) for arg in new_args])
File "/Users/javier/Downloads/Apache_ZEPPELIN/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/py4j-0.10.4-src.zip/py4j/protocol.py", line 289, in get_command_part
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'RDD' object has no attribute '_get_object_id'

How can I pass the attribute values to the UDF?

You have to pass the DataFrame columns, not the DataFrame (or its RDD) itself:

>>> from pyspark.sql.types import *
>>> example_dataframe.show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  2|  2|
+---+---+
>>> af = UserDefinedFunction(lambda line_a, line_b : aa(line_a, line_b), StringType())
>>> example_dataframe.withColumn('c', af(example_dataframe['a'], example_dataframe['b'])).show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  3|
|  2|  2|  6|
+---+---+---+
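As a side note, the original `lambda (line_a, line_b): ...` uses Python 2 tuple-parameter unpacking, which is a syntax error in Python 3. A minimal sketch of the Python-3-safe form (a plain two-argument lambda, matching the one in the answer above):

```python
# Python 3 form: a two-argument lambda instead of tuple unpacking.
def aa(a, b):
    return 3 if a == 1 else 6

# Equivalent of the callable handed to
# UserDefinedFunction(lambda line_a, line_b: aa(line_a, line_b), ...)
f = lambda line_a, line_b: aa(line_a, line_b)
print(f(1, 1), f(2, 2))  # 3 6
```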

Thanks, great answer!