Unexpected error with pandas_udf in PySpark 3.0.0 (Python 3.x)
Tags: python-3.x, pyspark, apache-spark-sql

I followed the example, but it errors out. My code is as follows:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.pandas.functions import pandas_udf


class SparkBase(object):
    def __init__(self, master="local[*]", app_name="SparkBase"):
        _conf = SparkConf().setMaster(master).setAppName(app_name)
        _conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
        _conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", True)
        self.sc = SparkContext().getOrCreate(conf=_conf)
        self.spark = SparkSession.builder.config(conf=_conf).enableHiveSupport().getOrCreate()


@pandas_udf("col1 string, col2 long")
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    s3["col2"] = s1 + s2.str.len()
    return s3


if __name__ == "__main__":
    spark_base = SparkBase()
    df = spark_base.spark.createDataFrame([[1, "a string", ("a nested string",)]],
                                          "long_c long, str_c string, struct_c struct<col1: string>")
    df.show()
The error:
Traceback (most recent call last):
File "F:/otherproj/localpyspark/pyspark3/sparkbase.py", line 24, in <module>
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\pandas\functions.py", line 426, in _create_pandas_udf
return _create_udf(f, returnType, evalType)
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\udf.py", line 43, in _create_udf
return udf_obj._wrapped()
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\udf.py", line 204, in _wrapped
wrapper.returnType = self.returnType
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\udf.py", line 94, in returnType
self._returnType_placeholder = _parse_datatype_string(self._returnType)
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\types.py", line 822, in _parse_datatype_string
raise e
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\types.py", line 812, in _parse_datatype_string
return from_ddl_schema(s)
File "D:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\types.py", line 804, in from_ddl_schema
sc._jvm.org.apache.spark.sql.types.StructType.fromDDL(type_str).json())
AttributeError: 'NoneType' object has no attribute '_jvm'
If I comment out the func function, it runs successfully. Where is the problem? Is this a bug in Spark 3.0.0?