Python ML function as a PySpark UDF


I am somewhat new to pyspark and python. I am trying to run an ML function as a pyspark UDF.

Here is an example:

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

df = spark.createDataFrame(['Bob has a dog. He loves him'], StringType())

def parse(text):
    import spacy
    import neuralcoref
    nlp = spacy.load('en_core_web_sm')
    # Let's try before using the conversion dictionary:
    neuralcoref.add_to_pipe(nlp)
    doc = nlp(text)
    return doc._.coref_resolved

pd_udf = pandas_udf(parse, returnType=StringType())

df.select(pd_udf(col("value"))).show()
I get this error:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/user/tools/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/home/user/tools/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/user/tools/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "<string>", line 1, in <lambda>
  File "/home/user/tools/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 101, in <lambda>
    return lambda *a: (verify_result_length(*a), arrow_return_type)
  File "/home/user/tools/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 92, in verify_result_length
    result = f(*a)
  File "/home/user/tools/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 7, in parse
  File "/home/user/anaconda3/lib/python3.7/site-packages/spacy/language.py", line 377, in __call__
    doc = self.make_doc(text)
  File "/home/user/anaconda3/lib/python3.7/site-packages/spacy/language.py", line 401, in make_doc
    return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got Series)

Is it possible to run this code on PySpark?
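
From the traceback, the likely cause is that a scalar pandas UDF in Spark 2.4 passes a pandas.Series (one value per row) to the function, not a single string, so spaCy receives a Series where it expects a str. A minimal sketch of how the UDF might instead map over the series (reusing the `df` from the example above; the name `parse_series` is illustrative):

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

def parse_series(texts: pd.Series) -> pd.Series:
    # Imports and model loading happen on the executor.
    import spacy
    import neuralcoref
    nlp = spacy.load('en_core_web_sm')
    neuralcoref.add_to_pipe(nlp)
    # Apply the pipeline to each string in the series and
    # return a series of resolved texts, one per input row.
    return texts.apply(lambda text: nlp(text)._.coref_resolved)

pd_udf = pandas_udf(parse_series, returnType=StringType())

df.select(pd_udf(col("value"))).show()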