Python: passing a list as an argument to a UDF


I am using a text-preprocessing library, and I want to pass preprocess_functions as an argument to the preprocess_text method.

Here is the example I am working with:

def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text'
                         ) -> SparkDataFrame:
    """Preprocess text in a column of a PySpark DataFrame by leveraging a PySpark UDF to preprocess text in parallel."""
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation,
                            remove_special_character, normalize_unicode, remove_number,
                            remove_whitespace, remove_stopword, lemmatize_word,
                            stem_word, check_spelling]
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name,
                           _preprocess_text(df[target_column], preprocess_functions))
    return new_df
This is the error I get:

TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I tried converting preprocess_functions into an array, but that did not work.

How can I solve this?

A Spark UDF cannot take functions as input; it only accepts columns or strings representing column names. Take a look at the samples here.
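One way around this is to capture the list of preprocessing functions in a closure when the UDF is created, so the UDF itself only ever receives the string column. The following is a minimal sketch, assuming the preprocessing helpers from the question (to_lower, remove_email, remove_url, etc.) are importable; apply_preprocess_functions is a hypothetical helper added here for illustration.

from pyspark.sql import DataFrame as SparkDataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def apply_preprocess_functions(text: str, functions) -> str:
    # Hypothetical helper: run each preprocessing function over the text in order.
    for fn in functions:
        text = fn(text)
    return text

def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text'
                         ) -> SparkDataFrame:
    preprocess_functions = [to_lower, remove_email, remove_url]  # etc.
    # The list is captured by the lambda's closure, so the UDF only
    # takes a single string column as its argument.
    _preprocess_text = udf(
        lambda text: apply_preprocess_functions(text, preprocess_functions),
        StringType())
    return df.withColumn(preprocessed_column_name,
                         _preprocess_text(df[target_column]))

functools.partial would work instead of the lambda as well; either way the functions are baked into the UDF at definition time and never passed as a column.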