Why does PySpark fail when running a pandas_udf?
Tags: pandas, apache-spark, pyspark, pyarrow
I get this error when running a pandas UDF in PySpark. The UDF uses an external library, textdistance (its code, algoritmos_comparacion, is listed at the end of this post). I then register the function:
algoritmos_comparacion_udf = f.pandas_udf(algoritmos_comparacion, StringType())
Finally, I use this UDF:
df.withColumn("hamming", algoritmos_comparacion_udf(f.col("num_serie_exp"), f.col("num_serie_rec")))
I have installed pandas and pyarrow version 0.8.0. I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/pyspark.zip/pyspark/worker.py", line 235, in main
process()
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/pyspark.zip/pyspark/worker.py", line 230, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/pyspark.zip/pyspark/serializers.py", line 267, in dump_stream
for series in iterator:
File "<string>", line 1, in <lambda>
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/pyspark.zip/pyspark/worker.py", line 92, in <lambda>
return lambda *a: (verify_result_length(*a), arrow_return_type)
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/pyspark.zip/pyspark/worker.py", line 83, in verify_result_length
result = f(*a)
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/pyspark.zip/pyspark/util.py", line 55, in wrapper
return f(*args, **kwargs)
File "/home/bguser/SII-IVA/jobs/caso3/caso3.py", line 39, in algoritmos_comparacion
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/virtualenv_application_1563894657824_0447_0/lib/python3.6/site-packages/textdistance/algorithms/edit_based.py", line 49, in __call__
result = self.quick_answer(*sequences)
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/virtualenv_application_1563894657824_0447_0/lib/python3.6/site-packages/textdistance/algorithms/base.py", line 91, in quick_answer
if self._ident(*sequences):
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/virtualenv_application_1563894657824_0447_0/lib/python3.6/site-packages/textdistance/algorithms/base.py", line 110, in _ident
if e1 != e2:
File "/DATOS/var/log/hadoop/yarn/local/usercache/bguser/appcache/application_1563894657824_0447/container_e66_1563894657824_0447_01_000002/virtualenv_application_1563894657824_0447_0/lib/python3.6/site-packages/pandas/core/generic.py", line 1556, in __nonzero__
self.__class__.__name__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
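A likely reading of the traceback: a scalar pandas UDF is called with whole pandas.Series objects (one per column batch), not with individual strings, so textdistance ends up evaluating `e1 != e2` on Series and pandas refuses to reduce the result to a single boolean. The sketch below reproduces that with pandas alone and shows a row-by-row wrapper. `hamming` is a hypothetical plain-Python stand-in for `textdistance.hamming`, and `hamming_series` is an illustrative name; neither comes from the original post.

```python
import pandas as pd


def hamming(a: str, b: str) -> int:
    """Hypothetical stand-in for textdistance.hamming (assumption, not the library)."""
    # Count differing positions, plus any length difference.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def hamming_series(s1: pd.Series, s2: pd.Series) -> pd.Series:
    """Apply the scalar function to each pair of values and return a Series of strings."""
    return pd.Series([str(hamming(a, b)) for a, b in zip(s1, s2)], index=s1.index)


exp = pd.Series(["ABC123", "XYZ789"])
rec = pd.Series(["ABC124", "XYZ789"])

# Comparing two Series gives a Series of booleans; using it as one boolean fails.
try:
    bool(exp != rec)
    error_message = ""
except ValueError as err:
    error_message = str(err)

# Working row by row avoids the problem.
result = hamming_series(exp, rec)
```

A function shaped like `hamming_series` (Series in, Series of the same length out) is what a scalar pandas UDF with a `StringType()` return type expects.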
Thanks. This solved the problem:
algoritmos_comparacion_udf=f.pandas_udf(lambda s: s.apply(algoritmos_comparacion),MapType(StringType(),StringType()))
dataframe.withColumn("algorithms", algoritmos_comparacion_udf(f.col("a"), f.col("b")))
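Comments:
Are you passing iterable objects through col("num_serie_exp") and col("num_serie_rec")?
Possible duplicate?
@mazaneicha It is not a duplicate, because I am not using any boolean operators.
Could you create an MVCE? It would be easier to help you in that case.
@johnckane Yes: use apply with a Python function inside a pandas UDF built from a lambda. func_udf=f.pandas_udf(lambda s:s.apply(func))
It looks like MapType is no longer supported. I can do the computation with a regular Spark SQL udf instead of a pandas_udf. Of course, you have to use other types with pandas_udf.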
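The regular (row-at-a-time) UDF mentioned in the last comment could look roughly like the sketch below. It assumes the function receives one pair of strings per call, which is how a plain `f.udf` invokes it. `comparar` and `build_udf` are hypothetical names, and the Spark imports are deferred so the pure-Python part can be tried without a cluster.

```python
def comparar(a, b, algoritmos):
    """Run every algorithm on one pair of values; results are stored as strings."""
    data = {}
    for name, alg in algoritmos.items():
        try:
            data[name] = str(alg(a, b))
        except Exception:
            # Keep going if a single algorithm fails on this pair.
            data[name] = "ERROR"
    return data


def build_udf(algoritmos):
    """Wrap comparar in a regular Spark SQL UDF returning map<string,string>."""
    # Imported lazily: only needed when Spark is actually available.
    import pyspark.sql.functions as f
    from pyspark.sql.types import MapType, StringType

    return f.udf(
        lambda a, b: comparar(a, b, algoritmos),
        MapType(StringType(), StringType()),
    )


# Pure-Python check with toy "algorithms" (assumptions, not textdistance).
demo = comparar("abc", "abd", {"eq": lambda x, y: x == y, "bad": lambda x, y: 1 / 0})
```

With Spark and textdistance installed, usage would presumably be along the lines of `df.withColumn("algorithms", build_udf({"hamming": textdistance.hamming})(f.col("num_serie_exp"), f.col("num_serie_rec")))`.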
import textdistance
import pyspark.sql.functions as f
from pyspark.sql.types import MapType, StringType

def algoritmos_comparacion(num_serie_rec, num_serie_exp):
    # Collect the result of every distance algorithm, keyed by name.
    data = {}
    algoritmos = {
        "hamming": textdistance.hamming,
        "levenshtein": textdistance.levenshtein,
        "damerau_levenshtein": textdistance.damerau_levenshtein,
        "jaro": textdistance.jaro,
        "mlipns": textdistance.mlipns,
        "strcmp95": textdistance.strcmp95,
        "needleman_wunsch": textdistance.needleman_wunsch,
        "gotoh": textdistance.gotoh,
        "smith_waterman": textdistance.smith_waterman
    }
    for name, alg in algoritmos.items():
        try:
            # Store each result as a string so all map values share one type.
            data[name] = str(alg(num_serie_rec, num_serie_exp))
        except Exception:
            data[name] = "ERROR"
    return data
algoritmos_comparacion_udf=f.pandas_udf(algoritmos_comparacion,MapType(StringType(),StringType()))
dataframe.withColumn("algorithms", algoritmos_comparacion_udf(f.col("a"), f.col("b")))