Using Python modules with PySpark on YARN


I am using the pyspark shell in client mode. In order to share some Python modules, I created an archive with conda-pack, but I am running into some problems.

I launch:

pyspark --packages org.apache.spark:spark-avro_2.12:3.0.1 --archives /tmp/testST.tar.gz
where /tmp/testST.tar.gz is a conda environment, created with:

conda pack -n testSATE -o /tmp/testST.tar.gz
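
For comparison, the pattern in the Spark documentation for shipping a conda-pack environment to YARN gives the archive an alias and points the Python interpreters at it. A minimal sketch of that pattern for client mode (the alias name environment is arbitrary):

export PYSPARK_DRIVER_PYTHON=python              # client mode only: driver keeps the local Python
export PYSPARK_PYTHON=./environment/bin/python   # executors use the Python unpacked from the archive
pyspark --packages org.apache.spark:spark-avro_2.12:3.0.1 \
        --archives /tmp/testST.tar.gz#environment

As far as I understand, in client mode the archive is only unpacked inside the YARN containers, so driver-side imports still rely on what is installed locally.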
When I import the pyarrow module (which is inside the archive), I get:

>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyarrow'
After that, I decided to install the module with conda directly on the machine I am working on, and launched pyspark as:

pyspark --packages org.apache.spark:spark-avro_2.12:3.0.1 --conf spark.yarn.dist.archives=/tmp/testST.tar.gz
pyspark --packages org.apache.spark:spark-avro_2.12:3.0.1
This time the pyarrow module is found correctly, but when I execute:

>>> outputDF = sateFloatDF.withColumn("prediction", loaded_model(sateFloatDF.select("_c0","_c1","_c2")))
I receive:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark3/python/pyspark/sql/udf.py", line 197, in wrapper
    return self(*args)
  File "/opt/spark3/python/pyspark/sql/udf.py", line 177, in __call__
    return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
  File "/opt/spark3/python/pyspark/sql/column.py", line 68, in _to_seq
    cols = [converter(c) for c in cols]
  File "/opt/spark3/python/pyspark/sql/column.py", line 68, in <listcomp>
    cols = [converter(c) for c in cols]
  File "/opt/spark3/python/pyspark/sql/column.py", line 56, in _to_java_column
    "function.".format(col, type(col)))
TypeError: Invalid argument, not a string or column: DataFrame[_c0: float, _c1: float, _c2: float] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
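The TypeError means the UDF wrapper is being given a whole DataFrame where it expects column arguments (or column names). A minimal sketch of the two call styles the message hints at, assuming loaded_model is a UDF that takes three float columns:

from pyspark.sql import functions as F

# pass the three columns individually ...
outputDF = sateFloatDF.withColumn("prediction", loaded_model("_c0", "_c1", "_c2"))

# ... or, if the UDF expects a single struct argument, wrap them with struct
outputDF = sateFloatDF.withColumn("prediction", loaded_model(F.struct("_c0", "_c1", "_c2")))

Passing the columns individually is what I tried next: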
>>> outputDF = sateFloatDF.withColumn("prediction", loaded_model(sateFloatDF._c0,sateFloatDF._c1,sateFloatDF._c2))
>>> outputDF.show()
21/04/19 10:39:07 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, gstp-slave-60-01.altecspace.it, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark3/python/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark3/python/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/opt/spark3/python/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark3/python/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark3/python/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/opt/spark3/python/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/opt/spark3/python/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
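The worker-side traceback stops inside cloudpickle's subimport, i.e. while the executor re-imports the modules the UDF depends on, which makes me think the executors are not using the Python environment from the archive. A quick hypothetical probe I can run from the same shell to check which interpreter the executors use and whether pyarrow is importable there:

def probe(_):
    # runs on an executor: report its interpreter and whether pyarrow can be imported
    import sys
    try:
        import pyarrow
        return (sys.executable, pyarrow.__version__)
    except ImportError as exc:
        return (sys.executable, str(exc))

sc.parallelize([0], 1).map(probe).collect()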
I still don't have a clear picture of the PySpark execution flow.

Can you help me?

Thanks