Python: import error when using a PySpark UDF
I am trying to run a Spark application with spark-submit. I created the following UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from tldextract import tldextract
@udf(StringType())
def get_domain(url):
    ext = tldextract.extract(url)
    return ext.domain
Then I use it like this:
df = df.withColumn('domain', col=get_domain(df['url']))
and get the following error:
Driver stacktrace:
21/01/03 16:53:41 INFO DAGScheduler: Job 1 failed: showString at NativeMethodAccessorImpl.java:0, took 2.842401 s
Traceback (most recent call last):
File "/home/michal/dv-etl/main.py", line 54, in <module>
main()
File "/home/michal/dv-etl/main.py", line 48, in main
df.show(truncate=False)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 442, in show
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 589, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
File "/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'tldextract'
Thanks, everyone!

Comments:
- Does this answer your question?
- Did you set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON?
- Could you try moving `from tldextract import tldextract` inside get_domain(url)?
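The last comment's suggestion, moving the import into the function body so the module is resolved on the Spark worker at call time rather than being pickled from the driver, can be sketched as below. Since tldextract may not be installed in every environment, this sketch uses `urllib.parse` from the standard library as a stand-in; in the real UDF the deferred line would be `from tldextract import tldextract`, and the function would keep its `@udf(StringType())` decorator exactly as before.

```python
def get_domain(url):
    # Deferred import: executed on the worker when the UDF runs.
    # (Python caches modules, so repeated imports are cheap.)
    from urllib.parse import urlparse  # stand-in for tldextract here
    return urlparse(url).netloc

# In the Spark job this would be wrapped the same way as before:
#   @udf(StringType())
#   def get_domain(url): ...
```

This only helps if the module is actually installed on (or shipped to) the workers; it changes *where* the import happens, not whether the package is available there.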
My spark-submit command:

spark-submit --master spark://spark-server:7077 main.py --py-files dist/app-0.0.1-py3.8.egg requirements.zip
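If that is the full command, a likely cause of the ModuleNotFoundError is argument ordering: spark-submit treats everything after the application file (main.py) as arguments to the application itself, so `--py-files` is never seen by Spark and the dependency archives are not shipped to the executors. Note also that `--py-files` takes a comma-separated list. A corrected invocation, reusing the paths from the command above, might look like:

```shell
# Options must come before the application file; --py-files takes a
# comma-separated list of .py/.zip/.egg files to ship to the executors.
spark-submit \
  --master spark://spark-server:7077 \
  --py-files dist/app-0.0.1-py3.8.egg,requirements.zip \
  main.py
```

This assumes requirements.zip actually contains the tldextract package (and its dependencies); if it does not, the module must be installed on the worker nodes instead.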