
Python: ImportError when using a PySpark UDF

Tags: python, apache-spark, pyspark, python-import, spark-submit

I'm trying to run a Spark application with spark-submit. I created the following UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from tldextract import tldextract

@udf(StringType())
def get_domain(url):
    ext = tldextract.extract(url)
    return ext.domain
Then I use it like this:

df = df.withColumn('domain', col=get_domain(df['url']))
and get the following error:

Driver stacktrace:
21/01/03 16:53:41 INFO DAGScheduler: Job 1 failed: showString at NativeMethodAccessorImpl.java:0, took 2.842401 s
Traceback (most recent call last):
  File "/home/michal/dv-etl/main.py", line 54, in <module>
    main()
  File "/home/michal/dv-etl/main.py", line 48, in main
    df.show(truncate=False)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 442, in show
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'tldextract'

Thanks, everyone!

Does this answer your question? Have you set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON? Could you try putting
from tldextract import tldextract
inside get_domain(url)?
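A minimal sketch of that suggestion, assuming tldextract is actually installed in the workers' Python environment (the module still has to be importable there either way):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def get_domain(url):
    # Importing inside the function body means cloudpickle no longer
    # captures a module-level reference, so the __import__ happens at
    # call time on the executor instead of while the UDF is unpickled.
    from tldextract import tldextract
    ext = tldextract.extract(url)
    return ext.domain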
spark-submit --master spark://spark-server:7077 main.py --py-files dist/app-0.0.1-py3.8.egg requirements.zip
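One thing worth checking in this command: spark-submit treats everything after the application file as arguments to the application itself, so as written, --py-files dist/app-0.0.1-py3.8.egg and requirements.zip are handed to main.py rather than to spark-submit, and nothing is shipped to the executors. A reordered sketch (assuming requirements.zip actually bundles tldextract; note that --py-files takes a comma-separated list):

spark-submit --master spark://spark-server:7077 --py-files dist/app-0.0.1-py3.8.egg,requirements.zip main.py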