GridSearchCV on a Spark cluster - ImportError: No module named model_selection._validation

Tags: python, apache-spark, machine-learning

I am trying to perform a grid search on a Spark cluster with the spark-sklearn library. For this reason, I run nohup ./spark_python_shell.sh > output.log & at my bash shell to ignite the Spark cluster, and I also get my Python script running (see spark-submit --master yarn 'rforest_grid_search.py' below).

In this rforest_grid_search.py Python script there is the following source code, which tries to connect the grid search with the Spark cluster:

# Spark configuration
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
print('Spark Context:', sc)

# Hyperparameters' grid
parameters = {'n_estimators': list(range(150, 200, 25)),
              'criterion': ['gini', 'entropy'],
              'max_depth': list(range(2, 11, 2)),
              'max_features': [i/10. for i in range(10, 16)],
              'class_weight': [{0: 1, 1: i/10.} for i in range(10, 17)],
              'min_samples_split': list(range(2, 7))}

# Execute grid search - using spark_sklearn library
from spark_sklearn import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
classifiers_grid = GridSearchCV(sc, estimator=RandomForestClassifier(), param_grid=parameters, scoring='precision', cv=5, n_jobs=-1)
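# X and y (training features and labels) are assumed to be prepared earlier in the full script;
# only the grid-search part is shown here.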
classifiers_grid.fit(X, y)
When I run the Python script, I get an error at the classifiers_grid.fit(X, y) line, as shown below:

ImportError: No module named model_selection._validation
...
    ('Spark Context:', <SparkContext master=yarn appName=rforest_grid_search.py>)
...
    18/10/24 12:43:50 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, oser404637.*****.com, executor 2, partition 2, PROCESS_LOCAL, 42500 bytes)
    18/10/24 12:43:50 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, oser404637.*****.com, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/worker.py", line 216, in main
        func, profiler, deserializer, serializer = read_command(pickleSer, infile)
      File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/worker.py", line 58, in read_command
        command = serializer._read_with_length(file)
      File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
        return self.loads(obj)
      File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/serializers.py", line 562, in loads
        return pickle.loads(obj)
    ImportError: No module named model_selection._validation

            at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
            at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
            at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
...
When I run the same Python script with a small modification (regarding the cross-validation), I get the following error:

Traceback (most recent call last):
  File "/data/users/******/rforest_grid_search.py", line 126, in <module>
    classifiers_grid.fit(X, y)
  File "/usr/lib/python2.7/site-packages/spark_sklearn/grid_search.py", line 274, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/usr/lib/python2.7/site-packages/spark_sklearn/grid_search.py", line 321, in _fit
    indexed_out0 = dict(par_param_grid.map(fun).collect())
  File "/u/users/******/spark-2.3.0/python/lib/pyspark.zip/pyspark/rdd.py", line 824, in collect
  File "/u/users/******/spark-2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/u/users/******/spark-2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, oser402389.wal-mart.com, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/u/applic/data/hdfs1/hadoop/yarn/local/usercache/******/appcache/application_1539785180345_42235/container_e126_1539785180345_42235_01_000002/pyspark.zip/pyspark/worker.py", line 216, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/u/applic/data/hdfs1/hadoop/yarn/local/usercache/******/appcache/application_1539785180345_42235/container_e126_1539785180345_42235_01_000002/pyspark.zip/pyspark/worker.py", line 58, in read_command
    command = serializer._read_with_length(file)
  File "/u/applic/data/hdfs1/hadoop/yarn/local/usercache/******/appcache/application_1539785180345_42235/container_e126_1539785180345_42235_01_000002/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/u/applic/data/hdfs1/hadoop/yarn/local/usercache/******/appcache/application_1539785180345_42235/container_e126_1539785180345_42235_01_000002/pyspark.zip/pyspark/serializers.py", line 562, in loads
    return pickle.loads(obj)
ImportError: No module named sklearn.base
How can I resolve this problem and execute GridSearchCV on the Spark cluster?

Does this error simply mean that scikit-learn and/or spark-sklearn are not installed at the Spark worker nodes (even though they are evidently installed at the Spark edge/driver node which I am using to connect to the Spark cluster)?

Does this error simply mean that scikit-learn and/or spark-sklearn are not installed at the Spark worker nodes?

Yes, that is exactly what it indicates, or to be more precise, the modules are not on the path of the Python interpreter used by the Spark workers.
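One quick way to verify this (a minimal diagnostic sketch, not part of the original answer; it assumes the SparkContext sc created in the question's script) is to run a trivial job that attempts the failing imports inside the executors and reports which hosts are missing the packages:

def check_worker_imports(_):
    # Runs on an executor: report the host name and whether the packages import there.
    import socket
    try:
        import sklearn
        import spark_sklearn  # the two packages the grid search needs on every worker
        return (socket.gethostname(), 'ok', sklearn.__version__)
    except ImportError as exc:
        return (socket.gethostname(), 'missing', str(exc))

# Spread a few dummy tasks across the cluster and keep one result per host.
print(sorted(set(sc.parallelize(range(100), 100).map(check_worker_imports).collect())))

Any host that comes back as 'missing' needs the packages made available to its Python interpreter before the grid search can run.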

In general, all modules used by the worker-side code have to be accessible on every node. There are different options here, depending on the complexity of the dependencies:

  • Install all dependencies on each node, either directly or in the container / environment used to run the worker code.