
Apache Spark: Nested cross-validation with spark_sklearn GridSearchCV causes a SPARK-5063 error


Performing nested cross-validation with spark_sklearn GridSearchCV as the inner CV and sklearn cross_validate/cross_val_score as the outer CV results in the error "It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation".
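
For concreteness, here is a minimal sketch of the kind of setup that produces this error; the dataset, estimator, and parameter grid are illustrative placeholders, not the ones from my actual script:

# Sketch of the nested-CV setup. Placeholder data and estimator;
# only the GridSearchCV / cross_validate nesting matters here.
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from spark_sklearn import GridSearchCV  # Spark-parallelized grid search

sc = SparkContext.getOrCreate()
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = KFold(n_splits=3)
outer_cv = KFold(n_splits=5)

# Inner CV: spark_sklearn distributes the parameter grid over the cluster.
gs = GridSearchCV(sc, RandomForestClassifier(),
                  {"n_estimators": [50, 100], "max_depth": [3, 5]},
                  cv=inner_cv)

# Outer CV: plain sklearn. This call clones `gs` for each outer fold,
# which deep-copies the SparkContext it holds and raises SPARK-5063.
scores = cross_validate(gs, X, y, cv=outer_cv, n_jobs=1,
                        return_train_score=False)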

I have already tried changing n_jobs=-1 to n_jobs=1 to remove the joblib-based parallelism and retried, but it still raises the same exception:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Complete Traceback (most recent call last):
  File "model_evaluation.py", line 350, in <module>
    main()
  File "model_evaluation.py", line 269, in main
    scores = cross_validate(gs, X, y, cv=outer_cv, scoring=scoring_metric, n_jobs=-1, return_train_score=False)
  File "../python27/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 195, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "../python27/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "../python27/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 620, in dispatch_one_batch
    tasks = BatchedCalls(itertools.islice(iterator, batch_size))
  File "../python27/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 127, in __init__
    self.items = list(iterator_slice)
  File "../python27/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 195, in <genexpr>
    for train, test in cv.split(X, y, groups))
  File "../python27/lib/python2.7/site-packages/sklearn/base.py", line 61, in clone
    new_object_params[name] = clone(param, safe=False)
  File "../python27/lib/python2.7/site-packages/sklearn/base.py", line 52, in clone
    return copy.deepcopy(estimator)
  File "/usr/local/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
  File "/usr/local/lib/spark/python/pyspark/context.py", line 279, in __getnewargs__
    "It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on the driver, not
in code that it run on workers. For more information, see SPARK-5063.

EDIT:
The problem seems to be that sklearn's cross_validate() clones each estimator to be fitted in a way that amounts to pickling the estimator object, which the PySpark GridSearchCV estimator does not allow, because a SparkContext object cannot/should not be pickled. So how do we clone the estimator correctly?
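
The deepcopy failure can be reproduced in isolation, without running any cross-validation; a sketch, assuming the gs estimator from the setup above:

# sklearn's clone() falls back to copy.deepcopy() for constructor
# parameters it cannot rebuild; for the spark_sklearn estimator that
# eventually reaches the SparkContext and raises SPARK-5063.
from sklearn.base import clone

try:
    clone(gs)
except Exception as e:
    print(e)  # "It appears that you are attempting to reference SparkContext ..."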

I finally came up with a solution. The problem occurs when the scikit-learn clone() function tries to deep-copy the SparkContext object. The solution I use is a bit hacky, and I would certainly switch to a better one if someone has it, but it works: import the copy module and override its deepcopy() function so that it skips SparkContext objects when it sees them.

# Mock the deep-copy function to ignore copying SparkContext objects
# Helps avoid pickling errors or broadcast variable errors
import copy
from pyspark import SparkContext

_deepcopy = copy.deepcopy

def mock_deepcopy(*args, **kwargs):
    # Pass SparkContext through by reference; deep-copy everything else
    if isinstance(args[0], SparkContext):
        return args[0]
    return _deepcopy(*args, **kwargs)

copy.deepcopy = mock_deepcopy

So now it no longer tries to copy the SparkContext object, and everything seems to work.
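
With the patch in place (it only needs to be installed once, before the outer cross-validation starts), the original call from the traceback completes; a usage sketch, reusing gs, outer_cv, and scoring_metric from above:

# After copy.deepcopy has been replaced, clone() passes the
# SparkContext through by reference instead of deep-copying it,
# so the outer CV no longer hits SPARK-5063.
scores = cross_validate(gs, X, y, cv=outer_cv, scoring=scoring_metric,
                        n_jobs=-1, return_train_score=False)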

This is cool, but I can't believe I couldn't find a simpler answer. Surely there has to be a way to do nested CV with Spark by now?