Python — Solved: Jupyter Notebook PySpark OSError [WinError 123] The filename, directory name, or volume label syntax is incorrect

System configuration:
OS: Windows 10
Python version: 3.7
Spark version: 2.4.4
SPARK_HOME: C:\spark\spark-2.4.4-bin-hadoop2.7

Problem
I use PySpark to run a computation over all columns of each row of a DataFrame in parallel. I convert a Pandas DataFrame into a Spark DataFrame, then perform a map transformation followed by a collect action on it. During the collect action, a Py4J error with an OSError pops up. The error points at the import sklearn statement and at the trained classifier (ML model).

Code snippet (reproduced below, after the answer and comments)

Error (full traceback reproduced below)


I ran into the same problem, with the file path containing C:\\C:\\. I found in a discussion that it can be caused by pytest, which is used inside scikit-learn; the issue was reported against scikit-learn 0.21.3. I upgraded my scikit-learn package to 0.22.1 (by upgrading to Anaconda 2020.02) and the error went away.

My environment is Windows 10, Spark 2.4.5, Anaconda 2020.02 (which ships scikit-learn 0.22.1). Note that the older Anaconda release 2019.10 ships scikit-learn 0.21.3.
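
To confirm which scikit-learn build the notebook (and therefore the local-mode Spark workers) actually picks up, a minimal check such as the following can help; the conda command in the comment assumes an Anaconda environment:

import sys
import sklearn
import pyspark

print(sys.executable)                        # interpreter used by the driver and, by default, the workers
print('scikit-learn:', sklearn.__version__)  # should report 0.22.1 or newer after the upgrade
print('pyspark:', pyspark.__version__)

# To upgrade inside an existing Anaconda environment:
#   conda update scikit-learn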

Hi! The volume label is indeed incorrect. Note that it refers to C:\\C:\\. Secondly, if you can train the model with a Pandas DataFrame, why not keep using Pandas for the mapping as well (with pd.DataFrame.apply)? That way you would not be trying to pickle the trained model, which may not work (joblib might serve you better in that case).

Yes, the volume label does look incorrect, but the point is that I never specify a volume label anywhere in my code; it is picked up implicitly. I want to use all available cores for parallel computation to speed up execution, which is why I use PySpark instead of sticking with Pandas pd.DataFrame.apply.

From the stack trace I get the impression that your classifier is not being pickled. Instead of passing it through the map function, let each partition load the trained model (from disk). If that does not help, please provide an MCVE.

Hey @OliverW.! I tried your suggestion, but I get the same error through the sklearn package. I am updating the question with an MCVE. Please take a look and let me know where I went wrong.

Your MCVE runs fine on my machine (Ubuntu 18.04). Which versions of Spark and sklearn are you using (run print(pyspark.__version__) after importing pyspark)? Could you reinstall Spark? By the way, your MCVE is much appreciated: I rarely see someone provide such a good MCVE after being asked for one.

Hey @Grace, it worked like a charm, thanks. In the meantime I had built a workaround for PySpark and started using the Python multiprocessing API (a sketch of that approach follows below).
Code snippet (as posted; x_train, y_train, pandasDF and func2 are defined elsewhere):

import pandas as pd
from sklearn.neural_network.multilayer_perceptron import MLPClassifier

classifier = MLPClassifier()
classifier.fit(x_train, y_train)

def func1(rows, trained_model=classifier):
    # Convert the Spark Row to a pandas Series and hand it to func2,
    # which lives in another file containing an import sklearn statement.
    items = rows.asDict()
    row = pd.Series(items)
    output = func2(row, trained_model)
    return output

spdf = spark.createDataFrame(pandasDF)
result = spdf.rdd.map(lambda row: func1(row)).collect()
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-33-0bfb9d088e2d> in <module>
----> 1 result=spdf.rdd.map(lambda row:clusterCreation(row)).collect()
      2 print(type(result))
.
.
.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 1 times, most recent failure: Lost task 2.0 in stage 2.0 (TID 5, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 364, in main
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 71, in read_command
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 580, in loads
    return pickle.loads(obj, encoding=encoding)
.
.
.
 File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\ensemble\__init__.py", line 7, in <module>
    from .forest import RandomForestClassifier
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py", line 53, in <module>
    from ..metrics import r2_score
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\metrics\__init__.py", line 7, in <module>
    from .ranking import auc
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py", line 35, in <module>
    from ..preprocessing import label_binarize
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\preprocessing\__init__.py", line 6, in <module>
    from ._function_transformer import FunctionTransformer
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\preprocessing\_function_transformer.py", line 5, in <module>
    from ..utils.testing import assert_allclose_dense_sparse
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\sklearn\utils\testing.py", line 718, in <module>
    import pytest
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\pytest.py", line 6, in <module>
    from _pytest.assertion import register_assert_rewrite
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\_pytest\assertion\__init__.py", line 6, in <module>
    from _pytest.assertion import rewrite
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 20, in <module>
    from _pytest.assertion import util
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\_pytest\assertion\util.py", line 5, in <module>
    import _pytest._code
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\_pytest\_code\__init__.py", line 2, in <module>
    from .code import Code  # noqa
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\_pytest\_code\code.py", line 11, in <module>
    import pluggy
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\pluggy\__init__.py", line 16, in <module>
    from .manager import PluginManager, PluginValidationError
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\pluggy\manager.py", line 6, in <module>
    import importlib_metadata
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 466, in <module>
    __version__ = version(__name__)
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 433, in version
    return distribution(package).version
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 406, in distribution
    return Distribution.from_name(package)
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 176, in from_name
    dist = next(dists, None)
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 362, in <genexpr>
    for path in map(cls._switch_path, paths)
  File "C:\Users\rkagr\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 377, in _search_path
    if not root.is_dir():
  File "C:\Users\rkagr\Anaconda3\lib\pathlib.py", line 1351, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "C:\Users\rkagr\Anaconda3\lib\pathlib.py", line 1161, in stat
    return self._accessor.stat(self)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\C:\\spark\\spark-2.4.4-bin-hadoop2.7\\jars\\spark-core_2.11-2.4.4.jar'
MCVE (as referenced in the comments):

import findspark

findspark.init()
findspark.find()

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('MRC').setMaster('local[2]')
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession.builder.getOrCreate()

import sklearn
import sklearn.datasets
import sklearn.model_selection
import sklearn.ensemble

iris = sklearn.datasets.load_iris()
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(iris.data, iris.target, train_size=0.80)

classifier = sklearn.ensemble.RandomForestClassifier()
classifier.fit(train, labels_train)

import pickle
path = './random_classifier.mdl'
pickle.dump(classifier, open(path,'wb'))

import pandas as pd
pddf = pd.DataFrame(test)
spdf = spark.createDataFrame(pddf)

def clusterCreation(rows, classifier_path):
    # Load the pickled classifier from disk inside the task instead of
    # shipping the trained model through the map closure.
    items = rows.asDict()
    row = pd.Series(items)
    with open(classifier_path, 'rb') as fp:
        classifier = pickle.load(fp)
        print(classifier)
    return items

result = spdf.rdd.map(lambda row: clusterCreation(row, classifier_path=path)).collect()
print(result)
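
Following OliverW.'s comment, the per-row pickle.load in clusterCreation can be hoisted so the model is deserialized only once per partition with mapPartitions. A sketch that reuses spdf and path from the MCVE above (the prediction step is an assumption, since the original func2 is not shown and clusterCreation only prints the classifier):

def clusterCreationPerPartition(rows, classifier_path=path):
    # Deserialize the trained model once per partition, then score every row with it.
    with open(classifier_path, 'rb') as fp:
        partition_classifier = pickle.load(fp)
    for row in rows:
        features = pd.Series(row.asDict())
        yield partition_classifier.predict([features.values])[0]

result = spdf.rdd.mapPartitions(clusterCreationPerPartition).collect()
print(result)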