
Python: Unicode error when converting an RDD to a Spark DataFrame


When I run the show method on the DataFrame, I get the following error:

Py4JJavaError: An error occurred while calling o14904.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23450.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23450.0 (TID 120652, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-8-b76896bc4e43>", line 320, in <lambda>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)
But when I do this, I get the error:

jpsa_rf.features_df.show(12)
+------------+--------------------+
|Feature_name|    Importance_value|
+------------+--------------------+
| competitive|0.019380017988201638|
|         new|0.012416277407924172|
|self-reliant|0.009044388916918005|
|     related|0.008968947484358822|
|      retail|0.008729510712416655|
|      sales,|0.007680271475590303|
|        work|0.007548541044789985|
| performance|0.007209008630295571|
|    superior|0.007065626808393139|
|     license|0.006436001036918034|
|    industry|0.006416712169788629|
|      record|0.006227581067732823|
+------------+--------------------+
only showing top 12 rows
I created this DataFrame as shown below. It is basically a DataFrame of the features from a random forest model together with their importance values:

vocab = np.array(self.cvModel.bestModel.stages[3].vocabulary)

# Sort the feature importance array in descending order and get the indices
if est_name == "rf":
    feature_importance = self.cvModel.bestModel.stages[5].featureImportances.toArray()
    argsort_feature_indices = feature_importance.argsort()[::-1]
elif est_name == "blr":
    feature_importance = self.cvModel.bestModel.stages[5].coefficients.toArray()
    argsort_feature_indices = abs(feature_importance).argsort()[::-1]

feature_names = vocab[argsort_feature_indices]

self.features_df = sc.parallelize(zip(feature_names, feature_importance[argsort_feature_indices])).\
    map(lambda x: (str(x[0]), float(x[1]))).toDF(["Feature_name", "Importance_value"])

I assume you are using Python 2. The problem at hand is most likely the str(x[0]) part of the map in your code. It seems that x[0] refers to a unicode string, and str is supposed to convert it into a bytestring. It does so, however, by implicitly assuming ASCII encoding, which only works for plain English text.
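
As an illustration (not from the original post), here is a minimal Python 2 snippet reproducing the behaviour the traceback complains about:

# Python 2 only: str() on a unicode object implicitly encodes it with the ASCII codec,
# so any non-ASCII character raises UnicodeEncodeError, just like in the Spark worker.
s = u'self-reliant caf\xe9'        # a unicode string containing a non-ASCII character

try:
    str(s)                         # equivalent to s.encode('ascii') -> fails
except UnicodeEncodeError as err:
    print err                      # 'ascii' codec can't encode character ...

print s.encode('utf-8')            # explicit UTF-8 encoding works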

Things are not supposed to be done this way.

The short answer: change str(x[0]) to x[0].encode('utf-8').

The long answer can be found elsewhere, for example in the usual write-ups on Unicode handling in Python 2.
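
Applied to the construction from the question, the mapping would then look roughly like this. This is only a sketch; it assumes sc, feature_names, feature_importance and argsort_feature_indices are already defined exactly as in the code above:

# Encode the feature name explicitly as UTF-8 instead of calling str() on it.
feature_rows = zip(feature_names, feature_importance[argsort_feature_indices])

features_df = sc.parallelize(feature_rows).map(
    lambda x: (x[0].encode('utf-8'), float(x[1]))   # was: str(x[0])
).toDF(["Feature_name", "Importance_value"])

features_df.show(12)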
