Spark 2.1 PySpark Bug: sc.textFile("test.txt").repartition(2).collect()
Using a plain text file:
echo "a\nb\nc\nd" >> test.txt
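Whether `echo` expands `\n` depends on the shell, so the file may not end up with four separate lines everywhere. As a minimal sketch, the same four-line file can be written from Python instead. The helper name `write_test_file` and the temporary-directory location are assumptions made for illustration; the original session reads `test.txt` from the Spark directory.

```python
import os
import tempfile


def write_test_file(path):
    """Write the lines a, b, c, d to `path`, one per line."""
    with open(path, "w") as f:
        f.write("a\nb\nc\nd\n")
    return path


# Hypothetical location, chosen so the sketch does not clobber an existing file.
sample_path = write_test_file(os.path.join(tempfile.gettempdir(), "test.txt"))

with open(sample_path) as f:
    print(f.read().splitlines())
```

Pass the resulting path to `sc.textFile(...)` in place of `"test.txt"` to reproduce the session below.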
With a vanilla spark-2.1.0-bin-hadoop2.7.tgz, the following fails. The same test works on older versions of Spark:
$ bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.
>>> sc.textFile("test.txt").collect()
[u'a', u'b', u'c', u'd']
>>> sc.textFile("test.txt").repartition(2).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 529, in load_stream
    yield self.loads(stream)
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 524, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
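A pickle stream of protocol 2 or later always begins with the byte 0x80, so the error above is consistent with pickled records being handed to the UTF-8 string deserializer after `repartition`. That reading is an assumption and is not confirmed anywhere in this thread. The sketch below reproduces the same error message without Spark; `decode_as_utf8` is a hypothetical helper that mimics the `s.decode("utf-8")` call from the traceback.

```python
import pickle


def decode_as_utf8(payload):
    """Decode bytes as UTF-8, returning the exception instead of raising it."""
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


# Pickle the same four strings with protocol 2 (assumed here for illustration).
pickled = pickle.dumps([u"a", u"b", u"c", u"d"], protocol=2)

first_byte = bytearray(pickled)[0]  # bytearray indexing yields an int on Python 2 and 3
result = decode_as_utf8(pickled)

print(hex(first_byte))
print(result)
```

The printed exception reports byte 0x80 at position 0 with the reason "invalid start byte", matching the last line of the traceback.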
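In your local install you are using Scala, whereas in your first test you use Python.
For Python, see @mariusz's answer: this is a known bug. It will be fixed after Spark 2.1.
Update: confirmed that the Python test error is fixed in Spark 2.1.1, and that the test works in Scala.
I know. I pointed that out. I need help getting the Python version to work.
I don't think the linked post is related to my problem. The pyspark example works fine in Spark 2.0.2.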
$ bin/spark-shell
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("test.txt").collect()
res0: Array[String] = Array(a, b, c, d)
scala> sc.textFile("test.txt").repartition(2).collect()
res1: Array[String] = Array(a, c, d, b)