Python 3.x pyspark: sort an RDD by object attribute


I have an RDD named my_rdd that looks like this:

[FreqSequence(sequence=[['John']], freq=18980), 
 FreqSequence(sequence=[['Mary']], freq=106), 
 FreqSequence(sequence=[['John-Mary']], freq=381), 
 FreqSequence(sequence=[['John-Ann']], freq=158), 
 FreqSequence(sequence=[['Ann']], freq=433)]

I then try to sort it as follows:

new_rdd = my_rdd.sortBy(lambda x: x.freq)
new_rdd.take(5)
but I get the following error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-15-94c1babd943f> in <module>()
      1 print(my_rdd.take(5))
      2 new_rdd = my_rdd.sortBy(lambda x: x.freq)
----> 3 new_rdd.take(5)

/usr/local/spark-latest/python/pyspark/rdd.py in take(self, num)
   1341 
   1342             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1343             res = self.context.runJob(self, takeUpToNumLeft, p)
   1344 
   1345             items += res

/usr/local/spark-latest/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    963         # SparkContext#runJob.
    964         mappedRDD = rdd.mapPartitions(partitionFunc)
--> 965         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
    966         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    967 

/usr/local/spark-latest/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark-latest/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark-latest/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 65.0 failed 4 times, most recent failure: Lost task 0.3 in stage 65.0 (TID 115, ph-hdp-inv-dn01, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data/0/yarn/nm/usercache/phanalytics-test/appcache/application_1489740042194_0048/container_e20_1489740042194_0048_01_000002/pyspark.zip/pyspark/worker.py", line 163, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/data/0/yarn/nm/usercache/phanalytics-test/appcache/application_1489740042194_0048/container_e20_1489740042194_0048_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/data/0/yarn/nm/usercache/phanalytics-test/appcache/application_1489740042194_0048/container_e20_1489740042194_0048_01_000002/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/data/0/yarn/nm/usercache/phanalytics-test/appcache/application_1489740042194_0048/container_e20_1489740042194_0048_01_000002/pyspark.zip/pyspark/serializers.py", line 431, in loads
    return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'UserString'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Any idea what is going wrong here? Thanks!

Your code is correct. The error:

ImportError: No module named 'UserString'

is raised because UserString is no longer a standalone module in Python 3.x; it is now part of the collections module. This suggests that you are using an outdated version of PySpark, or that one of its dependencies is out of date.
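For illustration, a minimal sketch (plain Python, no Spark involved) of the change being described: in Python 3 the UserString class is imported from collections rather than as a top-level module.

# Python 2 had a top-level module:
#   import UserString
# In Python 3 the class lives in the collections module instead:
from collections import UserString

s = UserString("John")
print(s.upper())  # JOHN

Note that in your traceback the failing import happens inside pyspark/serializers.py while the worker unpickles the command, not in your own code, which is consistent with the answer: the fix is in the PySpark/Python environment, not in the sortBy call.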


I see. So how do I find out which version of PySpark, or which of its dependencies, I should update to?
Which version of PySpark are you using? Have you tried downloading the latest release?
I am using spark-2.1.0-bin-hadoop2.6. Should I change the version?
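As a starting point for the version question in the comments, here is a minimal sketch (assuming sc is the active SparkContext from the same session) that prints the Spark version plus the Python versions seen by the driver and by an executor:

import sys

print(sc.version)   # Spark version the driver is bound to
print(sys.version)  # Python version on the driver

# Python version the executors actually run (launches a trivial job)
print(sc.parallelize([0], 1).map(lambda _: sys.version).first())

Comparing these outputs against the cluster's spark-2.1.0-bin-hadoop2.6 installation should tell you whether the driver and the executors are running the same Python and PySpark builds.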