Apache Spark: how to apply the reduceByKey function to a .distinct() object in PySpark?

I am currently trying to use PySpark to build a Boolean index for the collection
[(1, "winter is coming"), (2, "ours is the fury"), (3, "the old the true the brave")], where the output is a set of key-value pairs in which each key is a unique word and each value is a list of the original keys of the collection entries that contain that word.
First, I parallelized my collection using the following code:

collection=sc.parallelize([(1, "winter is coming"), (2, "ours is the fury"), (3, "the old the true the brave")])
Then, I proceeded to create the index using the following code:

collection=sc.parallelize([(1, "winter is coming"), (2, "ours is the fury"), (3, "the old the true the brave")])
collection.map(lambda x:(x[0],x[1].split(" "))).flatMapValues(lambda x:x).map(lambda x:(x[1],[x[0]])).distinct().reduceByKey(lambda x,y:x+y).collect()
However, after running that line, I got the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-184-d6bc1884fb69> in <module>()
      1 collection=sc.parallelize([(1, "winter is coming"), (2, "ours is the fury"), (3, "the old the true the brave")])
----> 2 collection.map(lambda x:(x[0],x[1].split(" "))).flatMapValues(lambda x:x).map(lambda x:(x[1],[x[0]])).distinct().reduceByKey(lambda x,y:x+y).collect()

3 frames
/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 150.0 failed 1 times, most recent failure: Lost task 1.0 in stage 150.0 (TID 260, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 1861, in combineLocally
    merger.mergeValues(iterator)
  File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 240, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/rdd.py", line 1861, in combineLocally
    merger.mergeValues(iterator)
  File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 240, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Everything works fine, except that the key 'the' contains duplicate values of the original tuple's key it came from (i.e. 3), as shown below:

[('is', [1, 2]),
 ('true', [3]),
 ('brave', [3]),
 ('winter', [1]),
 ('coming', [1]),
 ('ours', [2]),
 ('the', [2, 3, 3, 3]),
 ('fury', [2]),
 ('old', [3])]

So my question is: how can I remove the duplicate values for the key 'the' before it is fed into the reduceByKey() function? Thanks in advance.
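
As far as I can tell, the TypeError: unhashable type: 'list' in the traceback is raised inside .distinct() itself rather than by the outer reduceByKey(): distinct() hashes every record, and after map(lambda x:(x[1],[x[0]])) each record is a (word, [id]) pair whose list value cannot be hashed. A minimal plain-Python illustration of the difference (the variable names are only for illustration):

record_with_list = ("the", [3])    # value is a list  -> the whole pair is unhashable
record_with_tuple = ("the", (3,))  # value is a tuple -> the whole pair is hashable
hash(record_with_tuple)            # fine
hash(record_with_list)             # raises TypeError: unhashable type: 'list'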

You can de-duplicate the list created by split() (by converting it to a set) to get the desired result:

collection.map(lambda x:(x[0],list(set(x[1].split(" "))))).flatMapValues(lambda x:x).map(lambda x:(x[1],[x[0]])).reduceByKey(lambda x,y:x+y).collect()
Output:
[('fury', [2]), ('true', [3]), ('is', [1, 2]), ('old', [3]), ('the', [2, 3]), ('ours', [2]), ('brave', [3]), ('winter', [1]), ('coming', [1])]
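
If you would rather keep .distinct(), as in the original attempt, another option is to emit plain (word, id) pairs (both hashable), call distinct() on those, and only then wrap the id in a list for reduceByKey. A sketch along those lines, assuming the same collection and an existing SparkContext sc (the name index is just illustrative):

collection = sc.parallelize([(1, "winter is coming"), (2, "ours is the fury"), (3, "the old the true the brave")])
index = (collection
         .flatMapValues(lambda text: text.split(" "))  # (id, word), one pair per word occurrence
         .map(lambda kv: (kv[1], kv[0]))               # (word, id) -- both elements hashable
         .distinct()                                   # safe now: drops duplicate (word, id) pairs
         .map(lambda kv: (kv[0], [kv[1]]))             # wrap the id in a list for concatenation
         .reduceByKey(lambda a, b: a + b))             # concatenate the id lists per word
index.collect()

This should produce the same result as above, e.g. ('the', [2, 3]); the key point is that nothing unhashable ever reaches distinct() or the shuffle.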