Python KeyError: '' (empty string) in PySpark (Spark RDD)
I am working through a simple exercise: recommending new friends from a friendship edge-list graph, computing the top 20 friend candidates for each user subject to some filter conditions. I am using Spark RDDs for this task.

I have an edge list in all_friends that stores the friendship edges as key-value pairs. The graph is undirected, so for every ('0', '1') there is also a ('1', '0'):
all_friends.take(4)
[('0', '1'), ('0', '2'), ('1', '0'), ('1', '3')]
So part of my code contains the following:
from collections import Counter
# after the join, each entry has the form (k, (v1, v2)), hence the lambda expression
results = all_friends\
    .join(all_friends)\
    .filter(filter_conditions)\
    .map(lambda af1f2: (af1f2[1][0], af1f2[1][1]))\
    .groupByKey()\
    .mapValues(lambda v: Counter(v).most_common(20))
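To sanity-check what shapes each stage produces, the same pipeline can be sketched in plain Python on the toy edge list, without Spark. This is my own mock (the filter step is omitted because the question does not show filter_conditions):

```python
from collections import Counter
from itertools import groupby

all_friends = [('0', '1'), ('0', '2'), ('1', '0'), ('1', '3')]

# join(all_friends): pair up every two edges that share the same key,
# producing (k, (v1, v2)) tuples, mirroring what RDD.join emits
joined = [(k1, (v1, v2))
          for (k1, v1) in all_friends
          for (k2, v2) in all_friends
          if k1 == k2]

# map to (v1, v2), then groupByKey + Counter(...).most_common(20)
mapped = sorted(((v1, v2) for (_, (v1, v2)) in joined),
                key=lambda kv: kv[0])
results = [(k, Counter(v for _, v in grp).most_common(20))
           for k, grp in groupby(mapped, key=lambda kv: kv[0])]
print(results)
```

Running this on the four toy edges confirms that every value flowing into groupByKey is a pair of friend ids, so there is no empty-string key at any stage of the logical pipeline itself.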
However, after the map I get a KeyError, shown below. It also happens if I put .keys().collect() after the map. This is strange, because I am not sure why Spark is looking up the key '' (empty string) when it clearly does not exist in my original RDD. I am not sure whether this is related to the full outer join. Can anyone advise?
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 78.0 failed 3 times, most recent failure: Lost task 1.2 in stage 78.0 (TID 291, 100.103.89.116, executor 5): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
process()
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "<ipython-input-155-140ba198945e>", line 2, in <lambda>
KeyError: ''
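The traceback ends in a `<lambda>` from the notebook cell, so the KeyError is raised by Python code running inside a task (most plausibly a filter condition that indexes a dict with a key that turns out to be the empty string), not by Spark itself. A defensive sketch, using a hypothetical lookup dict `user_attrs` that is my assumption about the kind of lookup a filter condition might perform:

```python
# Hypothetical attribute table; not from the original question.
user_attrs = {'0': 'active', '1': 'active', '3': 'inactive'}

def safe_filter(record):
    key, (v1, v2) = record
    # user_attrs[key] would raise KeyError: '' if key were '';
    # .get() with a default never raises
    return user_attrs.get(key, 'unknown') == 'active'

print(safe_filter(('1', ('0', '3'))))   # True
print(safe_filter(('', ('0', '3'))))    # no KeyError, evaluates to False
```

If the real filter uses direct `dict[key]` indexing, switching to `.get()` (or checking `key in dict` first) would turn the crash into an ordinary False, which also makes the offending records easy to collect and inspect.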
Your filter condition looks incorrect. Below is working code with a dummy filter:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from collections import Counter
conf = SparkConf().setAppName('Python Spark').set("spark.executor.memory", "1g")
spark_session = SparkSession.builder.config(conf=conf).getOrCreate()
all_friends = spark_session.sparkContext.parallelize([('0', '1'), ('0', '2'), ('1', '0'), ('1', '3'), ('1', '3')])
# [('0', '1'), ('0', '2'), ('1', '0'), ('1', '3')]
# print(all_friends.take(4))  # take() already returns a list; no collect() needed
def filter_conditions(c):
    # dummy filter: keep only pairs whose key is '1'
    return c[0] == '1'
results = all_friends.join(all_friends).filter(filter_conditions).map(
lambda af1f2: (af1f2[1][0], af1f2[1][1])).groupByKey().mapValues(lambda v: Counter(v).most_common(20))
print(results.collect())
Output:

[('3', [('3', 4), ('0', 2)]), ('0', [('3', 2), ('0', 1)])]

Put each chained call on a separate line, assigning intermediate results, so you can see which call is causing the problem.