
Python: Is it possible to initialize an empty default value on leftOuterJoin?


I have the following two RDDs:

name_to_hour = sc.parallelize([("Amy", [7,8,7,18,19]), ("Dan", [6,7]), ("Emily", [1,2,3,7,7,7,2])])

name_biz = sc.parallelize(["Amy", "Brian", "Chris", "Dan", "Emily"])
I want to join them so that the resulting RDD looks like this:

[('Amy', [7, 8, 7, 18, 19]), ('Chris', []), ('Brian', []), ('Dan', [6, 7]), ('Emily', [1, 2, 3, 7, 7, 7, 2])]
I can achieve this with the following solution, which I find clumsy:

from pyspark import SparkContext

sc = SparkContext()

name_to_hour = sc.parallelize([("Amy", [7,8,7,18,19]), ("Dan", [6,7]), ("Emily", [1,2,3,7,7,7,2])])

name_biz = sc.parallelize(["Amy", "Brian", "Chris", "Dan", "Emily"])

temp = name_biz.map(lambda x: (x, []))

joined_rdd = temp.leftOuterJoin(name_to_hour)

def concat(my_tup):
    # my_tup is the (left_value, right_value) pair produced by leftOuterJoin;
    # the right side is None when a name has no hours, so fall back to [].
    if my_tup[1] is None:
        return []
    else:
        return my_tup[1]

result_rdd = joined_rdd.map(lambda x: (x[0], concat(x[1])))

print("\033[0;34m{}\033[0m".format(result_rdd.collect()))
Is there a better way?


I was thinking my problem would be easier to solve if there were some way to tell leftOuterJoin that non-empty fields should keep their contents from name_to_hour while empty ones get the default value [], but I don't think such an option exists.
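
For reference, the join-based version can at least be collapsed into a single mapValues pass, since leftOuterJoin yields (key, (left, right)) pairs with the right side set to None for unmatched keys. A minimal sketch, using the same RDDs as above:

temp = name_biz.map(lambda x: (x, []))

# Replace the None that leftOuterJoin produces for unmatched keys with [].
result_rdd = temp.leftOuterJoin(name_to_hour).mapValues(lambda v: v[1] if v[1] is not None else [])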

One way to solve this is to exploit the lexicographic ordering of Python lists. Since an empty list always compares "less than" a non-empty one, we can simply take a union and reduce with max:

temp.union(name_to_hour).reduceByKey(max)
Of course, this assumes that the keys are unique.
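
End to end, a runnable sketch of this approach, reusing the setup from the question (and assuming a fresh SparkContext):

from pyspark import SparkContext

sc = SparkContext()

name_to_hour = sc.parallelize([("Amy", [7, 8, 7, 18, 19]), ("Dan", [6, 7]), ("Emily", [1, 2, 3, 7, 7, 7, 2])])

name_biz = sc.parallelize(["Amy", "Brian", "Chris", "Dan", "Emily"])

# Pair every name with the empty default, union in the real data, and keep
# the "larger" list per key: [] compares less than any non-empty list, so
# max picks the real hours whenever they exist.
temp = name_biz.map(lambda x: (x, []))
result_rdd = temp.union(name_to_hour).reduceByKey(max)

print(result_rdd.collect())

If name_to_hour could contain duplicate keys, reduceByKey(max) would keep only the lexicographically largest list for that key rather than combining them, which is why the uniqueness assumption matters.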