How to combine multiple RDDs in PySpark (Spark Streaming)?
For example, in Spark Streaming I have incoming data of the form -
{
    "id": xx,
    "a" : 1,
    "b" : 2,
    "c" : 3,
    "d" : 4,
    "scores" : {
        "score1" : "",
        "score2" : "",
        "score3" : ""
    }
}
The processing pipeline looks like this -
def func1(row):
    row["scores"]["score1"] = row["a"] + row["b"]
    return row

def func2(row):
    row["scores"]["score2"] = row["b"] + row["c"]
    return row

def func3(row):
    row["scores"]["score3"] = row["c"] + row["d"]
    return row

def publish(iter):
    # publish each row to some cloud db
    pass

# For Each RDD
def process(rdd):
    rdd1 = rdd.map(func1)
    rdd2 = rdd1.map(func2)
    rdd3 = rdd2.map(func3)
    rdd3.foreachPartition(publish)
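For context, a process function like this is usually attached to the stream with foreachRDD. The sketch below shows that wiring under stated assumptions - the socket source, host, port and the one-JSON-document-per-line format are illustrative, not part of the original question:

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="scores")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Assumption: each line arriving on the socket is one JSON document
# shaped like the example above.
lines = ssc.socketTextStream("localhost", 9999)
rows = lines.map(json.loads)

# Run the pipeline above on every micro-batch RDD.
rows.foreachRDD(process)

ssc.start()
ssc.awaitTermination()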
Since all of my RDDs are created serially, I understand that this could be parallelized by modifying the process function to -
def process(rdd):
    rdd1 = rdd.map(func1)
    rdd2 = rdd.map(func2)
    rdd3 = rdd.map(func3)
    rdd4 = ...  # combine rdd1, rdd2, rdd3
    rdd4.foreachPartition(publish)
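A side note on the parallelism (my observation, not from the original post): mapping the same source rdd three times does not make func1, func2 and func3 run as three concurrent jobs - it only defines three lineages, and each lineage re-reads its source partition in its own task unless the batch is cached. A minimal sketch of the usual mitigation:

def process(rdd):
    # Cache so the source batch is materialized once and then
    # reused by the three lineages below.
    rdd.cache()
    rdd1 = rdd.map(func1)
    rdd2 = rdd.map(func2)
    rdd3 = rdd.map(func3)
    # rdd1, rdd2 and rdd3 can then be combined as discussed below.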
I have two questions - will the three scoring functions actually run in parallel this way, and how do I combine rdd1, rdd2 and rdd3? Their rows look like this -
{
    "id": xx,
    "a" : 1,
    "b" : 2,
    "c" : 3,
    "d" : 4,
    "scores" : {
        "score1" : "3",
        "score2" : "",
        "score3" : ""
    }
}
{
    "id": xx,
    "a" : 1,
    "b" : 2,
    "c" : 3,
    "d" : 4,
    "scores" : {
        "score1" : "",
        "score2" : "5",
        "score3" : ""
    }
}
{
    "id": xx,
    "a" : 1,
    "b" : 2,
    "c" : 3,
    "d" : 4,
    "scores" : {
        "score1" : "",
        "score2" : "",
        "score3" : "7"
    }
}
into an rdd whose rows look like this -
{
    "id": xx,
    "a" : 1,
    "b" : 2,
    "c" : 3,
    "d" : 4,
    "scores" : {
        "score1" : "3",
        "score2" : "5",
        "score3" : "7"
    }
}
Thanks.

I think the most likely solution for combining the rdds is to take a union of the rdds and then perform a reduce operation that merges the scores. But that is a bad design, because the union accumulates everything in memory. So in this case your first approach, the serial pipeline in process, is the good one.
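To make the union-and-reduce idea concrete, here is a minimal sketch. It assumes each id appears at most once per batch and that merging means keeping whichever score is non-empty in each copy; merge_scores is an illustrative helper, not something from the original post:

def merge_scores(row_x, row_y):
    # Fold two copies of the same logical row into one,
    # keeping every score that has actually been filled in.
    merged = dict(row_x)
    merged["scores"] = dict(row_x["scores"])
    for name, value in row_y["scores"].items():
        if value != "":
            merged["scores"][name] = value
    return merged

def process(rdd):
    rdd1 = rdd.map(func1)
    rdd2 = rdd.map(func2)
    rdd3 = rdd.map(func3)
    combined = (
        rdd1.union(rdd2).union(rdd3)    # one copy of each row per score
            .map(lambda row: (row["id"], row))
            .reduceByKey(merge_scores)  # collapse the copies for each id
            .values()
    )
    combined.foreachPartition(publish)

As noted above, though, the union triples the number of rows handled per batch, so the serial pipeline from the first process function is usually the simpler and cheaper design.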