PySpark flatMapValues (whole list, element of list)
I have an RDD whose keys are integers. For each key, I have a list of strings. Example:
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'])]
What I want is a new RDD that looks like this:
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'], 'transworld')]
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'], 'systems')]
[(0, ['transworld', 'systems', 'inc', 'trying', 'collect', 'debt', 'mine', 'owed', 'inaccurate'], 'inc')] etc.
I think I need flatMapValues, but I can't find a way to use it. Can anyone help?

Maybe this is useful -
val rdd = spark.sparkContext.parallelize(Seq((0, Seq("transworld", "systems", "inc", "trying", "collect", "debt",
  "mine",
  "owed", "inaccurate"))))
rdd.flatMap { case (i, seq) => Seq.fill(seq.length)((i, seq)).zip(seq).map(x => (x._1._1, x._1._2, x._2)) }
  .foreach(println)
/**
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),transworld)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),systems)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),inc)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),trying)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),collect)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),debt)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),mine)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),owed)
* (0,List(transworld, systems, inc, trying, collect, debt, mine, owed, inaccurate),inaccurate)
*/
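Since the question asks specifically about flatMapValues, here is a hedged sketch of how that method could produce the same result. The `flat_map_values` helper below is an illustrative plain-Python stand-in for `RDD.flatMapValues`, used only so the logic runs without a Spark session; in actual PySpark you would call the commented RDD methods instead.

```python
# Plain-Python simulation of a flatMapValues-based approach.
# In PySpark this would roughly be:
#   rdd.flatMapValues(lambda xs: [(xs, w) for w in xs]) \
#      .map(lambda kv: (kv[0], kv[1][0], kv[1][1]))
# flatMapValues keeps the key and flattens the iterable returned for
# each value, so pairing every word with the whole list first keeps
# the list available in the final tuples.

data = [(0, ['transworld', 'systems', 'inc', 'trying',
             'collect', 'debt', 'mine', 'owed', 'inaccurate'])]

def flat_map_values(pairs, f):
    # Mimic RDD.flatMapValues: apply f to each value, re-attach the key
    # to every element of the returned iterable.
    return [(k, out) for k, v in pairs for out in f(v)]

step1 = flat_map_values(data, lambda xs: [(xs, w) for w in xs])
result = [(k, lst, w) for k, (lst, w) in step1]

for row in result:
    print(row)
```

This produces one `(key, whole_list, word)` tuple per word, matching the desired output above.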
flatMap is one way to do it:

rdd.flatMap(lambda x: [x + (e,) for e in x[1]]).collect()
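The one-liner above can be checked without a Spark session by applying the same lambda over a plain list of `(key, words)` tuples; only the RDD calls are replaced by a list comprehension.

```python
# The answer's lambda, with a plain list comprehension standing in
# for rdd.flatMap(...).collect(). Each input tuple x is extended by
# one word e from its list x[1], yielding one output tuple per word.

data = [(0, ['transworld', 'systems', 'inc', 'trying',
             'collect', 'debt', 'mine', 'owed', 'inaccurate'])]

f = lambda x: [x + (e,) for e in x[1]]    # same lambda as the answer
result = [t for x in data for t in f(x)]  # flatMap = map + flatten

print(result[0])    # (0, [...all nine words...], 'transworld')
print(len(result))  # one output tuple per word: 9
```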