PySpark: how to convert an RDD of lists into an RDD of zipped lists?
I have an RDD of lists:

RDD
(
List(1, 2, 3)
List('A', 'B', 'C')
List('a', 'b', 'c')
)
I want to convert it into:

RDD
(
List(1, 'A', 'a')
List(2, 'B', 'b')
List(3, 'C', 'c')
)
I want to do this in PySpark without using the collect action.
I tried the following so that I can convert it into a DataFrame:
+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A| aa|
|  2|  B| bb|
|  3|  C| cc|
+---+---+---+
RDD.zip works on (key, value) pairs: when you zip a first RDD with a second RDD, the elements of the first RDD become the keys of the new RDD and the elements of the second RDD become its values.

Let's walk through the first example.

Create the RDDs:
# Python lists
a = [1, 2, 3]
b = ['A', 'B', 'C']
c = ['a', 'b', 'c']
# Three different RDDs from the Python lists
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
rddc = sc.parallelize(c)
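As a quick sanity check, Python's built-in zip applied to the plain lists already shows the target shape that the RDD pipeline should reproduce:

```python
a = [1, 2, 3]
b = ['A', 'B', 'C']
c = ['a', 'b', 'c']

# zip over all three lists at once flattens everything into 3-tuples
print(list(zip(a, b, c)))  # [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
```

RDD.zip, by contrast, only takes two RDDs at a time, which is why the nesting below appears.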
Zip them one at a time and inspect the (key, value) pairs:
d = rdda.zip(rddb)
print(d.take(1))
[(1, 'A')]  # 1 is the key here and 'A' is the value
d = d.zip(rddc)
print(d.take(1))
[((1, 'A'), 'a')]  # (1, 'A') is the key here and 'a' is the value
print(d.collect())  # This doesn't give the desired output yet
[((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
# To get the desired output, merge key and value into one flat tuple using map
print(d.map(lambda x: x[0] + (x[1],)).take(1))
[(1, 'A', 'a')]
# In lambda x: x[0] + (x[1],), x[0] is the key tuple (1, 'A') and x[1] is the plain
# string value 'a'. Wrapping x[1] in a one-element tuple (x[1],) lets us concatenate
# the two into (1, 'A', 'a').
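The flattening step is plain tuple arithmetic, so it can be checked without Spark; this sketch applies the same lambda to an ordinary Python list shaped like the doubly-zipped RDD:

```python
# Nested (key, value) pairs, shaped like the output of the double zip above
pairs = [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]

# Concatenate the key tuple with a one-element tuple made from the value
flatten = lambda x: x[0] + (x[1],)

print([flatten(x) for x in pairs])  # [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
```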
Finally, convert to a DataFrame (you can pass column names, e.g. toDF(['A1', 'A2', 'A3']), to replace the default _1/_2/_3 headers):
d.map(lambda x: x[0] + (x[1],)).toDF().show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  A|  a|
|  2|  B|  b|
|  3|  C|  c|
+---+---+---+
Hope this helps you work out the second example as well. The output you show for the second example looks wrong: running it, I get
(1, ('A', 'a')) (2, ('B', 'b')) (3, ('C', 'c'))
instead.
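That nested shape also falls out of plain Python if (an assumption about how the second example is structured) the second and third sequences are zipped together first and the result is then zipped with the first:

```python
a = [1, 2, 3]
b = ['A', 'B', 'C']
c = ['a', 'b', 'c']

# Zipping b with c first, then a with those pairs, nests the last two values
print(list(zip(a, zip(b, c))))  # [(1, ('A', 'a')), (2, ('B', 'b')), (3, ('C', 'c'))]
```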