PySpark: how to convert an RDD of lists into an RDD of zipped lists?


RDD( List(1, 2, 3), List('A', 'B', 'C'), List('a', 'b', 'c') )

I want to convert it into:

RDD( List(1, 'A', 'a'), List(2, 'B', 'b'), List(3, 'C', 'c') )

I want to do this in PySpark without using a collect operation.

I tried the following:

  • so that I can then convert it into a DataFrame:

    +---+---+---+
    | A1| A2| A3|
    +---+---+---+
    |  1|  A|  a|
    |  2|  B|  b|
    |  3|  C|  c|
    +---+---+---+
    

    `zip` on an RDD works with (key, value) pairs. When you zip the first RDD with the second RDD, the values of the first RDD become the keys of the new RDD, and the values of the second RDD become the values of the new RDD.
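The pairing behaviour described above can be checked locally with Python's built-in `zip`, which pairs elements positionally just like `RDD.zip` does (a local sketch; `RDD.zip` additionally requires the two RDDs to have the same length and partitioning):

```python
# Local simulation of RDD.zip using Python's built-in zip.
a = [1, 2, 3]
b = ['A', 'B', 'C']
c = ['a', 'b', 'c']

first = list(zip(a, b))        # elements of a pair with elements of b
print(first)                   # [(1, 'A'), (2, 'B'), (3, 'C')]

second = list(zip(first, c))   # each existing pair now pairs with a value
print(second)                  # [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
```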

    Now let's understand this through example 1 -

    Create the RDDs

    #Python Lists
    a = [1, 2, 3]
    b = ['A', 'B', 'C']
    c = ['a','b', 'c']
    
    #3 Different RDDS from Python Lists
    rdda = sc.parallelize(a)
    rddb = sc.parallelize(b)
    rddc = sc.parallelize(c)
    
    Zip them one by one and check the (key, value) pairs -

    d = rdda.zip(rddb)
    print (d.take(1))
    [(1, 'A')] # 1 is key here and 'A' is Value
    
    d = d.zip(rddc)
    print (d.take(1))
    [((1, 'A'), 'a')] # (1, 'A') is key here and 'a' is Value
    
    print (d.collect()) #This wouldn't give us desired output
    [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
    
    #To get the desired output we need to map key and values in the same object/tuple using map
    
    print (d.map(lambda x:x[0]+(x[1], )).take(1))
    [(1, 'A', 'a')]
    
    #lambda x:x[0]+(x[1], )  Here x[0] holds the key tuple (1, 'A') and x[1] is just the string value 'a'. Concatenate the key tuple with the value, converting the value to a one-element tuple (x[1], ).
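An equivalent way to write the same flattening uses tuple unpacking, which some find more readable. A local sketch (`flatten` is a hypothetical helper name; on the RDD the equivalent call would be `d.map(lambda x: (*x[0], x[1]))`):

```python
# The flattening logic from the map step, as a plain function (no Spark needed).
def flatten(pair):
    key_tuple, value = pair          # e.g. ((1, 'A'), 'a')
    return (*key_tuple, value)       # -> (1, 'A', 'a')

pairs = [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
print([flatten(p) for p in pairs])   # [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
```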
    
    Finally, convert it to a DataFrame

    d.map(lambda x:x[0]+(x[1], )).toDF().show()
    +---+---+---+
    | _1| _2| _3|
    +---+---+---+
    |  1|  A|  a|
    |  2|  B|  b|
    |  3|  C|  c|
    +---+---+---+
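`toDF()` also accepts a list of column names, so the `A1/A2/A3` header from the question can be set directly instead of the default `_1/_2/_3`. A sketch (the Spark call is shown as a comment, with the resulting row layout checked locally):

```python
# On the RDD:  d.map(lambda x: x[0] + (x[1],)).toDF(["A1", "A2", "A3"]).show()
# Locally, the rows that DataFrame would hold map names to values like this:
cols = ["A1", "A2", "A3"]
rows = [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
named = [dict(zip(cols, r)) for r in rows]
print(named[0])   # {'A1': 1, 'A2': 'A', 'A3': 'a'}
```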
    

    Hope this helps you work through the second example as well.

    The output of the second example seems to be wrong. By running the second example, I got
    (1, ('A', 'a')) (2, ('B', 'b')) (3, ('C', 'c'))
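If the second example really yields (key, (value1, value2)) pairs like these, the same concatenation trick flattens them too; here the key is a scalar, so it is the key that gets wrapped in a one-element tuple. A local sketch (on the RDD this would be `d.map(lambda x: (x[0],) + x[1])`):

```python
# Flattening (key, (v1, v2)) pairs into (key, v1, v2) triples.
pairs = [(1, ('A', 'a')), (2, ('B', 'b')), (3, ('C', 'c'))]
flat = [(k,) + v for k, v in pairs]
print(flat)   # [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
```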