PySpark: how to convert an RDD of lists into an RDD of zipped lists?


RDD( List(1, 2, 3), List('A', 'B', 'C'), List('a', 'b', 'c') )

I want to convert it into:

RDD( List(1, 'A', 'a'), List(2, 'B', 'b'), List(3, 'C', 'c') )

I want to do this in PySpark without using a collect operation.

I tried the following:

  • so that I can then convert it into a DataFrame:

    +---+---+---+
    | A1| A2| A3|
    +---+---+---+
    |  1|  A|  a|
    |  2|  B|  b|
    |  3|  C|  c|
    +---+---+---+
    

    `zip` on an RDD works with (key, value) pairs. When you zip the first RDD with the second RDD, the values of the first RDD become the keys of the new RDD, and the values of the second RDD become the values of the new RDD.
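The pairing behaviour described above can be checked locally with Python's built-in `zip`, which pairs elements positionally just like `RDD.zip` does (a local sketch; `RDD.zip` additionally requires the two RDDs to have the same length and partitioning):

```python
# Local simulation of RDD.zip using Python's built-in zip.
a = [1, 2, 3]
b = ['A', 'B', 'C']
c = ['a', 'b', 'c']

first = list(zip(a, b))        # elements of a pair with elements of b
print(first)                   # [(1, 'A'), (2, 'B'), (3, 'C')]

second = list(zip(first, c))   # each existing pair now pairs with a value
print(second)                  # [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
```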

    Now let's understand this through example 1 -

    Create the RDDs

    #Python Lists
    a = [1, 2, 3]
    b = ['A', 'B', 'C']
    c = ['a','b', 'c']
    
    #3 Different RDDS from Python Lists
    rdda = sc.parallelize(a)
    rddb = sc.parallelize(b)
    rddc = sc.parallelize(c)
    
    Zip them one by one and check the (key, value) pairs -

    d = rdda.zip(rddb)
    print (d.take(1))
    [(1, 'A')] # 1 is key here and 'A' is Value
    
    d = d.zip(rddc)
    print (d.take(1))
    [((1, 'A'), 'a')] # (1, 'A') is key here and 'a' is Value
    
    print (d.collect()) #This wouldn't give us desired output
    [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
    
    #To get the desired output we need to map key and values in the same object/tuple using map
    
    print (d.map(lambda x:x[0]+(x[1], )).take(1))
    [(1, 'A', 'a')]
    
    #lambda x:x[0]+(x[1], )  Here x[0] holds the key tuple (1, 'A') and x[1] is just the string value 'a'. Concatenate the key tuple with the value, converting the value to a one-element tuple (x[1], ).
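An equivalent way to write the same flattening uses tuple unpacking, which some find more readable. A local sketch (`flatten` is a hypothetical helper name; on the RDD the equivalent call would be `d.map(lambda x: (*x[0], x[1]))`):

```python
# The flattening logic from the map step, as a plain function (no Spark needed).
def flatten(pair):
    key_tuple, value = pair          # e.g. ((1, 'A'), 'a')
    return (*key_tuple, value)       # -> (1, 'A', 'a')

pairs = [((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]
print([flatten(p) for p in pairs])   # [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
```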
    
    Finally, convert it to a DataFrame

    d.map(lambda x:x[0]+(x[1], )).toDF().show()
    +---+---+---+
    | _1| _2| _3|
    +---+---+---+
    |  1|  A|  a|
    |  2|  B|  b|
    |  3|  C|  c|
    +---+---+---+
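`toDF()` also accepts a list of column names, so the `A1/A2/A3` header from the question can be set directly instead of the default `_1/_2/_3`. A sketch (the Spark call is shown as a comment, with the resulting row layout checked locally):

```python
# On the RDD:  d.map(lambda x: x[0] + (x[1],)).toDF(["A1", "A2", "A3"]).show()
# Locally, the rows that DataFrame would hold map names to values like this:
cols = ["A1", "A2", "A3"]
rows = [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
named = [dict(zip(cols, r)) for r in rows]
print(named[0])   # {'A1': 1, 'A2': 'A', 'A3': 'a'}
```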
    

    Hope this helps you work through the second example as well.

    The output of the second example seems to be wrong. By running the second example, I got
    (1, ('A', 'a')) (2, ('B', 'b')) (3, ('C', 'c'))
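If the second example really yields (key, (value1, value2)) pairs like these, the same concatenation trick flattens them too; here the key is a scalar, so it is the key that gets wrapped in a one-element tuple. A local sketch (on the RDD this would be `d.map(lambda x: (x[0],) + x[1])`):

```python
# Flattening (key, (v1, v2)) pairs into (key, v1, v2) triples.
pairs = [(1, ('A', 'a')), (2, ('B', 'b')), (3, ('C', 'c'))]
flat = [(k,) + v for k, v in pairs]
print(flat)   # [(1, 'A', 'a'), (2, 'B', 'b'), (3, 'C', 'c')]
```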