Compare each value of two RDDs in PySpark

I have two RDDs. For example:

employee =    [(31, ['Raffery', 31, 'a', 'b']),
               (33, ['Jones', 33, '1', 'b']),
               (32, ['Heisenberg', 33, 'a', 'b']),
               (37, ['Robinson', 34, 'c', 'cc']),
               (38, ['Smith', 34, 'a', 'b'])]

department =  [(31, ['Raffery', 31, 'c', 'b']),
               (33, ['Jones', 33, 'a', 'b']),
               (34, ['Heisenberg', 33, 'a', 'b'])]
For each key, I want to compare the elements of the first RDD with those of the second RDD:

The output should look like this:

31, failure at e[1][2]

33, failure at e[1][2]


I'm not sure exactly how strict the output format needs to be, but the following gets you almost all of the way there:

Using PySpark DataFrames:

>>> employee = spark.createDataFrame([(31, ['Raffery', 31, 'a', 'b']), (33, ['Jones', 33, '1', 'b']), (32, ['Heisenberg', 33, 'a', 'b'])], ["id_e", "list_e"])
>>> employee.show()
+----+----------------------+
|id_e|list_e                |
+----+----------------------+
|31  |[Raffery, 31, a, b]   |
|33  |[Jones, 33, 1, b]     |
|32  |[Heisenberg, 33, a, b]|
+----+----------------------+

>>> department = spark.createDataFrame([(31, ['Raffery', 31, 'c', 'b']), (33, ['Jones', 33, 'a', 'b']), (34, ['Heisenberg', 33, 'a', 'b'])], ["id_d", "list_d"])
>>> department.show()
+----+----------------------+
|id_d|list_d                |
+----+----------------------+
|31  |[Raffery, 31, c, b]   |
|33  |[Jones, 33, a, b]     |
|34  |[Heisenberg, 33, a, b]|
+----+----------------------+
Join the two on the user id, which I assume is the key:

>>> joined = employee.join(department, employee.id_e == department.id_d)
>>> joined.show()
+----+-------------------+----+-------------------+
|id_e|             list_e|id_d|             list_d|
+----+-------------------+----+-------------------+
|  31|[Raffery, 31, a, b]|  31|[Raffery, 31, c, b]|
|  33|  [Jones, 33, 1, b]|  33|  [Jones, 33, a, b]|
+----+-------------------+----+-------------------+
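Note that this is an inner join, so ids that appear on only one side (32 from employee and 34 from department) simply drop out, which matches the output you asked for. If you also wanted to surface the unmatched ids, you could pass a different join type; a minimal sketch, not needed for the question's output:

>>> outer = employee.join(department, employee.id_e == department.id_d, "full_outer")
>>> outer.where(outer.id_e.isNull() | outer.id_d.isNull()).count()
2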
Then map over the joined rows to list, per user, the indices at which the two lists differ:

>>> joined.rdd.map(lambda row: (row.id_e, [i for i in range(4) if row.list_d[i] != row.list_e[i]])).collect()
[(31, [2]), (33, [2])]
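If you need the exact text from the question, a quick formatting pass over the collected result does it (a sketch; `result` is just a hypothetical name for the list that collect() returned above):

>>> result = [(31, [2]), (33, [2])]
>>> for key, idxs in result:
...     for i in idxs:
...         print("{}, failure at e[1][{}]".format(key, i))
...
31, failure at e[1][2]
33, failure at e[1][2]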
Hope this puts you on the right track. Good luck!
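For completeness: since you start from plain RDDs, the same comparison works without DataFrames at all. A minimal sketch, assuming a SparkContext `sc` and the two keyed lists from the question (the ordering of collect() is not guaranteed):

>>> emp_rdd = sc.parallelize(employee)    # (key, list) pairs from the question
>>> dep_rdd = sc.parallelize(department)
>>> (emp_rdd.join(dep_rdd)                # inner join on key -> (key, (list_e, list_d))
...     .mapValues(lambda p: [i for i, (a, b) in enumerate(zip(p[0], p[1])) if a != b])
...     .collect())
[(31, [2]), (33, [2])]

Using zip with enumerate instead of a hard-coded range(4) also copes with lists of differing lengths, since zip stops at the shorter one.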