Comparing each value of two RDDs in PySpark
I have two RDDs. For example:
employee = [(31, ['Raffery', 31, 'a', 'b']),
(33, ['Jones', 33, '1', 'b']),
(32, ['Heisenberg', 33, 'a', 'b']),
(37, ['Robinson', 34, 'c', 'cc']),
(38, ['Smith', 34, 'a', 'b'])]

department = [(31, ['Raffery', 31, 'c', 'b']),
(33, ['Jones', 33, 'a', 'b']),
(34, ['Heisenberg', 33, 'a', 'b'])]
I want to compare, for each key, the elements of the first RDD against the second RDD. The output should look like this:

31, mismatch at e[1][2]
33, mismatch at e[1][2]
I'm not sure exactly how strict the output format needs to be, but the following gets you almost all the way there, using PySpark DataFrames:
>>> employee = spark.createDataFrame([(31, ['Raffery', 31, 'a', 'b']), (33, ['Jones', 33, '1', 'b']), (32, ['Heisenberg', 33, 'a', 'b'])], ["id_e", "list_e"])
>>> employee.show()
+----+----------------------+
|id_e|list_e |
+----+----------------------+
|31 |[Raffery, 31, a, b] |
|33 |[Jones, 33, 1, b] |
|32 |[Heisenberg, 33, a, b]|
+----+----------------------+
>>> department = spark.createDataFrame([(31, ['Raffery', 31, 'c', 'b']), (33, ['Jones', 33, 'a', 'b']), (34, ['Heisenberg', 33, 'a', 'b'])], ["id_d", "list_d"])
>>> department.show()
+----+----------------------+
|id_d|list_d |
+----+----------------------+
|31 |[Raffery, 31, c, b] |
|33 |[Jones, 33, a, b] |
|34 |[Heisenberg, 33, a, b]|
+----+----------------------+
Assuming the key is the user id, join the two DataFrames:
>>> joined = employee.join(department, employee.id_e == department.id_d)
>>> joined.show()
+----+-------------------+----+-------------------+
|id_e| list_e|id_d| list_d|
+----+-------------------+----+-------------------+
| 31|[Raffery, 31, a, b]| 31|[Raffery, 31, c, b]|
| 33| [Jones, 33, 1, b]| 33| [Jones, 33, a, b]|
+----+-------------------+----+-------------------+
Then map each user to the list indices where the elements differ between the two DataFrames:
>>> joined.rdd.map(lambda row: (row.id_e, [i for i in range(4) if row.list_d[i] != row.list_e[i]])).collect()
[(31, [2]), (33, [2])]
Hope this sets you on the right track. Good luck!