
Python Pyspark: splitting a list within lists and tuples

Tags: python, apache-spark, pyspark

I have the following:

[('HOMICIDE', [('2017', 1)]), 
 ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)]), 
 ('ROBBERY', [('2017', 1)])]
How can I convert it to the following?

[('HOMICIDE', ('2017', 1)), 
 ('DECEPTIVE PRACTICE', ('2015', 10)), 
 ('DECEPTIVE PRACTICE', ('2014', 3)), 
 ('DECEPTIVE PRACTICE', ('2017', 14)), 
 ('DECEPTIVE PRACTICE', ('2016', 14))]
When I try to use map, it throws "AttributeError: 'list' object has no attribute 'map'":

rdd = sc.parallelize([('HOMICIDE', [('2017', 1)]), ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)])])
y = rdd.map(lambda x : (x[0],tuple(x[1])))
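As an aside (this reproduction is my addition, not part of the original question): the AttributeError appears whenever .map is called on a plain Python list rather than an RDD, since lists have no .map method and Python's built-in map is a free function instead:

data = [('HOMICIDE', [('2017', 1)])]

# data.map(lambda x: x)  # AttributeError: 'list' object has no attribute 'map'

# The built-in map is called as a function; here it mirrors the question's lambda:
y = list(map(lambda x: (x[0], tuple(x[1])), data))
# [('HOMICIDE', (('2017', 1),))]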

How about using a list comprehension instead:

y = [(x[0], i) for x in rdd for i in x[1]]
This returns:

[('HOMICIDE', ('2017', 1)), ('DECEPTIVE PRACTICE', ('2017', 14)), ('DECEPTIVE PRACTICE', ('2016', 14)), ('DECEPTIVE PRACTICE', ('2015', 10)), ('DECEPTIVE PRACTICE', ('2013', 4)), ('DECEPTIVE PRACTICE', ('2014', 3))]
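One caveat (an assumption on my part, not stated in the original answer): the comprehension above iterates a plain Python list. A PySpark RDD is not directly iterable, so if the data already lives in an RDD, bring it back to the driver first:

data = rdd.collect()  # materializes the RDD's contents as a plain Python list
y = [(x[0], i) for x in data for i in x[1]]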


map is a method on an RDD, not on a Python list, so you need to parallelize your list first and then use flatMap to flatten the inner lists:

rdd = sc.parallelize([('HOMICIDE', [('2017', 1)]), 
                      ('DECEPTIVE PRACTICE', [('2017', 14), ('2016', 14), ('2015', 10), ('2013', 4), ('2014', 3)]), 
                      ('ROBBERY', [('2017', 1)])])

rdd.flatMap(lambda x: [(x[0], y) for y in x[1]]).collect()

# [('HOMICIDE', ('2017', 1)), 
#  ('DECEPTIVE PRACTICE', ('2017', 14)), 
#  ('DECEPTIVE PRACTICE', ('2016', 14)), 
#  ('DECEPTIVE PRACTICE', ('2015', 10)), 
#  ('DECEPTIVE PRACTICE', ('2013', 4)), 
#  ('DECEPTIVE PRACTICE', ('2014', 3)), 
#  ('ROBBERY', ('2017', 1))]
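As an alternative sketch (my addition, assuming a SparkSession named spark is available alongside sc), the same flattening can be expressed with the DataFrame API using explode from pyspark.sql.functions:

from pyspark.sql.functions import explode

# Build a DataFrame from the same RDD; `counts` is inferred as an array of structs.
df = spark.createDataFrame(rdd, ['crime', 'counts'])

# explode() emits one output row per element of the array column.
df.select('crime', explode('counts').alias('year_count')).show(truncate=False)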


It works fine in plain Python; when I use pyspark I have to move the data to disk.... I think it was a mistake not to mention sc.parallelize in my question. Thanks @asongtoruin

@SachinSukumaran my bad! It looks like the other answer already has you covered.