Python: generate a pyspark dataframe column from a list whose length equals the dataframe's row count


I have an existing pyspark dataframe with 170 columns and 841 rows. I am looking to add another column to it from a list of 'string' values. The list has length 841 and is named totals:

>>> totals
['165024392279', '672183', '1002643', '202292', '216254163906', '4698279464', '9247442818', '60093051178', '22208366804', '994475', '12174', '9404969384', '32118344368', '857443', '48544', '24572495416', '43802661492', '35686122552', '780813', '35414800642', '661474', '531615', '31962803064', '111295163538', '531671', '25776968294', '78538019255', '152455113964', '39305504103', '325507', '1028244', '82294034461', '715748', '12705147430', '678604', '90303771130', '1372443', '362131', '59079186929', '436218', '79528', '41366', '89254591311'...]
One way would be to create a new dataframe and join it with the main dataframe:

from pyspark.sql import Row
new_df = sqlContext.createDataFrame([Row(**{'3G-fixated voice users': t}) for t in totals])
So new_df has one column and 841 rows, but it cannot be joined to the original dataframe because there is no common column to join on.
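One way to manufacture a common column is to give both frames a positional index first. A minimal sketch using the RDD's zipWithIndex, which assigns consecutive 0-based indices (assuming the main frame is called big_df, as later in the question; row_id is a hypothetical helper column name):

# Sketch: attach a consecutive 0-based index to each row.
# Row objects are tuple subclasses, so appending the index yields a tuple.
indexed_big = big_df.rdd.zipWithIndex().map(
    lambda pair: pair[0] + (pair[1],)
).toDF(big_df.columns + ['row_id'])

indexed_new = new_df.rdd.zipWithIndex().map(
    lambda pair: pair[0] + (pair[1],)
).toDF(new_df.columns + ['row_id'])

# The frames now share 'row_id' and can be joined on it.
joined = indexed_big.join(indexed_new, on='row_id').drop('row_id')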

Another naive approach I can think of is to use literals:

from pyspark.sql.functions import array, lit

totals = [str(t) for t in totals]
test_lit = array([array([lit(t) for t in tt]) for tt in totals])
big_df.withColumn('3G-fixated voice users', test_lit)
This adds a new column of type

array<array<string>>

with all of the values in the first row only, which is not what I want.
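For reference, the nested comprehension above iterates over each string character by character, which is where the inner array<string> comes from. Even flattening it to a single array of literals would not help: lit builds a constant expression, so every row would receive the entire 841-element array rather than its own element. A sketch of the flattened version, using the same names as above:

from pyspark.sql.functions import array, lit

# One level of array(): the type becomes array<string>, but the
# expression is constant, so every row gets the whole list.
test_lit = array([lit(t) for t in totals])
big_df.withColumn('3G-fixated voice users', test_lit)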

Is there a way to add a new column from a list when the length of the list is the same as the number of rows in the dataframe?

Still new to pyspark.

Hope this helps:

from pyspark.sql.functions import monotonically_increasing_id

# Toy main dataframe standing in for your 170-column frame.
df = sc.parallelize([(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (16, 17, 18, 19, 20)]).toDF(['col1', 'col2', 'col3', 'col4', 'col5'])
df = df.withColumn("row_id", monotonically_increasing_id())

# Turn the plain list into a one-column dataframe and give it the same ID column.
totals_df = sc.parallelize(['xxx', 'yyy', 'zzz']).map(lambda x: (x,)).toDF(['totals'])
totals_df = totals_df.withColumn("row_id", monotonically_increasing_id())

# Join on the generated IDs, then drop the helper column.
final_df = df.join(totals_df, df.row_id == totals_df.row_id)
final_df = final_df.select([c for c in final_df.columns if c not in {'row_id'}])
final_df.show()
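One caveat worth flagging: monotonically_increasing_id only guarantees increasing and unique IDs, not consecutive ones, so when the two frames have different partitioning their generated IDs may not match. A variant that forces consecutive keys on both sides uses row_number over an ordered window (df and totals_df as in the snippet above; an unpartitioned window pulls all rows to a single partition, which is fine at 841 rows):

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# row_number() yields consecutive 1-based numbers, so both sides
# get identical join keys regardless of partitioning.
w = Window.orderBy(monotonically_increasing_id())
df = df.withColumn('row_id', row_number().over(w))
totals_df = totals_df.withColumn('row_id', row_number().over(w))

final_df = df.join(totals_df, on='row_id').drop('row_id')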

Don't forget to let us know if it solved your problem :)