Python: add a row number to a concatenated column in a PySpark DataFrame


I have the following DataFrame in PySpark:

df = sqlContext.createDataFrame(
[(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
(2,'N','Y',2,1,2,3,'N','Y','Y','N'),
(3,'Y','N',3,1,0,0,'N','N','N','N'),
(4,'N','Y',5,0,1,0,'N','N','N','Y'),
(5,'Y','N',2,2,0,1,'Y','N','N','Y'),
(6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
(7,'N','N',1,1,3,4,'N','Y','N','Y'),
(8,'Y','Y',1,1,2,0,'Y','Y','N','N')
],
('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)
Now I want to create a new column bt_string in the DataFrame by concatenating a few strings. I did the following:

import pyspark.sql.functions as f
from datetime import datetime
from time import strftime
from pyspark.sql import Window

# the below values will change as per requirement
job_id = '123'
sess_id = '99'
batch_id = '1'
time_now = datetime.now().strftime('%Y%m%d%H%M%S')

con_string = job_id + sess_id + batch_id + time_now + '000000000000000'

df1 = df.withColumn('bt_string', f.lit(con_string))
Now I want to assign a unique number to each row of the DataFrame. I applied the row_number function as below:

df2 = df1.withColumn("row_id", f.row_number().over(Window.partitionBy()))
The output is as follows:

df2.show()  

+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
| id|compatible|product|ios| pc|other|devices|customer|subscriber|circle|smb|           bt_string|row_id|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
|  1|         Y|      Y|  0|  0|    0|      2|       Y|         N|     Y|  Y|12399120210301120...|     1|
|  2|         N|      Y|  2|  1|    2|      3|       N|         Y|     Y|  N|12399120210301120...|     2|
|  3|         Y|      N|  3|  1|    0|      0|       N|         N|     N|  N|12399120210301120...|     3|
|  4|         N|      Y|  5|  0|    1|      0|       N|         N|     N|  Y|12399120210301120...|     4|
|  5|         Y|      N|  2|  2|    0|      1|       Y|         N|     N|  Y|12399120210301120...|     5|
|  6|         Y|      Y|  0|  0|    3|      6|       Y|         N|     Y|  N|12399120210301120...|     6|
|  7|         N|      N|  1|  1|    3|      4|       N|         Y|     N|  Y|12399120210301120...|     7|
|  8|         Y|      Y|  1|  1|    2|      0|       Y|         Y|     N|  N|12399120210301120...|     8|
+---+----------+-------+---+---+-----+-------+--------+----------+------+---+--------------------+------+
Now I want to add the row_id column to the bt_string column. I mean something like below:

If the bt_string of the 1st row is 1239912021030112091500000000000000, then add the corresponding row_id value. In the case of the first row, the value will be 1239912021030112091500000000000001.
The new column created should have values like below:

1239912021030112091500000000000001
1239912021030112091500000000000002
1239912021030112091500000000000003
1239912021030112091500000000000004
1239912021030112091500000000000005
1239912021030112091500000000000006
1239912021030112091500000000000007
1239912021030112091500000000000008
I also need to make sure that the column length is always 35 characters.

The string below must not exceed 35 characters in length under any circumstances:

con_string = job_id + sess_id + batch_id + time_now + '000000000000000' 
If the length exceeds 35 characters, then we need to trim the number of zeros added in the statement above.
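In other words, the padding rule could be written with str.ljust, which only adds as many zeros as still fit into 35 characters (a minimal sketch, reusing the variables from the snippets above):

from datetime import datetime

job_id, sess_id, batch_id = '123', '99', '1'
time_now = datetime.now().strftime('%Y%m%d%H%M%S')

# 20 characters here: 3 (job) + 2 (session) + 1 (batch) + 14 (timestamp)
base = job_id + sess_id + batch_id + time_now

# ljust pads with zeros only up to 35 characters, so the number of zeros
# shrinks automatically if the prefix ever grows longer
con_string = base.ljust(35, '0')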


How can I achieve what I want?

Follow the steps below to achieve your result:

# import necessary functions
import pyspark.sql.functions as f
from datetime import datetime
from time import strftime
from pyspark.sql import Window

# assign variables as per requirement 
job_id = '123'
sess_id = '99'
batch_id = '1'
time_now = datetime.now().strftime('%Y%m%d%H%M%S')

# Concatenate the variables to build the base string in the desired format
con_string = job_id + sess_id + batch_id + time_now

# Check the length of the base string and subtract it from the
# maximum length allowed for the column (35)
zero_to_add = 35 - len(con_string)

# Append that many zeros to the base string
new_bt_string = con_string + zero_to_add * '0'

# Add the new column, cast it to decimal(35,0), then apply row_number
# (row_number needs an ordered window, so order by the existing id column
#  to reproduce the numbering shown in the output above)
df1 = df.withColumn('bt_string', f.lit(new_bt_string).cast('decimal(35,0)'))\
    .withColumn('row_id', f.row_number().over(Window.orderBy('id')))

# Add the new column as the sum of the two columns created above
df2 = df1.withColumn('bt_id', f.expr('bt_string + row_id'))
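A side note on the cast: a 35-digit value is far beyond the range of a 64-bit bigint (at most 19 digits), which is presumably why the value is cast to decimal(35,0) before the addition. To eyeball the result, a quick check using only standard DataFrame methods:

# show the generated id next to its components, without truncation
df2.select('id', 'bt_string', 'row_id', 'bt_id').show(truncate=False)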

Maybe something like `df2['new_column'] = df.apply(lambda row: str(int(row['bt_string']) + row['row_id']))`? That is, convert to integers, add them, and then convert back to a string?
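For reference, the same convert-add-convert idea expressed with DataFrame functions rather than a row-wise apply might look like the sketch below, reusing df2 and the functions import from the answer above (bt_id_str is just an illustrative column name):

# cast the decimal sum back to a string and left-pad it with zeros to
# 35 characters, in case the addition ever changes the number of digits
df3 = df2.withColumn('bt_id_str', f.lpad(f.col('bt_id').cast('string'), 35, '0'))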