PySpark: how to concatenate a prefix of '0's to another string column based on a condition

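I have a DataFrame that looks like this:

|string_code|prefix_string_code|
|1234       |001234            |
|123        |000123            |
|56789      |056789            |

Basically, I want to prepend as many '0' characters as needed so that the value in the prefix_string_code column has a length of 6.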

What I have tried:

df.withColumn('prefix_string_code', when(length(col('string_code')) < 6, concat(lit('0' * (6 - length(col('string_code')))), col('string_code'))).otherwise(col('string_code')))
As you can see, the code would actually work if the result were not in decimal form. How can I do this correctly?

Thanks

In this case, you can use the lpad function:

>>> import pyspark.sql.functions as F
>>> from pyspark.sql import Row  # Row is needed for the rdd.map call below

>>> rdd = sc.parallelize([1234, 123, 56789, 1234567])
>>> data = rdd.map(lambda x: Row(x))
>>> df = spark.createDataFrame(data, ['string_code'])
>>> df.show()
+-----------+
|string_code|
+-----------+
|       1234|
|        123|
|      56789|
|    1234567|
+-----------+

>>> df.withColumn('prefix_string_code', F.when(F.length(df['string_code']) < 6, F.lpad(df['string_code'], 6, '0')).otherwise(df['string_code'])).show()
+-----------+------------------+
|string_code|prefix_string_code|
+-----------+------------------+
|       1234|            001234|
|        123|            000123|
|      56789|            056789|
|    1234567|           1234567|
+-----------+------------------+
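
The when/otherwise guard in the answer matters because lpad also shortens values that are longer than the target length. The sketch below is a plain-Python model of that behaviour, not Spark code: lpad_model and prefix_code are hypothetical helper names introduced only for illustration, and the model assumes a non-empty pad string.

```python
def lpad_model(value, length, pad):
    """Plain-Python approximation of lpad for a non-empty pad string."""
    s = str(value)  # the integer codes are treated as strings
    if len(s) >= length:
        # Input at or over the target length is cut down to `length` characters.
        return s[:length]
    missing = length - len(s)
    # Repeat the pad string and trim it so the result is exactly `length` long.
    padding = (pad * missing)[:missing]
    return padding + s


def prefix_code(value, length=6, pad="0"):
    """Mirror of the when(...).otherwise(...) guard: pad only short values."""
    s = str(value)
    return lpad_model(s, length, pad) if len(s) < length else s


codes = [1234, 123, 56789, 1234567]
unguarded = [lpad_model(c, 6, "0") for c in codes]
guarded = [prefix_code(c) for c in codes]
print(unguarded)  # ['001234', '000123', '056789', '123456']
print(guarded)    # ['001234', '000123', '056789', '1234567']
```

Without the guard, the 7-digit code loses its last digit; with it, values that are already 6 characters or longer pass through unchanged, which matches the output shown above.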