Padding in a PySpark DataFrame


I have a PySpark DataFrame (the original DataFrame) containing the following data (all columns have string data type):

I need to create a new, modified DataFrame in which the Value column is padded so that it is always 4 characters long. If a value is shorter than 4 characters, zeros should be prepended, like this:

  id             Value
   1             0103
   2             1504
   3             0001  

Can anyone help me achieve this with a PySpark DataFrame? Any help would be appreciated.

You can use lpad from the functions module:

>>> from pyspark.sql.functions import lpad
>>> df.select('id', lpad(df['value'], 4, '0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
|  1| 0103|
|  2| 1504|
|  3| 0001|
+---+-----+
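For intuition, lpad pads on the left to the target length and truncates strings that are already longer than that length. A plain-Python sketch of the same semantics (an illustration only, not Spark code):

```python
# Plain-Python sketch of Spark's lpad semantics (illustration only;
# the real transformation runs inside Spark).
def lpad(s: str, length: int, pad: str) -> str:
    """Left-pad s with `pad` to exactly `length` characters.
    Like Spark's lpad, longer strings are truncated to `length`."""
    if len(s) >= length:
        return s[:length]
    needed = length - len(s)
    return (pad * needed)[:needed] + s

print(lpad("103", 4, "0"))    # 0103
print(lpad("1", 4, "0"))      # 0001
print(lpad("12345", 4, "0"))  # 1234 (truncated)
```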

Use the PySpark lpad function together with withColumn:

import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0')) 

What if some of the values in this column are also null? Please specify how the prefix should be added in that case.
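One caveat when values can be null: F.lpad returns null for a null input, so missing values stay null rather than becoming '0000'. If a default is wanted, the lpad call can be wrapped in F.coalesce with F.lit('0000'). A plain-Python sketch of that null-aware behaviour, with None standing in for SQL NULL (an illustration, not Spark code):

```python
# Null-aware left padding, mirroring F.coalesce(F.lpad(col, 4, '0'), F.lit(default)).
# None stands in for a SQL NULL value.
def lpad_or_default(value, length=4, pad="0", default=None):
    if value is None:
        # Spark's lpad would propagate NULL here; substitute a default instead
        return default
    # rjust pads on the left; the slice truncates strings already longer than `length`
    return value.rjust(length, pad)[:length]

print(lpad_or_default("103"))                  # 0103
print(lpad_or_default(None))                   # None
print(lpad_or_default(None, default="0000"))   # 0000
```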
**(Padding column values using Dataset in Spark Java)**
Functions.lpad(Column, Int32, String) method:
Left-pad the string column with pad to the given length len. If the string column is longer than len, the return value is shortened to len characters.

Parameters
column Column
Column to apply

length Int32
Target length of the padded string

padding String
String used for padding

Example:
Input CSV data:
+-------+--------+--------+
|   name| address|  salary|
+-------+--------+--------+
|   Arun|  Indore|     500|
|Shubham|  Indore|    1000|
| Mukesh|Hariyana|   10000|
|   Arun|  Bhopal|  100000|
|Shubham|Jabalpur| 1000000|
| Mukesh|  Rohtak|10000000|
+-------+--------+--------+

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

Dataset<Row> dataset = sparkSession.read().option("header", "true")
                .csv("C:\\Users\\Desktop\\Spark\\user.csv");

Dataset<Row> select = dataset.select(dataset.col("*"),functions.lpad(dataset.col("Salary"), 7, "0").as("SalaryWithLeftPadding"));

select.show();

Output:
+-------+--------+--------+---------------------+
|   name| address|  salary|SalaryWithLeftPadding|
+-------+--------+--------+---------------------+
|   Arun|  Indore|     500|              0000500|
|Shubham|  Indore|    1000|              0001000|
| Mukesh|Hariyana|   10000|              0010000|
|   Arun|  Bhopal|  100000|              0100000|
|Shubham|Jabalpur| 1000000|              1000000|
| Mukesh|  Rohtak|10000000|              1000000|
+-------+--------+--------+---------------------+

Functions.rpad(Column, Int32, String) method:
Right-pad the string column with pad to the given length len. If the string column is longer than len, the return value is shortened to len characters.

Parameters
column Column
Column to apply

length Int32
Target length of the padded string

padding String
String used for padding

Example:

Dataset<Row> dataset = sparkSession.read().option("header", "true")
                .csv("C:\\Users\\Desktop\\Spark\\user.csv");

Dataset<Row> select = dataset.select(dataset.col("*"),functions.rpad(dataset.col("Salary"), 7,"0").as("SalaryWithRightPadding"));

select.show();

Output:
+-------+--------+--------+----------------------+
|   name| address|  salary|SalaryWithRightPadding|
+-------+--------+--------+----------------------+
|   Arun|  Indore|     500|               5000000|
|Shubham|  Indore|    1000|               1000000|
| Mukesh|Hariyana|   10000|               1000000|
|   Arun|  Bhopal|  100000|               1000000|
|Shubham|Jabalpur| 1000000|               1000000|
| Mukesh|  Rohtak|10000000|               1000000|
+-------+--------+--------+----------------------+
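The truncation in the last few rows above (a salary longer than 7 characters keeps only its leading 7 characters) can be checked with a small plain-Python mirror of the rpad semantics (an illustration, not Spark code):

```python
# Plain-Python mirror of functions.rpad: pad on the right to `length`,
# keep only the leading `length` characters when the input is longer.
def rpad(s: str, length: int, pad: str) -> str:
    if len(s) >= length:
        return s[:length]
    needed = length - len(s)
    return s + (pad * needed)[:needed]

for salary in ["500", "1000", "10000", "100000", "1000000", "10000000"]:
    print(salary, "->", rpad(salary, 7, "0"))
```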