
Apache Spark: dynamically add padding zeroes


Trim col2 and, based on the length (10 minus the length of the trimmed col2), dynamically add that many padding zeroes to col3. Then concatenate col2 and col3.

mock_data = [('TYCO', ' 1303','13'),('EMC', '  120989  ','123'), ('VOLVO  ', '102329  ','1234'),('BMW', '1301571345  ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])

+-------+------------+-----+
|   col1|        col2| col3|
+-------+------------+-----+
|   TYCO|        1303|   13|
|    EMC|    120989  |  123|
|VOLVO  |    102329  | 1234|
|    BMW|1301571345  |     |
|   FORD|         004|21212|
+-------+------------+-----+
 
Expected output:

+-------+------------+-----+----------+
|   col1|        col2| col3|    output|
+-------+------------+-----+----------+
|   TYCO|        1303|   13|1303000013|
|    EMC|    120989  |  123|1209890123|
|VOLVO  |    102329  | 1234|1023291234|
|    BMW|1301571345  |     |1301571345|
|   FORD|         004|21212|0040021212|
+-------+------------+-----+----------+

For example, the trimmed col2 for TYCO is '1303' (length 4), so col3 is left-padded with zeroes to width 10 - 4 = 6, giving '000013', and the concatenated value is '1303000013'.

I computed the pad width for each row like this:

df2 = df.withColumn('length_col2', 10-length(trim(df.col2)))

+-------+------------+-----+-----------+
|   col1|        col2| col3|length_col2|
+-------+------------+-----+-----------+
|   TYCO|        1303|   13|          6|
|    EMC|    120989  |  123|          4|
|VOLVO  |    102329  | 1234|          4|
|    BMW|1301571345  |     |          0|
|   FORD|         004|21212|          7|
+-------+------------+-----+-----------+
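As a quick plain-Python sanity check of that rule (my own addition, using the mock values above): left-pad col3 with zeroes to width (10 - length of trimmed col2), then concatenate.

# Plain-Python check of the padding rule on the trimmed mock values:
# left-pad col3 with zeroes to width (10 - len of trimmed col2), then concat.
for col2, col3 in [(' 1303', '13'), ('  120989  ', '123'), ('102329  ', '1234'),
                   ('1301571345  ', ' '), ('004', '21212')]:
    c2, c3 = col2.strip(), col3.strip()
    print(c2 + c3.rjust(10 - len(c2), '0'))

Each printed line matches the output column above (1303000013, 1209890123, 1023291234, 1301571345, 0040021212).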

What you are looking for is the rpad function from pyspark.sql.functions, as shown below.

Please see the solution below:

%pyspark
mock_data = [('TYCO', ' 1303','13'),('EMC', '  120989  ','123'), ('VOLVO  ', '102329  ','1234'),('BMW', '1301571345  ',' '),('FORD', '004','21212')]
df = spark.createDataFrame(mock_data, ['col1', 'col2','col3'])
df.createOrReplaceTempView("input_df")
spark.sql("SELECT *, concat(rpad(trim(col2),10,'0') , col3) as OUTPUT from input_df").show(20,False)


Result of the SQL query:

+-------+------------+-----+---------------+
|col1   |col2        |col3 |OUTPUT         |
+-------+------------+-----+---------------+
|TYCO   | 1303       |13   |130300000013   |
|EMC    |  120989    |123  |1209890000123  |
|VOLVO  |102329      |1234 |10232900001234 |
|BMW    |1301571345  |     |1301571345     |
|FORD   |004         |21212|004000000021212|
+-------+------------+-----+---------------+

I need to lpad col3 based on the column length_col2, and then concat it with col2.
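A minimal sketch of what that follow-up asks for (my own, not from the thread), assuming the target is the expected-output table from the question: left-pad the trimmed col3 with zeroes to width length_col2, then concatenate with the trimmed col2. expr() is used because SQL's lpad accepts a column expression for the length, while the Python lpad() helper takes a plain int in most Spark versions:

%pyspark
# Pad trim(col3) on the left with zeroes to width (10 - length of trimmed
# col2), i.e. length_col2, then prepend the trimmed col2.
# Note: lpad truncates col3 if it is longer than the target width.
from pyspark.sql.functions import concat, expr, trim

result = df.withColumn(
    'OUTPUT',
    concat(trim(df.col2),
           expr("lpad(trim(col3), 10 - length(trim(col2)), '0')")))
result.show(20, False)

On the mock data this should reproduce the 10-character values from the expected output, e.g. 1303000013 for TYCO and 0040021212 for FORD.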