Remove specific words from a DataFrame column using PySpark
I have a DataFrame:
+-----+------+--------+---------------+
|   id|titulo|    tipo|      formacion|
+-----+------+--------+---------------+
|32084|     A|Material|VION00001 TRADE|
|32350|     B|   Curso|CUS11222 LEADER|
|32362|     C|   Curso|ITIN9876 EVALUA|
|32347|     D|   Curso|CUMPLI VION1234|
|32036|     E|   Curso|EVAN1111 INFORM|
+-----+------+--------+---------------+
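To make the question reproducible, the sample rows could be recreated roughly as follows. This is only a sketch: the names COLUMNS, ROWS and make_sample_df are hypothetical, and storing id as an integer is an assumption (the original column type is not stated).

```python
# Hypothetical helper names; the data values are copied from the table above.
COLUMNS = ["id", "titulo", "tipo", "formacion"]

ROWS = [
    (32084, "A", "Material", "VION00001 TRADE"),
    (32350, "B", "Curso", "CUS11222 LEADER"),
    (32362, "C", "Curso", "ITIN9876 EVALUA"),
    (32347, "D", "Curso", "CUMPLI VION1234"),
    (32036, "E", "Curso", "EVAN1111 INFORM"),
]


def make_sample_df(spark):
    """Build the sample DataFrame from an existing SparkSession."""
    # createDataFrame accepts a list of tuples plus a list of column names.
    return spark.createDataFrame(ROWS, COLUMNS)
```

With a running session, make_sample_df(spark).show() should print a table like the one above.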
I need to remove, from the formacion column, the words that start with VION, CUS, ITIN or EVAN:
+-----+------+--------+---------+
|   id|titulo|    tipo|formacion|
+-----+------+--------+---------+
|32084|     A|Material|    TRADE|
|32350|     B|   Curso|   LEADER|
|32362|     C|   Curso|   EVALUA|
|32347|     D|   Curso|   CUMPLI|
|32036|     E|   Curso|   INFORM|
+-----+------+--------+---------+
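Thanks for your help.

Use the split function to split the column on whitespace, then take the last element of the resulting array:
- In Spark 2.4+, use the element_at function.
- For Spark < 2.4, use reverse(split(...))[0].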
from pyspark.sql.functions import col, element_at, reverse, split

# Spark 2.4+: element_at with index -1 returns the last array element
df.withColumn("formacion", element_at(split(col("formacion"), "\\s"), -1)).show()
# or index the array directly (index 1 is the second token)
df.withColumn("formacion", split(col("formacion"), "\\s")[1]).show()
# or reverse the array and take the first element
df.withColumn("formacion", reverse(split(col("formacion"), "\\s"))[0]).show()
# Note: all three keep the token that comes after the first space, so row 32347
# ("CUMPLI VION1234") yields VION1234 rather than CUMPLI.
#+-----+------+--------+---------+
#|   id|titulo|    tipo|formacion|
#+-----+------+--------+---------+
#|32084|     A|Material|    TRADE|
#|32350|     B|   Curso|   LEADER|
#|32362|     C|   Curso|   EVALUA|
#|32347|     D|   Curso| VION1234|
#|32036|     E|   Curso|   INFORM|
#+-----+------+--------+---------+
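The token-based approach above depends on where the code sits in the string. A possible alternative, sketched here under assumptions, is to delete any word that starts with one of the prefixes using regexp_replace, which works wherever the code appears. PREFIXES, PATTERN, strip_codes and strip_codes_column are hypothetical names; the pure-Python function is included only to check the pattern without a Spark session.

```python
import re

# Prefixes taken from the question; adjust as needed.
PREFIXES = ["VION", "CUS", "ITIN", "EVAN"]

# A whole word beginning with one of the prefixes, plus any whitespace after it.
# The same syntax is valid for Python's re module and for Java regex (used by Spark).
PATTERN = r"\b(?:" + "|".join(PREFIXES) + r")\w*\s*"


def strip_codes(text):
    """Pure-Python check of the pattern on a single string."""
    return re.sub(PATTERN, "", text).strip()


def strip_codes_column(df, column="formacion"):
    """Apply the same pattern to a Spark DataFrame column."""
    # Imported lazily so the pattern can be tested without PySpark installed.
    from pyspark.sql.functions import col, regexp_replace, trim

    # trim removes the space left behind when the code was the last word.
    return df.withColumn(column, trim(regexp_replace(col(column), PATTERN, "")))
```

Calling strip_codes_column(df).show() should then leave CUMPLI for row 32347 as well. Note that the pattern removes every word starting with a listed prefix, so an ordinary word such as CUSTOMER would also be dropped; if the codes always contain digits, a stricter pattern may be preferable.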