Apache Spark: regexp_replace in a PySpark DataFrame

Tags: apache-spark, hadoop, pyspark, apache-spark-sql, pyspark-dataframes

I ran a regexp_replace command on a PySpark DataFrame, and afterwards the data types of all the columns changed to string. Why does this happen?

Below is my schema before applying regexp_replace:

root
 |-- account_id: long (nullable = true)
 |-- credit_card_limit: long (nullable = true)
 |-- credit_card_number: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_number: long (nullable = true)
 |-- amount: long (nullable = true)
 |-- date: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- transaction_code: string (nullable = true)
Schema after applying regexp_replace:

root
 |-- date_type: date (nullable = true)
 |-- c_phone_number: string (nullable = true)
 |-- c_account_id: string (nullable = true)
 |-- c_credit_card_limit: string (nullable = true)
 |-- c_credit_card_number: string (nullable = true)
 |-- c_amount: string (nullable = true)
 |-- c_full_name: string (nullable = true)
 |-- c_transaction_code: string (nullable = true)
 |-- c_shop: string (nullable = true)
The code I used is:

from pyspark.sql.functions import regexp_replace

# keep only the allowed characters in each column, then drop the original column
df = df.withColumn('c_phone_number', regexp_replace("phone_number", "[^0-9]", "")).drop('phone_number')
df = df.withColumn('c_account_id', regexp_replace("account_id", "[^0-9]", "")).drop('account_id')
df = df.withColumn('c_credit_card_limit', regexp_replace("credit_card_limit", "[^0-9]", "")).drop('credit_card_limit')
df = df.withColumn('c_credit_card_number', regexp_replace("credit_card_number", "[^0-9]", "")).drop('credit_card_number')
df = df.withColumn('c_amount', regexp_replace("amount", "[^0-9 ]", "")).drop('amount')
df = df.withColumn('c_full_name', regexp_replace("full_name", "[^a-zA-Z ]", "")).drop('full_name')
df = df.withColumn('c_transaction_code', regexp_replace("transaction_code", "[^a-zA-Z]", "")).drop('transaction_code')
df = df.withColumn('c_shop', regexp_replace("shop", "[^a-zA-Z ]", "")).drop('shop')

Why does this happen? Is there a way to convert the columns back to their original data types, or should I cast them again?
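For what it's worth, the type change is easy to reproduce on a tiny DataFrame (a minimal sketch; the column name and value below are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

# hypothetical one-row DataFrame with a long column
df_demo = spark.createDataFrame([(9876543210,)], ["phone_number"])
df_demo.printSchema()   # phone_number: long

# regexp_replace returns a string column regardless of the input type
df_demo = df_demo.withColumn("c_phone_number",
                             regexp_replace("phone_number", "[^0-9]", ""))
df_demo.printSchema()   # c_phone_number: string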

You may want to take a look at how regexp_replace is implemented in the Spark git repository -

override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
  if (!p.equals(lastRegex)) {
    // regex value changed
    lastRegex = p.asInstanceOf[UTF8String].clone()
    pattern = Pattern.compile(lastRegex.toString)
  }
  if (!r.equals(lastReplacementInUTF8)) {
    // replacement string changed
    lastReplacementInUTF8 = r.asInstanceOf[UTF8String].clone()
    lastReplacement = lastReplacementInUTF8.toString
  }
  val m = pattern.matcher(s.toString())
  result.delete(0, result.length())
  while (m.find) {
    m.appendReplacement(result, lastReplacement)
  }
  m.appendTail(result)
  UTF8String.fromString(result.toString)
}
  • The code above takes the expressions as Any and then calls toString() on them.
  • Finally, it wraps the toString result back up with UTF8String.fromString(result.toString), so the output is always a string, whatever the input type was. To recover numeric columns you need an explicit cast, as sketched below.
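One way to get the original data types back is to chain a cast onto each cleaned column (a sketch only, reusing the column names from the question and assuming the cleaned values are still valid numbers):

from pyspark.sql.functions import regexp_replace

# clean and cast back to the original type in a single step
df = (
    df
    .withColumn("c_phone_number",
                regexp_replace("phone_number", "[^0-9]", "").cast("long"))
    .drop("phone_number")
    .withColumn("c_amount",
                regexp_replace("amount", "[^0-9 ]", "").cast("long"))
    .drop("amount")
)
df.printSchema()   # c_phone_number: long, c_amount: long

The same pattern applies to the remaining numeric columns; the string columns can keep the plain regexp_replace output.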
    
Reference -