Apache Spark: regexp_replace in a PySpark DataFrame

Tags: apache-spark, hadoop, pyspark, apache-spark-sql, pyspark-dataframes

I ran a regexp_replace command on a PySpark DataFrame, and afterwards the data types of all the columns changed to string. Why does this happen?

Below is my schema before applying regexp_replace:

root
 |-- account_id: long (nullable = true)
 |-- credit_card_limit: long (nullable = true)
 |-- credit_card_number: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_number: long (nullable = true)
 |-- amount: long (nullable = true)
 |-- date: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- transaction_code: string (nullable = true)
Schema after applying regexp_replace:

root
 |-- date_type: date (nullable = true)
 |-- c_phone_number: string (nullable = true)
 |-- c_account_id: string (nullable = true)
 |-- c_credit_card_limit: string (nullable = true)
 |-- c_credit_card_number: string (nullable = true)
 |-- c_amount: string (nullable = true)
 |-- c_full_name: string (nullable = true)
 |-- c_transaction_code: string (nullable = true)
 |-- c_shop: string (nullable = true)
The code I used is:

from pyspark.sql.functions import regexp_replace

# keep only the allowed characters in each column, then drop the original column
df = df.withColumn('c_phone_number', regexp_replace("phone_number", "[^0-9]", "")).drop('phone_number')
df = df.withColumn('c_account_id', regexp_replace("account_id", "[^0-9]", "")).drop('account_id')
df = df.withColumn('c_credit_card_limit', regexp_replace("credit_card_limit", "[^0-9]", "")).drop('credit_card_limit')
df = df.withColumn('c_credit_card_number', regexp_replace("credit_card_number", "[^0-9]", "")).drop('credit_card_number')
df = df.withColumn('c_amount', regexp_replace("amount", "[^0-9 ]", "")).drop('amount')
df = df.withColumn('c_full_name', regexp_replace("full_name", "[^a-zA-Z ]", "")).drop('full_name')
df = df.withColumn('c_transaction_code', regexp_replace("transaction_code", "[^a-zA-Z]", "")).drop('transaction_code')
df = df.withColumn('c_shop', regexp_replace("shop", "[^a-zA-Z ]", "")).drop('shop')

Why does this happen? Is there a way to convert the columns back to their original data types, or should I cast them again?
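For what it's worth, the type change is easy to reproduce on a tiny DataFrame (a minimal sketch; the column name and value below are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

# hypothetical one-row DataFrame with a long column
df_demo = spark.createDataFrame([(9876543210,)], ["phone_number"])
df_demo.printSchema()   # phone_number: long

# regexp_replace returns a string column regardless of the input type
df_demo = df_demo.withColumn("c_phone_number",
                             regexp_replace("phone_number", "[^0-9]", ""))
df_demo.printSchema()   # c_phone_number: string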

You may want to take a look at how regexp_replace is implemented in the Spark git repository -

override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
  if (!p.equals(lastRegex)) {
    // regex value changed
    lastRegex = p.asInstanceOf[UTF8String].clone()
    pattern = Pattern.compile(lastRegex.toString)
  }
  if (!r.equals(lastReplacementInUTF8)) {
    // replacement string changed
    lastReplacementInUTF8 = r.asInstanceOf[UTF8String].clone()
    lastReplacement = lastReplacementInUTF8.toString
  }
  val m = pattern.matcher(s.toString())
  result.delete(0, result.length())
  while (m.find) {
    m.appendReplacement(result, lastReplacement)
  }
  m.appendTail(result)
  UTF8String.fromString(result.toString)
}
  • The code above takes the expressions as Any and then calls toString() on them.
  • Finally, it wraps the toString result back up with UTF8String.fromString(result.toString), so the output is always a string, whatever the input type was. To recover numeric columns you need an explicit cast, as sketched below.
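One way to get the original data types back is to chain a cast onto each cleaned column (a sketch only, reusing the column names from the question and assuming the cleaned values are still valid numbers):

from pyspark.sql.functions import regexp_replace

# clean and cast back to the original type in a single step
df = (
    df
    .withColumn("c_phone_number",
                regexp_replace("phone_number", "[^0-9]", "").cast("long"))
    .drop("phone_number")
    .withColumn("c_amount",
                regexp_replace("amount", "[^0-9 ]", "").cast("long"))
    .drop("amount")
)
df.printSchema()   # c_phone_number: long, c_amount: long

The same pattern applies to the remaining numeric columns; the string columns can keep the plain regexp_replace output.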
    
Reference -