Apache Spark SQL: replace multiple spaces with a single space in Spark SQL
I created a DataFrame using a HiveContext, and one of its columns contains records such as:
text1    text2
We want to replace the multiple spaces between the two texts with a single space, giving the final output:
text1 text2
How can we achieve this in Spark SQL? Note that we are using a HiveContext, registering a temporary table, and writing SQL queries against it.
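For concreteness, the setup described in the question might look like the sketch below. The table name records and column name val are illustrative assumptions, not from the question; on Spark 2+, SparkSession replaces HiveContext, and createOrReplaceTempView replaces registerTempTable.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("whitespace-demo")
  .enableHiveSupport() // Hive support, analogous to the question's HiveContext
  .getOrCreate()
import spark.implicits._

// Build a small DataFrame and register it as a temp table
// (names are illustrative assumptions).
val df = Seq("text1    text2").toDF("val")
df.createOrReplaceTempView("records")

// SQL queries can now be written against the temp table.
spark.sql("select val from records").show()
```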
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF on a local List

val myUDf = udf((s: String) => Array(s.trim.replaceAll(" +", " ")))
// error: object java.lang.String is not a value --> wrap the result in Array
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df
.withColumn("udfResult",myUDf(col("val")))
.withColumn("new_val", col("udfResult")(0))
.drop("udfResult")
new_df.show
Output on Databricks:
+--------------------+
| val|
+--------------------+
| i like cheese|
| the dog runs |
|text111111 text...|
+--------------------+
+--------------------+--------------------+
| val| new_val|
+--------------------+--------------------+
| i like cheese| i like cheese|
| the dog runs | the dog runs|
|text111111 text...|text111111 text22...|
+--------------------+--------------------+
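As an aside (my addition, not part of the original answer), the same result can be had without a UDF at all, using the built-in regexp_replace and trim column functions from org.apache.spark.sql.functions:

```scala
import org.apache.spark.sql.functions.{col, regexp_replace, trim}

// Collapse runs of spaces to one, then strip leading/trailing whitespace.
val new_df2 = df.withColumn("new_val", trim(regexp_replace(col("val"), " +", " ")))
new_df2.show
```

Built-in column functions are generally preferable to UDFs because Catalyst can optimize them and no serialization round-trip to the JVM closure is needed.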
Even better, I have now been inspired by a real expert. It is in fact simpler:
import org.apache.spark.sql.functions._
// val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
val myUDf = udf((s:String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)
// Then there is no need to play with columns excessively:
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df.withColumn("new_val", myUDf(col("val")))
new_df.show
Alternatively, just do it in spark.sql with regexp_replace(col1, ' +', ' ').
Check it:
spark.sql("""
select regexp_replace(col1, ' +', ' ') as col2
from (
select 'text1 text2 text3' as col1
)
""").show(20,False)
Output:
+-----------------+
|col2 |
+-----------------+
|text1 text2 text3|
+-----------------+
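To tie this back to the question's setup (a registered temporary table), the same query can be run against the table directly. The table name my_table is an assumption for illustration; on a HiveContext the registration call would be registerTempTable instead.

```scala
// Register the DataFrame so it is visible to SQL queries.
df.createOrReplaceTempView("my_table")

// trim removes leading/trailing whitespace; regexp_replace collapses
// interior runs of spaces to a single space.
spark.sql("""
  select val, regexp_replace(trim(val), ' +', ' ') as new_val
  from my_table
""").show(20, false)
```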
I think you mean the records held in a column via the HiveContext? Use trim first, then concatenate both parts, adding a single space in between. A better approach is below.
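The trim-then-concatenate idea from this comment can be sketched in plain Scala (no Spark needed): split on runs of whitespace and re-join with single spaces. The helper name normalizeSpaces is my own, for illustration only.

```scala
// Collapse all interior whitespace runs to single spaces and
// drop leading/trailing whitespace.
def normalizeSpaces(s: String): String =
  s.trim.split("\\s+").mkString(" ")

normalizeSpaces("  the   dog runs ") // "the dog runs"
```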