Apache Spark SQL: replace multiple spaces with a single space in Spark SQL


I created a DataFrame using HiveContext, where one column holds records like:

text1        text2
We want to replace the run of spaces between the two texts with a single space and get the final output:

text1 text2
How can we achieve this in Spark SQL? Note that we are using a Hive context, registering a temporary table, and writing SQL queries against it.

import org.apache.spark.sql.functions._
// import spark.implicits._ // needed for .toDF outside a notebook (spark = your SparkSession)

val myUDf = udf((s: String) => Array(s.trim.replaceAll(" +", " ")))
// wrapping in Array(...) works around "error: object java.lang.String is not a value"

val data = List("i  like    cheese", "  the dog runs   ", "text111111   text2222222")
val df = data.toDF("val")
df.show()

val new_df = df
  .withColumn("udfResult", myUDf(col("val")))
  .withColumn("new_val", col("udfResult")(0)) // pull the single element back out of the array
  .drop("udfResult")
new_df.show
Output on Databricks:

+--------------------+
|                 val|
+--------------------+
|   i  like    cheese|
|     the dog runs   |
|text111111   text...|
+--------------------+

+--------------------+--------------------+
|                 val|             new_val|
+--------------------+--------------------+
|   i  like    cheese|       i like cheese|
|     the dog runs   |        the dog runs|
|text111111   text...|text111111 text22...|
+--------------------+--------------------+
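Since the question specifically mentions registering a temporary table and querying it with SQL, here is a minimal sketch of exposing the same cleanup logic to SQL as a named UDF. The name squash_spaces and the SparkSession value spark are assumptions for this sketch; on Spark 1.x the register call lives on the HiveContext, and registerTempTable replaces createOrReplaceTempView.

// Hypothetical registration of the cleanup logic for use from SQL
spark.udf.register("squash_spaces", (s: String) => s.trim.replaceAll("\\s+", " "))

df.createOrReplaceTempView("records") // df as built above
spark.sql("select val, squash_spaces(val) as new_val from records").show()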

Even better, I have since been inspired by a true expert. It is in fact simpler:

import org.apache.spark.sql.functions._

// val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
val myUDf = udf((s: String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)
// Then there is no need to play with columns excessively:

val data = List("i  like    cheese", "  the dog runs   ", "text111111   text2222222")
val df = data.toDF("val")
df.show()

val new_df = df.withColumn("new_val", myUDf(col("val")))
new_df.show
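For reference, the same result is possible without any UDF, using the built-in trim and regexp_replace column functions on the df defined above (a sketch, not from the original answers):

import org.apache.spark.sql.functions.{col, regexp_replace, trim}

// trim drops leading/trailing whitespace; regexp_replace collapses inner runs
val cleaned = df.withColumn("new_val", regexp_replace(trim(col("val")), "\\s+", " "))
cleaned.show()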
Just do it in spark.sql:

regexp_replace(col, ' +', ' ')

Check it:

spark.sql("""
    select regexp_replace(col1, ' +', ' ') as col2
    from (
        select 'text1        text2     text3' as col1
    )
""").show(20, false)
Output:

+-----------------+
|col2             |
+-----------------+
|text1 text2 text3|
+-----------------+
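Note that the ' +' pattern only collapses runs of spaces and keeps any single leading or trailing space; wrapping the call in trim and switching to a whitespace class also covers tabs and newlines. A sketch (assuming default SQL string-literal escaping, so '\\s+' in the query text reaches the regex engine as \s+):

spark.sql("""
    select trim(regexp_replace(col1, '\\s+', ' ')) as col2
    from (
        select '   text1        text2     text3  ' as col1
    )
""").show(20, false)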

I think you mean the records held in a column via HiveContext? Use trim first, then concatenate both with a space in between; better approaches are in the answers.