How to fix the exception "Literals of type 'E' are currently not supported" when applying regex_replace on a dataframe in Scala?


regex postgresql scala apache-spark


I created a dataframe by reading an RDBMS table, as shown below:

val dataDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${query}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 15)
  .load()

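Before ingesting the data into a Hive table on HDFS, we were asked to apply a regex_replace pattern to the string-typed columns of the dataframe. This is how I applied it: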

val regExpr = dataDF.schema.fields.map { x =>
  if (x.dataType == StringType)
  "regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(%s, E'[\\\\n]+', ' ', 'g' ), E'[\\\\r]+', ' ', 'g' ), E'[\\\\t]+', ' ', 'g' ), E'[\\\\cA]+', ' ', 'g' ), E'[\\\\ca]+', ' ', 'g' ) as %s".format(x.name, x.name)
  else
    x.name
}
dataDF.selectExpr(regExpr:_*)

But when I run the code, it ends with the following exception:

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
Literals of type 'E' are currently not supported.(line 1, pos 88)

== SQL ==
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(period_name, E'[\\n]+', ' ', 'g' ), E'[\\r]+', ' ', 'g' ), E'[\\t]+', ' ', 'g' ), E'[\\cA]+', ' ', 'g' ), E'[\\ca]+', ' ', 'g' ) as period_name
----------------------------------------------------------------------------------------^^^

I printed the schema using println(dataDF.schema). The code identifies the string columns correctly; you can see the column name period_name in it:

Schema: StructType(StructField(forecast_id,LongType,true), StructField(period_num,DecimalType(15,0),true), StructField(period_name,StringType,true), StructField(drm_org,StringType,true), StructField(ledger_id,LongType,true), StructField(currency_code,StringType,true))

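The requirement is to remove whitespace that appears in several forms. The data in a string column can contain values with whitespace in multiple formats, for example: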

1,             b,c,   d,

e,Ωåf

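There are multiple spaces, tabs, values that appear after a new line, special characters to be removed (if any), and so on. The lines above should be converted to: 1,b,c,d,e,f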

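The table being read lives in a Postgres database. I tried to understand why the E is causing the exception, but I could not find a clue. Could anyone tell me how to fix this exception?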

For testing purposes, I created a dataframe from the string you provided, with the special characters in column col3, as follows:

+----+----+--------------------------------------------------------------------+
|col1|col2|col3                                                                |
+----+----+--------------------------------------------------------------------+
|a   |1   |1,          -   b,c,   d,
                 |
                 |e,Ωåf|
+----+----+--------------------------------------------------------------------+

Then, as suggested in the comments, select the columns of StringType, use foldLeft with withColumn, and call regexp_replace inside withColumn. You can do the following:

import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// names of the columns whose type is StringType
val stringColumns = df.schema.fields.filter(_.dataType == StringType).map(_.name)
// remove every character except digits, letters (a-z, A-Z) and commas
val finaldf = stringColumns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName, regexp_replace(col(colName), "[ ](?=[ ])|[^,A-Za-z0-9]+", ""))
}

As a result, finaldf looks like this:

+----+----+-----------+
|col1|col2|col3       |
+----+----+-----------+
|a   |1   |1,b,c,d,e,f|
+----+----+-----------+

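You can change the regex pattern [ ](?=[ ])|[^,A-Za-z0-9]+ to suit your needs. As written, the only characters that are not removed are commas, letters (A-Z, a-z) and digits (0-9).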
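Since Spark's regexp_replace is based on Java regular expressions, the pattern can be previewed locally with String.replaceAll before running it on a dataframe. The snippet below is only a minimal sketch: the names pattern, strip and sample are illustrative and not part of the original answer, and the sample string is typed in from the test row above.

```scala
// Pattern from the answer: a space followed by another space, or any run of
// characters that are not a comma, an ASCII letter or a digit.
val pattern = "[ ](?=[ ])|[^,A-Za-z0-9]+"

// Hypothetical helper: applies the pattern with plain Java regex semantics.
def strip(s: String): String = s.replaceAll(pattern, "")

// Sample value resembling col3 in the test dataframe.
val sample = "1,          -   b,c,   d,\n\ne,Ωåf"

println(strip(sample)) // prints 1,b,c,d,e,f
```

If some whitespace has to survive (for example a single space between words), the replacement string and the character class are the two places to adjust.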

I hope the answer is helpful.
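On the original error: E'...' is PostgreSQL's escape-string syntax and the trailing 'g' argument is PostgreSQL's "replace all" flag. Spark's SQL parser appears to read E'...' as a typed literal of an unknown type E, which is what the exception reports. Spark's regexp_replace already replaces every match, so the flag is not needed either. The sketch below keeps the intent of the nested calls from the question (newlines, carriage returns, tabs and Ctrl-A become a space) but goes through the Column API, so no SQL string literal has to be escaped. The helper name cleanStringColumns is hypothetical, and merging the separate patterns into one character class is an assumption about what the old code was meant to do.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Hypothetical helper (not from the original post): in every StringType column,
// replace each run of newline, carriage return, tab or Ctrl-A characters with
// a single space; all other columns are passed through unchanged.
def cleanStringColumns(df: DataFrame): DataFrame = {
  val cleaned: Array[Column] = df.schema.fields.map { field =>
    if (field.dataType == StringType)
      regexp_replace(col(field.name), "[\\n\\r\\t\\cA]+", " ").as(field.name)
    else
      col(field.name)
  }
  df.select(cleaned: _*)
}
```

With the dataframe from the question this would be called as cleanStringColumns(dataDF), in place of dataDF.selectExpr(regExpr:_*).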

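What are you trying to do with so many nested regexp_replace calls? Can you explain with an example?

I am actually new to regular expressions. This is used in code that is already working. I was told to apply it before ingesting the data into Hive, so I am just using the same approach. If you would like to see the existing/old code, I can add that part too.

Can you give an example of what you want to replace with what? regexp_replace is a built-in function that is applied to Spark dataframe columns, and you are applying it to your schema. For now I suggest you select the StringType columns, use foldLeft and withColumn, and use regexp_replace inside withColumn.

Yes, and let me know if there are any problems. If it works, please don't forget to accept and upvote.

I tried the following: val stringColumns = dataDF.schema.fields.filter(_.dataType == StringType).map(_.name) val finaldf = stringColumns.foldLeft(dataDF){(tempdf, colName) => tempdf.withColumn(colName, regexp_replace(col(colName), "\\s+", ""))} I get a compile-time error on the line regexp_replace(col(colName), "\\s+", "", 'g') with the message "cannot resolve symbol col". Is there anything that needs to be changed?

I have edited the answer. All you need is to import org.apache.spark.sql.functions._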