How to replace parts of a string using regex pattern matching in Scala?

I have a string that contains column names and data types, as shown below:

val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"

My requirement is to convert the Postgres data types, which appear in the forms numeric and numeric(15,10), into Spark SQL-compatible data types. In this case:

numeric         -> decimal(38,30)
numeric(15,10)  -> decimal(15,10)
numeric(20,0)   -> bigint   (an integral data type, since its scale is zero)

To get at the data types in the string cdt, I split it and created a Seq from it:

val dt = cdt.split("\\|").toSeq

Now I have a sequence in which each element is a string in the following format:

Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")

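I have this pattern-matching regex:

"numeric\(\d+,(\d+)\)".r

It handles numeric(precision, scale), but only when a two-digit scale is present, e.g. numeric(20,23). I am very new to regex and Scala, and I don't know how to create regex patterns for the remaining two cases and apply them to the strings to match the conditions. I tried it in the following way, but it gives me a compile error: "Cannot resolve symbol findFirstMatchIn".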

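I am trying to convert the input string into a final output string, as shown below (input first, then the desired result):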

val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"

header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint

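How can I create multiple regex patterns for the different cases, apply pattern matching to the data type of each value in the Seq, and change it to the appropriate data type mentioned above?

Could anyone let me know how to achieve this?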

It can be done with a single regex pattern, but the match results need a few extra checks.

val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r

cdt.split("\\|")
   .map{
     case numericRE(col,a,b) =>
       if (Option(b).isEmpty) s"$col:decimal(38,30)"
       else if (b == "0")     s"$col:bigint"
       else                   s"$col:decimal($a,$b)"
     case x => x  //pass-through
  }.mkString("|")

//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint

Of course, it could also be done with three different regex patterns, but I think this version is pretty clear.
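For comparison, here is a minimal sketch of what a three-pattern variant might look like. The object name `ThreePatternConversion`, the method `convert`, and the three regex names are hypothetical and not part of the original answer. Each pattern covers exactly one of the three cases, and the zero-scale case is listed before the general precision/scale case so that it wins.

```scala
object ThreePatternConversion {
  // Hypothetical names; each regex covers exactly one of the three cases.
  private val PlainNumeric  = raw"([^:]+):numeric".r                 // numeric
  private val ZeroScale     = raw"([^:]+):numeric\(\d+,0\)".r        // numeric(p,0)
  private val ScaledNumeric = raw"([^:]+):numeric\((\d+),(\d+)\)".r  // numeric(p,s)

  def convert(cdt: String): String =
    cdt.split("\\|").map {
      case PlainNumeric(col)        => s"$col:decimal(38,30)"
      case ZeroScale(col)           => s"$col:bigint"       // must come before ScaledNumeric
      case ScaledNumeric(col, p, s) => s"$col:decimal($p,$s)"
      case other                    => other                // pass-through
    }.mkString("|")
}
```

A Regex used as an extractor only matches when the whole string matches, so `PlainNumeric` does not accidentally catch `numeric(15,10)`.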

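Explanation

- `raw`: avoids needing so many escape characters (`\`)
- `([^:]+)`: captures everything before the first colon
- `:numeric`: followed by the string ":numeric"
- `(?:`: starts a non-capturing group
- `\((\d+),(\d+)\)`: captures the two digit strings, separated by a comma, inside the parentheses
- `)?`: the non-capturing group is optional
- `numericRE(col,a,b)`: `col` is the first capture group; `a` and `b` are the digit captures, but they sit inside the optional non-capturing group, so they may be `null`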
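The point about `null` captures can be made explicit by wrapping each optional group in `Option`. The following is only a sketch; the object `OptionalGroups` and the method `parse` are hypothetical names introduced for illustration, while the regex is the one from the answer.

```scala
object OptionalGroups {
  val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r

  // Returns (column, precision, scale) for numeric columns.
  // Precision and scale are None when the optional "(p,s)" part is absent,
  // because unmatched groups come back as null and Option(null) is None.
  def parse(field: String): Option[(String, Option[String], Option[String])] =
    field match {
      case numericRE(col, a, b) => Some((col, Option(a), Option(b)))
      case _                    => None // not a numeric column
    }
}
```

With this shape, the three cases in the answer correspond to a scale of `None`, `Some("0")`, and any other `Some(...)` value.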

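In your example, where does (38,30) come from? What is the format returned by changeDataType()?

@jwvh, the (38,30) comes from the Spark DataFrame. When I read the table from Postgres, Spark infers the schema according to its compatible data types, but the corresponding data type on Postgres is just numeric. If I try to save the DataFrame into a Hive table, it throws an exception: "precision 39 exceeds max limit 38". But if the values are read directly as (38,30), the content is passed through correctly. Regarding changeDataType(), I have updated the question. Please take a look now.

Awesome, it worked. Could you please explain how the expression raw"([^:]+):numeric(?:\((\d+),(\d+)\))?" works?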