Spark SQL case-sensitive filter on a column based on a pattern
sql, regex, scala, apache-spark, apache-spark-sql

How can I use a Spark SQL filter to filter a column based on a case-sensitive pattern? For example, I have the pattern "Aaaa", and my column has data like:
adaz
LssA ss
Leds ST
Pear QA
Lear QA
I want to retrieve the rows matching the "Aaaa" pattern with respect to letter case. That means the desired rows would be 'Leds ST', 'Pear QA', 'Lear QA':
"Aaaa AA" => 'Leds ST' , 'Pear QA', 'Lear QA'
"AaaA aa" => 'LssA ss'
"aaaa" => 'adaz'
How can I get this result with Spark SQL?
Or can we write a regex SQL query for this result?
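The question boils down to translating a case pattern into a regular expression: 'A' stands for one uppercase letter, 'a' for one lowercase letter, and anything else matches literally. A minimal plain-Python sketch of that conversion (the helper name `pattern_to_regex` is ours, not from any answer below):

```python
import re

def pattern_to_regex(case_pattern: str) -> str:
    """Turn a case pattern into an anchored regex:
    'A' matches one uppercase letter, 'a' one lowercase letter,
    any other character matches itself literally."""
    parts = []
    for ch in case_pattern:
        if ch == "A":
            parts.append("[A-Z]")
        elif ch == "a":
            parts.append("[a-z]")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

rows = ["adaz", "LssA ss", "Leds ST", "Pear QA", "Lear QA"]
rx = re.compile(pattern_to_regex("Aaaa AA"))
print([r for r in rows if rx.match(r)])  # ['Leds ST', 'Pear QA', 'Lear QA']
```

The generated regex string can be passed straight to Spark's `rlike` or `regexp_extract`, since Java regex syntax agrees with it for these character classes.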
We can use the Spark SQL function translate() to create a grouping column for the strings.
Using PySpark:
A sample DataFrame for testing:
from pyspark.sql.types import StringType
df = spark.createDataFrame(["adaz", "LssA ss", "Leds ST", "Pear QA","Lear QA"], StringType())
The actual transformation:
from pyspark.sql.functions import translate, collect_list, col
import string
lowercases = string.ascii_lowercase
uppercases = string.ascii_uppercase
length_alphabet = len(uppercases)
ones = "1" * length_alphabet
zeroes = "0" * length_alphabet
old = uppercases + lowercases
new = ones + zeroes
df.withColumn("group", translate(df.value, old, new)) \
.groupBy(col("group")).agg(collect_list(df.value).alias("strings")) \
.show(truncate = False)
Result:
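The show() output was not captured above; the grouping it would produce can be sketched in plain Python, where str.translate plays the role of Spark's translate():

```python
import string
from collections import defaultdict

# Uppercase letters -> '1', lowercase -> '0'; everything else (spaces)
# passes through unchanged, mirroring the translate(old, new) call above.
table = str.maketrans(
    string.ascii_uppercase + string.ascii_lowercase,
    "1" * 26 + "0" * 26,
)

rows = ["adaz", "LssA ss", "Leds ST", "Pear QA", "Lear QA"]
groups = defaultdict(list)
for r in rows:
    groups[r.translate(table)].append(r)

for pattern, strings in groups.items():
    print(pattern, strings)
# 0000 ['adaz']
# 1001 00 ['LssA ss']
# 1000 11 ['Leds ST', 'Pear QA', 'Lear QA']
```

Note that in the Spark version the row order inside each collect_list group, and the order of the groups themselves, are not guaranteed.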
Using Scala Spark with regexp_extract:
df.filter(regexp_extract($"value","^[A-Z][a-z]{3} [A-Z]{2}$",0) =!= lit("")).show(false)
Output:
+-------+
|value |
+-------+
|Leds ST|
|Pear QA|
|Lear QA|
+-------+
Expanding on @pasha701's answer:
scala> val df=List(
| "adaz",
| "LssA ss",
| "Leds ST",
| "Pear QA",
| "Lear QA"
| ).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> val df2= df.withColumn("reg1", regexp_extract($"value","^[A-Z][a-z]{3} [A-Z]{2}$",0)=!=lit("")).withColumn("reg2",regexp_extract($"value","^[a-z]{4}$",0)=!=lit("")).withColumn("reg3", regexp_extract($"value","^[A-Z][a-z]{2}[A-Z] [a-z]{2}$",0)=!=lit(""))
df2: org.apache.spark.sql.DataFrame = [value: string, reg1: boolean ... 2 more fields]
scala> val df3=df2.withColumn("reg_patt", when('reg1,"1000 11").when('reg2,"0000").when('reg3,"1001 00").otherwise("9"))
df3: org.apache.spark.sql.DataFrame = [value: string, reg1: boolean ... 3 more fields]
scala> df3.groupBy("reg_patt").agg(collect_list('value) as "newval").show(false)
+--------+---------------------------+
|reg_patt|newval |
+--------+---------------------------+
|1000 11 |[Leds ST, Pear QA, Lear QA]|
|0000 |[adaz] |
|1001 00 |[LssA ss] |
+--------+---------------------------+
scala>
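The three-regex bucketing above can be mimicked in plain Python. The labels and regexes are copied from the reg1/reg2/reg3 columns in the Scala session; the dict-based driver is our own sketch, not Spark API:

```python
import re

# Labels and regexes taken from the when(...) chain above.
buckets = {
    "1000 11": re.compile(r"^[A-Z][a-z]{3} [A-Z]{2}$"),
    "0000":    re.compile(r"^[a-z]{4}$"),
    "1001 00": re.compile(r"^[A-Z][a-z]{2}[A-Z] [a-z]{2}$"),
}

rows = ["adaz", "LssA ss", "Leds ST", "Pear QA", "Lear QA"]

def label(value: str) -> str:
    # Like when(reg1, ...).when(reg2, ...).otherwise("9"):
    # the first matching bucket wins, "9" is the fallback.
    for name, rx in buckets.items():
        if rx.match(value):
            return name
    return "9"

grouped = {}
for r in rows:
    grouped.setdefault(label(r), []).append(r)
print(grouped)
# {'0000': ['adaz'], '1001 00': ['LssA ss'], '1000 11': ['Leds ST', 'Pear QA', 'Lear QA']}
```

The limitation is the same as in the Scala version: each case pattern must be hard-coded as its own regex, so this only works for a known, fixed set of patterns.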
Dynamic pattern matching, interesting. Is the length also implied by the pattern? This solution may only partially solve the problem.
val df=List(
"adaz",
"LssA ss",
"Leds ST",
"Pear QA",
"Lear QA"
).toDF("value")
df.filter(regexp_extract($"value","^[A-Z][a-z]{3} [A-Z]{2}$",0)=!=lit("")).show(false)
+-------+
|value |
+-------+
|Leds ST|
|Pear QA|
|Lear QA|
+-------+