Regex: How do I extract the value after a specific string in Scala (Spark)?
I have a DataFrame with a column: df= I need to get:
itemType count
it_shampoo 5
it_books 5
it_mm 5
it_mm 5
it_books 5
it_books 5
How can I extract the item type, so that "it_books it_books" and "{=it_books} it_books" are both replaced with "it_books"? The item type always begins with it.
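Try applying the regex ^.*(it[\w]+).*$ to itemType and replacing the whole match with the first captured group, $1. The greedy leading .* makes the group capture the last it... token in the string.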
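To see what a pattern of this shape does before running it on a cluster, here is a minimal plain-Scala sketch. Spark's regexp_replace follows Java regex semantics, so String.replaceAll behaves the same way on a single value. The object name ItemTypeRegexCheck and the helper clean are hypothetical, and the sketch assumes the pattern ends in .*$ so that a value ending right after the item type is still captured in full.

```scala
// Plain-Scala check of the pattern (no Spark needed).
// ItemTypeRegexCheck and clean are illustrative names, not part of any library.
object ItemTypeRegexCheck {
  // The greedy ^.* pushes the capture group to the last "it..." token in the string.
  // Assumption: the pattern ends with .*$ so trailing characters such as "}" are optional.
  val pattern = """^.*(it[\w]+).*$"""

  // Replace the whole match with the first captured group.
  def clean(s: String): String = s.replaceAll(pattern, "$1")

  def main(args: Array[String]): Unit = {
    val samples = Seq("it_shampoo", "{it_mm}", "it_books it_books", "{=it_books} it_books")
    samples.foreach(s => println(s"$s -> ${clean(s)}"))
  }
}
```

One caveat with this approach: because the match is greedy, a value whose name contains another "it" (for example it_kitchen) would be cut at the later occurrence, so the pattern may need a stricter prefix such as it_ for such data.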
The following regex also works:
scala> val df = Seq(("it_shampoo",5),
| ("it_books",5),
| ("it_mm",5),
| ("{it_mm}",5),
| ("it_books it_books",5),
| ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]
scala> df.select( regexp_replace('itemtype,""".*\b(\S+)\b(.*)$""", "$1").as("replaced"),'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+
scala>
Your regex worked for me. The code I used is:
val df2 = df.withColumn("itemType", regexp_extract($"itemType", """^.*(it.[\w]+).*$""", 1))
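When choosing between regexp_extract and regexp_replace, one difference matters for rows that do not contain an item type at all: regexp_extract returns an empty string when the pattern does not match, while regexp_replace leaves the input unchanged. The sketch below only mimics that behaviour in plain Scala with java.util.regex; the names ExtractVsReplace, extractLike and replaceLike are hypothetical, and the pattern is assumed to end in .*$.

```scala
import java.util.regex.Pattern

// Illustrative stand-ins for the two Spark functions; not Spark code.
object ExtractVsReplace {
  private val p = Pattern.compile("""^.*(it[\w]+).*$""")

  // Mimics regexp_extract(col, pattern, 1): empty string when nothing matches.
  def extractLike(s: String): String = {
    val m = p.matcher(s)
    if (m.find()) m.group(1) else ""
  }

  // Mimics regexp_replace(col, pattern, "$1"): input comes back unchanged when nothing matches.
  def replaceLike(s: String): String = p.matcher(s).replaceAll("$1")

  def main(args: Array[String]): Unit = {
    Seq("{=it_books} it_books", "shampoo").foreach { s =>
      println(s"$s | extract='${extractLike(s)}' | replace='${replaceLike(s)}'")
    }
  }
}
```

If every row is guaranteed to contain an item type, the two approaches give the same result; otherwise the choice decides whether unmatched rows become empty or keep their original text.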