
Regex: How to extract the value after a specific string in Scala (Spark)?


I have a dataframe with these columns:

df=

and I need to get:

itemType                   count
it_shampoo                  5
it_books                    5
it_mm                       5
it_mm                       5
it_books                    5
it_books                    5
How can I replace `it_books it_books` and `{=it_books} it_books` with `it_books`? The item type will always start with `it`.

Try the regex `^.*(it[\w]+).*$` against itemType and replace the whole match with the first captured group, `$1`.
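As a quick sanity check outside Spark, the same replacement can be sketched with plain Scala's `replaceAll` on the sample values from the question (the trailing `.*` lets the capture group keep its last character even when nothing follows it):

```scala
// Sketch: apply the answer's regex with plain-Scala replaceAll.
// The whole match is replaced by the first capture group.
object RegexSketch {
  val pattern = """^.*(it[\w]+).*$"""

  def extractItemType(s: String): String = s.replaceAll(pattern, "$1")

  def main(args: Array[String]): Unit = {
    val samples = Seq("it_shampoo", "it_books", "{it_mm}",
                      "it_books it_books", "{=it_books} it_books")
    samples.foreach(s => println(s"$s -> ${extractItemType(s)}"))
    // it_shampoo -> it_shampoo
    // it_books -> it_books
    // {it_mm} -> it_mm
    // it_books it_books -> it_books
    // {=it_books} it_books -> it_books
  }
}
```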


The regex below also works:

scala> val df = Seq(("it_shampoo",5),
     | ("it_books",5),
     | ("it_mm",5),
     | ("{it_mm}",5),
     | ("it_books it_books",5),
     | ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]

scala> df.select( regexp_replace('itemtype,""".*\b(\S+)\b(.*)$""", "$1").as("replaced"),'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+


scala>
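That pattern effectively keeps the last whitespace-delimited word of the string, stripped of any trailing non-word characters such as `}`. A minimal plain-Scala check of the same regex, without Spark:

```scala
// Sketch: verify the answer's regex via String.replaceAll (java.util.regex).
object LastWordSketch {
  // Keep the last word-bounded token; everything else is dropped.
  def keepLastWord(s: String): String =
    s.replaceAll(""".*\b(\S+)\b(.*)$""", "$1")

  def main(args: Array[String]): Unit = {
    Seq("it_shampoo", "{it_mm}", "it_books it_books", "{=it_books} it_books")
      .foreach(s => println(s"$s -> ${keepLastWord(s)}"))
    // it_shampoo -> it_shampoo
    // {it_mm} -> it_mm
    // it_books it_books -> it_books
    // {=it_books} it_books -> it_books
  }
}
```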

Your regex worked for me. The code I used is:

val df2 = df.withColumn("itemType", regexp_extract($"itemType", """^.*(it[\w]+).*$""", 1))
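The extract-style call can be mimicked in plain Scala with `scala.util.matching.Regex` (a sketch, no Spark required; like `regexp_extract`, it returns the requested group of the first match, or an empty string when nothing matches):

```scala
import scala.util.matching.Regex

// Sketch: mimic Spark's regexp_extract(col, pattern, 1) on plain strings.
object ExtractSketch {
  val pattern: Regex = """^.*(it[\w]+).*$""".r

  // Return group 1 of the first match, or "" if the pattern does not match.
  def extract(s: String): String =
    pattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    println(extract("{=it_books} it_books")) // it_books
    println(extract("{it_mm}"))              // it_mm
    println(extract("no match here"))        // (empty string)
  }
}
```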