Python 3.x — I want to extract all entries matching a pattern in PySpark as a list
I have a field named tags. It contains one or more values that start with size; the pattern is size_. For example:
+---------------------------------------------+
| tags                                        |
+---------------------------------------------+
|The size available are size_10 and size_100. |
|                                             |
|The size available are size_10               |
|The size available are size_20               |
+---------------------------------------------+
I want to extract just the values, on their own, as an array (i.e.)
Could you help me solve… Here it is in Scala; the Python version is almost identical:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("The size available are size_10 and size_100.", " ", "The size available are size_10", "The size available are size_20").toDF()
df.show(false)
+--------------------------------------------+
|value                                       |
+--------------------------------------------+
|The size available are size_10 and size_100.|
|                                            |
|The size available are size_10              |
|The size available are size_20              |
+--------------------------------------------+
df.select('value,
    split(regexp_replace('value, "(?:size_?)[^\\s]+", ""), " ").as("a"), // tokens left after blanking out size_* entries
    split('value, " ").as("b"))                                          // all whitespace-delimited tokens
  .select('value,
    split(regexp_replace(concat_ws(",", array_except('b, 'a)), "[^0-9$,]", ""), ",").as("size")) // keep only digits
  .show(false)
+--------------------------------------------+---------+
|value                                       |size     |
+--------------------------------------------+---------+
|The size available are size_10 and size_100.|[10, 100]|
|                                            |[]       |
|The size available are size_10              |[10]     |
|The size available are size_20              |[20]     |
+--------------------------------------------+---------+
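To see why this works: b holds every whitespace-delimited token, while a holds the tokens that survive once the size_* entries are blanked out, so array_except('b, 'a) leaves exactly the size tokens (trailing punctuation included); the final regexp_replace then strips everything except digits and commas. Here is a minimal PySpark sketch of just that intermediate step, assuming an active SparkSession named spark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame(
    [("The size available are size_10 and size_100.",)], ["value"])

df2.select(
    F.split("value", " ").alias("b"),  # all tokens
    F.split(F.regexp_replace("value", r"(?:size_?)[^\s]+", ""), " ").alias("a"),  # tokens minus size_* entries
).select(
    F.array_except("b", "a").alias("size_tokens")  # only the size_* tokens remain
).show(truncate=False)
# +--------------------+
# |size_tokens         |
# +--------------------+
# |[size_10, size_100.]|
# +--------------------+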
The full Python equivalent of the Scala code is:
from pyspark.sql import functions as f

df.withColumn('d', f.split(f.regexp_replace(f.concat_ws(',',
    f.array_except(f.split('data', ' '),
                   f.split(f.regexp_replace('data', r'(size_\d+)', ''), ' '))),
    "[^0-9$,]", ""), ',')).show(20, False)
If your dataset is not too large, you can also do it with a UDF:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Find every size_<digits> token and keep only the part after the underscore
extract = udf(lambda s: [x.split('_')[1] for x in re.findall(r'(size_\d+)', s)],
              ArrayType(StringType()))
df.withColumn('values', extract('data')).show()
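For completeness, here is one way to build the test DataFrame used in the snippets above; the exact rows are reconstructed from the printed output below, so treat them as an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Rows reconstructed from the printed output below (assumption)
df = spark.createDataFrame(
    [("The size available are size_10 and size_100.",),
     ("The size available are size_10",),
     (" ",),
     ("The size available are size_20",),
     ("size_10",)],
    ["data"])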
Output in both cases:
+--------------------+---------+
|                data|   values|
+--------------------+---------+
|The size availabl...|[10, 100]|
|The size availabl...|     [10]|
|                    |       []|
|The size availabl...|     [20]|
|             size_10|     [10]|
+--------------------+---------+
All the functions come from spark.sql.functions and are the same in Scala and Python; of course, you should adapt the syntax for Python. Works like a charm!