
Python 3.x: I want to extract all entries matching a pattern in PySpark as a list


I have a field named tags. It contains one or more values that start with size.

The values follow the pattern size_<number>.

For example:

+---------------------------------------------+
|                tags                         |
+---------------------------------------------+
|The size available are size_10 and size_100. |
|                                             |
|The size available are size_10               |
|The size available are size_20               |
+---------------------------------------------+
I want to extract only those values, as an array (i.e.).


Could you please help me solve this…
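
For reference, a minimal sketch that rebuilds this sample data in PySpark (assuming a SparkSession named spark is already available; the column name tags comes from the description above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the table above, in a column named `tags`
df = spark.createDataFrame(
    [("The size available are size_10 and size_100.",),
     (" ",),
     ("The size available are size_10",),
     ("The size available are size_20",)],
    ["tags"],
)
df.show(truncate=False)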

In Scala (the Python version is almost identical):

import spark.implicits._
import org.apache.spark.sql.functions._

// sample data from the question
val df = Seq("The size available are size_10 and size_100."," ","The size available are size_10","The size available are size_20").toDF()
df.show(false)
+--------------------------------------------+
|value                                       |
+--------------------------------------------+
|The size available are size_10 and size_100.|
|                                            |
|The size available are size_10              |
|The size available are size_20              |
+--------------------------------------------+


// a: the sentence tokens with the size_ values blanked out, b: all tokens;
// array_except(b, a) keeps only the size_ tokens, then everything except
// digits (and the commas separating them) is stripped away
df.select('value,split(regexp_replace('value, "(?:size_?)[^\\s]+","")," ").as("a"),split('value," ").as("b"))
  .select('value,split(regexp_replace(concat_ws(",",array_except('b,'a)),"[^0-9$,]",""),",").as("size"))
  .show(false)


+--------------------------------------------+---------+
|value                                       |size     |
+--------------------------------------------+---------+
|The size available are size_10 and size_100.|[10, 100]|
|                                            |[]       |
|The size available are size_10              |[10]     |
|The size available are size_20              |[20]     |
+--------------------------------------------+---------+

The Python equivalent of the code above is:

import pyspark.sql.functions as f

df.withColumn('d', f.split(
        f.regexp_replace(
            f.concat_ws(',', f.array_except(
                f.split('data', ' '),
                f.split(f.regexp_replace('data', r'(size_\d+)', ''), ' '))),
            "[^0-9$,]", ""),
        ',')).show(20, False)
If your dataset is not too large, you can also do it with a udf:

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# For every size_<n> match, keep only the numeric part after the underscore
extract = udf(lambda s: list(map(lambda x: x.split('_')[1] if len(x) > 0 else x,
                                 re.findall(r'(size_\d+)', s))),
              ArrayType(StringType()))

df.withColumn('values', extract('data')).show()
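
For a self-contained run, a sketch that applies the udf to the same sample DataFrame, plus the extra bare size_10 row that shows up in the output below (that extra row is an assumption based on the output):

# Hypothetical extra row to reproduce the "size_10" line in the output below
sample = src.union(spark.createDataFrame([("size_10",)], ["data"]))
sample.withColumn('values', extract('data')).show()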

The output in both cases:

+--------------------+---------+
|                data|   values|
+--------------------+---------+
|The size availabl...|[10, 100]|
|The size availabl...|     [10]|
|                    |       []|
|The size availabl...|     [20]|
|             size_10|     [10]|
+--------------------+---------+

All of the functions come from spark.sql.functions and are the same in Scala and Python; of course, you have to adjust the syntax for Python.

Works like a charm.