Split a string into words, check whether a word matches a list item, and return that word as a new column value
Tags: list, apache-spark, pyspark, split, apache-spark-sql

I have a dataframe where the column text contains a string (or Null). I want to split the string into words, check whether any word (of length >= 6) matches an item in the list word_list, and return that word as the value of a new column.

You can use regexp_extract to get the relevant string:
import pyspark.sql.functions as F

pattern = '|'.join([rf'{word}' for word in word_list if len(word) >= 6 and len(word) <= 11])

df2 = df.withColumn(
    'match',
    F.regexp_extract('text', rf"\b({pattern})\b", 1)
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))  # replace no match with null
)

df2.show(truncate=False)
+----------------------------------+------------+
|text |match |
+----------------------------------+------------+
|This is line one |Null |
|This is line two |Null |
|bla coroner foo bar |coroner |
|This is line three |Null |
|foo bar shakespeare |shakespeare |
|Null |Null |
+----------------------------------+------------+
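One caveat with this approach: joining the raw words with '|' assumes none of them contain regex metacharacters. If word_list might include entries like "c++", escaping each word with re.escape keeps the pattern valid. A minimal sketch, using a hypothetical word_list for illustration:

```python
import re

# Hypothetical word list; "c++puzzle" would break the raw '|'.join
# pattern unless its '+' characters are escaped first.
word_list = ['coroner', 'shakespeare', 'c++puzzle']

# Same length filter as above, but each word is regex-escaped.
pattern = '|'.join(re.escape(word) for word in word_list if 6 <= len(word) <= 11)
print(pattern)  # coroner|shakespeare|c\+\+puzzle
```

The escaped pattern can then be dropped into the same F.regexp_extract call.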
Alternatively, you can use the list word_list as an array literal and check its intersection with the split column text:
from pyspark.sql import functions as F

word_list_arr = F.array(*[F.lit(w) for w in word_list if len(w) >= 6 and len(w) <= 11])

df1 = df.withColumn(
    "match",
    F.array_join(F.array_intersect(F.split("text", " "), word_list_arr), " ")
).withColumn("match", F.expr("nullif(match, '')"))

df1.show(truncate=False)
#+----------------------------------+------------+
#|text |match |
#+----------------------------------+------------+
#|This is line one |Null |
#|This is line two |Null |
#|bla coroner foo bar |coroner |
#|This is line three |Null |
#|foo bar shakespeare |shakespeare |
#|Null |Null |
#+----------------------------------+------------+
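The second approach only matches exact space-split tokens (no substring or word-boundary matching). Its row-level logic can be sketched in plain Python, which may help when deciding between the two answers:

```python
def match_words(text, word_list):
    """Mirror the array_intersect + nullif logic for a single row:
    split on spaces, keep words present in the filtered list, and
    return None when nothing matches (or when text is Null)."""
    if text is None:
        return None
    # Same length filter as the Spark code above.
    candidates = {w for w in word_list if 6 <= len(w) <= 11}
    hits = [w for w in text.split(" ") if w in candidates]
    return " ".join(hits) or None

print(match_words("bla coroner foo bar", ["coroner", "shakespeare"]))  # coroner
print(match_words("This is line one", ["coroner", "shakespeare"]))     # None
```

Note that, unlike the regexp_extract answer, this version would return all matching words joined by spaces if a row contained more than one.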