Split a string into words, check whether a word matches a list item, and return that word as a new column value
Tags: list, apache-spark, pyspark, split, apache-spark-sql

I have a dataframe where the column text contains a string (or Null). I want to split the string into words, check whether any word (of length >= 6) matches an item in the list word_list, and return that word as the value of a new column.

You can use regexp_extract to get the relevant string:
import pyspark.sql.functions as F

pattern = '|'.join([rf'{word}' for word in word_list if len(word) >= 6 and len(word) <= 11])

df2 = df.withColumn(
    'match',
    F.regexp_extract('text', rf"\b({pattern})\b", 1)
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))  # replace no match with null
)

df2.show(truncate=False)
+----------------------------------+------------+
|text |match |
+----------------------------------+------------+
|This is line one |Null |
|This is line two |Null |
|bla coroner foo bar |coroner |
|This is line three |Null |
|foo bar shakespeare |shakespeare |
|Null |Null |
+----------------------------------+------------+
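One caveat with this approach: joining the raw words with '|' assumes none of them contain regex metacharacters. If word_list might include entries like "c++", escaping each word with re.escape keeps the pattern valid. A minimal sketch, using a hypothetical word_list for illustration:

```python
import re

# Hypothetical word list; "c++puzzle" would break the raw '|'.join
# pattern unless its '+' characters are escaped first.
word_list = ['coroner', 'shakespeare', 'c++puzzle']

# Same length filter as above, but each word is regex-escaped.
pattern = '|'.join(re.escape(word) for word in word_list if 6 <= len(word) <= 11)
print(pattern)  # coroner|shakespeare|c\+\+puzzle
```

The escaped pattern can then be dropped into the same F.regexp_extract call.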
Alternatively, you can use the list word_list as an array literal and check its intersection with the split column text:
from pyspark.sql import functions as F

word_list_arr = F.array(*[F.lit(w) for w in word_list if len(w) >= 6 and len(w) <= 11])

df1 = df.withColumn(
    "match",
    F.array_join(F.array_intersect(F.split("text", " "), word_list_arr), " ")
).withColumn("match", F.expr("nullif(match, '')"))

df1.show(truncate=False)
#+----------------------------------+------------+
#|text |match |
#+----------------------------------+------------+
#|This is line one |Null |
#|This is line two |Null |
#|bla coroner foo bar |coroner |
#|This is line three |Null |
#|foo bar shakespeare |shakespeare |
#|Null |Null |
#+----------------------------------+------------+
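The second approach only matches exact space-split tokens (no substring or word-boundary matching). Its row-level logic can be sketched in plain Python, which may help when deciding between the two answers:

```python
def match_words(text, word_list):
    """Mirror the array_intersect + nullif logic for a single row:
    split on spaces, keep words present in the filtered list, and
    return None when nothing matches (or when text is Null)."""
    if text is None:
        return None
    # Same length filter as the Spark code above.
    candidates = {w for w in word_list if 6 <= len(w) <= 11}
    hits = [w for w in text.split(" ") if w in candidates]
    return " ".join(hits) or None

print(match_words("bla coroner foo bar", ["coroner", "shakespeare"]))  # coroner
print(match_words("This is line one", ["coroner", "shakespeare"]))     # None
```

Note that, unlike the regexp_extract answer, this version would return all matching words joined by spaces if a row contained more than one.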