Split a string into words, check whether a word matches a list item, and return that word as the value of a new column


I have a dataframe in which the column `text` contains a string (or null). I want to split each string into words, check whether a word matches an item in a word list, and return the matched word as the value of a new column.
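To make the requirement concrete, here is the desired per-row logic in plain Python (the helper name `first_match` is hypothetical, used only to illustrate what the Spark solutions below compute):

```python
def first_match(text, word_list):
    """Return the first whitespace-separated word of `text` that
    appears in `word_list`, or None when nothing matches or the
    input itself is None (i.e. a null row)."""
    if text is None:
        return None
    for word in text.split():
        if word in word_list:
            return word
    return None

words = ["coroner", "shakespeare"]
first_match("bla coroner foo bar", words)  # → "coroner"
first_match("This is line one", words)     # → None
```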


If the relevant words in `word_list` are between 6 and 11 characters long, you can use `regexp_extract` to pull out the matching string:

import pyspark.sql.functions as F

pattern = '|'.join([word for word in word_list if 6 <= len(word) <= 11])

df2 = df.withColumn(
    'match',
    F.regexp_extract(
        'text',
        rf"\b({pattern})\b",
        1
    )
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))    # replace no match with null
)

df2.show(truncate=False)
+----------------------------------+------------+
|text                              |match       |
+----------------------------------+------------+
|This is line one                  |null        |
|This is line two                  |null        |
|bla coroner foo bar               |coroner     |
|This is line three                |null        |
|foo bar shakespeare               |shakespeare |
|null                              |null        |
+----------------------------------+------------+
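One caveat with the regex approach: the words are interpolated into the pattern verbatim, so an entry containing a regex metacharacter (a dot, a parenthesis) can match unintended strings or break the pattern. Escaping each word with `re.escape` before joining avoids this; a sketch in plain Python (the `word_list` contents here are made up for illustration):

```python
import re

word_list = ["coroner", "shakespeare", "foo.bar"]  # hypothetical entries

# Escape each word so any metacharacters are matched literally.
pattern = "|".join(re.escape(w) for w in word_list if 6 <= len(w) <= 11)

m = re.search(rf"\b({pattern})\b", "bla coroner foo bar")
print(m.group(1))  # → coroner
```

The escaped `pattern` can then be passed to `F.regexp_extract` as above; `re.escape` output is generally compatible with the Java regex dialect Spark uses, though it is worth verifying for unusual characters.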

Alternatively, you can use the list `word_list` as an array literal and check the intersection of that array with the split `text` column:

from pyspark.sql import functions as F

word_list_arr = F.array(*[F.lit(w) for w in word_list if len(w) >= 6 and len(w) <= 11])

df1 = df.withColumn(
    "match",
    F.array_join(F.array_intersect(F.split("text", " "), word_list_arr), " ")
).withColumn("match", F.expr("nullif(match, '')"))

df1.show(truncate=False)
#+----------------------------------+------------+
#|text                              |match       |
#+----------------------------------+------------+
#|This is line one                  |null        |
#|This is line two                  |null        |
#|bla coroner foo bar               |coroner     |
#|This is line three                |null        |
#|foo bar shakespeare               |shakespeare |
#|null                              |null        |
#+----------------------------------+------------+
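Note that `array_intersect` returns every distinct word common to both arrays, so a line containing several listed words gets them all joined into `match`, whereas `regexp_extract` returns only the first match. The equivalent plain-Python logic (helper name made up for illustration):

```python
def all_matches(text, word_list):
    """Return the distinct words of `text` that appear in `word_list`,
    joined by a space, or None when nothing matches - mirroring
    array_join(array_intersect(split(text), word_list)) plus nullif."""
    if text is None:
        return None
    wl = set(word_list)
    matched = [w for w in text.split(" ") if w in wl]
    # dict.fromkeys keeps first-seen order while removing duplicates
    return " ".join(dict.fromkeys(matched)) or None

all_matches("bla coroner foo shakespeare", ["coroner", "shakespeare"])
# → "coroner shakespeare"
```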
