
Scala: How to transform a DF to add a column containing the keywords from a list that another string column contains

Tags: scala, apache-spark

Suppose I have a list of keywords in Scala:

val keywords = List("pineapple", "lemon")

and a DataFrame like this:

+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+
How can I transform this DataFrame into one with a new column holding the keywords that Body contains? The result I want is:

+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+
Check the code below.

Create a DataFrame with the required sample data:

scala> val df = Seq(
      (123,"I contain both keywords pineapple and lemon"),
      (456,"I sadly don't contain anything"),
      (789,"Pineapple's are delicious")).toDF("id","body")

df: org.apache.spark.sql.DataFrame = [id: int, body: string]
Use typedLit to add the keywords list as a column of the DataFrame, then use the filter higher-order function to check which of the keywords the body contains:

scala> df
.withColumn("keywords",typedLit(keywords))
.withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
.show(false)
Final output:

+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
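The SQL expression filter(keywords, keyword -> instr(lower(body), keyword) > 0) keeps each keyword whose position in the lower-cased body is greater than zero, i.e. a plain case-insensitive substring test. As a minimal sketch of what that computes per row (plain Scala, no Spark; the object and method names are illustrative, not part of any API):

```scala
object KeywordFilterSketch {
  // Mirrors the Spark SQL expression per row:
  //   filter(keywords, keyword -> instr(lower(body), keyword) > 0)
  // instr(s, sub) > 0 is equivalent to s.contains(sub).
  val keywords: List[String] = List("pineapple", "lemon")

  def containedKeywords(body: String): List[String] =
    keywords.filter(k => body.toLowerCase.contains(k))

  def main(args: Array[String]): Unit = {
    println(containedKeywords("I contain both keywords pineapple and lemon")) // List(pineapple, lemon)
    println(containedKeywords("I sadly don't contain anything"))              // List()
    println(containedKeywords("Pineapple's are delicious"))                   // List(pineapple)
  }
}
```

Note this is substring matching, so a keyword like apple would also match inside spineapple; the second answer below addresses that with word boundaries.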

You can convert the keyword list to a DataFrame and then join on an rlike condition. It is best to add \\b before and after each keyword to mark word boundaries; this prevents partial matches, e.g. apple matching spineapple.

val result = df.as("df")
    .join(keywords.toDF("keywords").as("keywords"), 
          expr("lower(df.body) rlike '\\\\b' || keywords.keywords || '\\\\b'"), 
          "left"
         )
    .groupBy("id", "body")
    .agg(collect_list("keywords").as("Contains_keywords"))

result.show(false)
+---+-------------------------------------------+------------------+
|id |body                                       |Contains_keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|789|Pineapple's are delicious                  |[pineapple]       |
|456|I sadly don't contain anything             |[]                |
+---+-------------------------------------------+------------------+
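The word-boundary reasoning can be checked without Spark: \b in a regex matches only at the edge of a word, so a keyword surrounded by \b never matches inside a longer word. A sketch in plain Scala (object name and the extra apple keyword are illustrative assumptions, added to show the boundary effect):

```scala
object WordBoundarySketch {
  // "apple" is added here purely to demonstrate the boundary behaviour.
  val keywords: List[String] = List("pineapple", "lemon", "apple")

  // \b asserts a word boundary, so "apple" does not match inside "spineapple".
  def matchedKeywords(body: String): List[String] =
    keywords.filter(k => ("\\b" + k + "\\b").r.findFirstIn(body.toLowerCase).isDefined)

  def main(args: Array[String]): Unit = {
    println(matchedKeywords("spineapple smoothie"))  // List() -- no partial match
    println(matchedKeywords("apple and lemon tart")) // List(lemon, apple)
  }
}
```

The same doubled backslashes appear in the Spark expr string above: Scala turns '\\\\b' into \\b in the SQL literal, which the SQL parser in turn reads as the regex \b.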
