Scala: How to transform a DataFrame to add a column containing the list of strings found in another column
Suppose I have a list of keywords in Scala
val keywords = List("pineapple", "lemon")
and a DataFrame like this
+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+
How can I transform this DataFrame so that it has a new column listing the keywords that Body contains? The result I want is
+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+
Check the code below.
Create a DataFrame with the required sample data
scala> val df = Seq(
  (123, "I contain both keywords pineapple and lemon"),
  (456, "I sadly don't contain anything"),
  (789, "Pineapple's are delicious")).toDF("id", "body")
df: org.apache.spark.sql.DataFrame = [id: int, body: string]
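Use typedLit to add keywords to the DataFrame as an array column, then use the filter higher-order function (available in Spark SQL 2.4 and later) to keep only the keywords that occur in the body column. The body is lower-cased before matching, so the keywords themselves should be lower case.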
scala> df
  .withColumn("keywords", typedLit(keywords))
  .withColumn("Contains_Keywords", expr("filter(keywords, keyword -> instr(lower(body), keyword) > 0)"))
  .show(false)
Final output
+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
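The output above still carries the helper keywords column, whereas the result asked for in the question has only the ID, the body and Contains_Keywords. A minimal sketch of one way to get there, assuming Spark 2.4 or later and a local SparkSession; the object name ContainsKeywordsSketch and the helper withContainedKeywords are made up for illustration and are not part of the original answer:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, typedLit}

object ContainsKeywordsSketch {
  // Hypothetical helper: adds Contains_Keywords and drops the temporary keywords column.
  // Assumes the keywords are already lower case, because body is lower-cased before matching.
  def withContainedKeywords(df: DataFrame, keywords: List[String]): DataFrame =
    df.withColumn("keywords", typedLit(keywords))
      .withColumn("Contains_Keywords",
        expr("filter(keywords, keyword -> instr(lower(body), keyword) > 0)"))
      .drop("keywords")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; in spark-shell the session already exists as spark.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("contains-keywords-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (123, "I contain both keywords pineapple and lemon"),
      (456, "I sadly don't contain anything"),
      (789, "Pineapple's are delicious")
    ).toDF("id", "body")

    withContainedKeywords(df, List("pineapple", "lemon")).show(false)
    spark.stop()
  }
}
```

Note that instr is a plain substring test, so a keyword such as apple would also be reported for a body that only mentions pineapple.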
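You can convert the keyword list into a DataFrame and then join on an rlike condition. It is best to add \\\\b before and after the keyword to mark word boundaries, which prevents partial matches such as apple matching pineapple. The four backslashes are what the Scala source needs: with Spark's default string-literal parsing, both the Scala string and the SQL string unescape them, so the regex engine sees \b.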
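import org.apache.spark.sql.functions.{collect_list, expr}
import spark.implicits._  // needed for toDF; both imports are already in scope in spark-shell

// The left join keeps rows without any match; collect_list skips the resulting nulls, giving [].
val result = df.as("df")
  .join(keywords.toDF("keywords").as("keywords"),
    expr("lower(df.body) rlike '\\\\b' || keywords.keywords || '\\\\b'"),
    "left"
  )
  .groupBy("id", "body")
  .agg(collect_list("keywords").as("Contains_keywords"))
result.show(false)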
+---+-------------------------------------------+------------------+
|id |body                                       |Contains_keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|789|Pineapple's are delicious                  |[pineapple]       |
|456|I sadly don't contain anything             |[]                |
+---+-------------------------------------------+------------------+
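To see what the word boundaries buy you without starting Spark, the same matching rule can be tried on plain Scala strings. This is only a sketch of the per-row logic, not the join itself; the object name WordBoundarySketch and the function containedKeywords are hypothetical, and the call to Pattern.quote is an extra precaution that the rlike expression above does not apply:

```scala
import java.util.regex.Pattern

object WordBoundarySketch {
  // Returns the keywords that occur in body as whole words, ignoring case.
  def containedKeywords(body: String, keywords: Seq[String]): Seq[String] = {
    val lowered = body.toLowerCase
    keywords.filter { kw =>
      // Pattern.quote stops regex metacharacters inside a keyword from being interpreted.
      val pattern = Pattern.compile("\\b" + Pattern.quote(kw.toLowerCase) + "\\b")
      pattern.matcher(lowered).find()
    }
  }

  def main(args: Array[String]): Unit = {
    val keywords = Seq("pineapple", "lemon", "apple")
    // "apple" is not reported: inside "pineapple" it is not preceded by a word boundary.
    println(containedKeywords("Pineapple's are delicious", keywords)) // List(pineapple)
    println(containedKeywords("I sadly don't contain anything", keywords)) // List()
  }
}
```

In the Spark version the keyword is concatenated into the pattern unquoted, so a keyword containing characters such as . or + would be read as regex syntax there.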