Apache Spark: finding the most common words in a dataset

My dataset looks like this:


|prediction|text                                                                         |
|1         |this is a important sentence important sentence                              |
|2         |this is a simple sentence, simple sentence and important sentence]           |
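I want to find the N most common words in each row (for example, the top 2). For instance:

  • In the first row, the most common words are "important" and "sentence"
  • In the second row, the most common words are "simple" and "sentence"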

|prediction|text                                                                         | top terms
|1         |this is a important sentence important sentence                              | important,sentence
|2         |this is a simple sentence, simple sentence and important sentence]           | simple,sentence


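import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// remove every character except letters, digits, spaces and hyphens, split on " " and explode into one word per row
df.select(col("*"), explode(split(regexp_replace(col("text"), "[^a-zA-Z0-9 -]", ""), " ")).as("val"))
  // count each word, grouped by 'prediction' and word
  .groupBy("prediction", "text", "val").agg(count("val").as("cnt"))
  // number the words within each 'prediction' by descending count
  .withColumn("rn", row_number().over(Window.partitionBy("prediction").orderBy(col("cnt").desc)))
  // keep only the top 2 words
  .filter(col("rn") <= 2)
  // collect the top terms with collect_list and join them into a string with array_join
  .groupBy("prediction", "text").agg(array_join(collect_list("val"), ",").as("top_terms"))
  .show(false)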
|prediction|text                                                              |top_terms         |
|1         |this is a important sentence important sentence                   |sentence,important|
|2         |this is a simple sentence, simple sentence and important sentence]|sentence,simple   |
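
The per-row logic described in the steps above (clean, split, count, keep the top N, join) can be cross-checked on a small sample without a Spark session. The following is a minimal plain-Python sketch, not the Spark pipeline itself; the names `top_terms` and `rows` are hypothetical and exist only for illustration. It reuses the same character class as the regex in the answer, but splits on any run of whitespace rather than on a single space.

```python
import re
from collections import Counter


def top_terms(text, n=2):
    """Return the n most frequent words in text as a comma-separated string."""
    # Same cleanup as the regex in the answer: keep letters, digits, spaces and hyphens.
    cleaned = re.sub(r"[^a-zA-Z0-9 -]", "", text)
    # split() with no argument splits on runs of whitespace, so no empty words appear.
    words = cleaned.split()
    # most_common orders by count, descending; equal counts keep first-seen order.
    return ",".join(word for word, _ in Counter(words).most_common(n))


# Sample rows copied from the question (hypothetical variable name).
rows = [
    (1, "this is a important sentence important sentence"),
    (2, "this is a simple sentence, simple sentence and important sentence]"),
]

for prediction, text in rows:
    print(prediction, top_terms(text))
```

In the first row "important" and "sentence" both occur twice, so their relative order depends on how ties are broken. `Counter.most_common` keeps first-seen order for equal counts, whereas a `row_number` window ordered only by count makes no such promise, which may be why the two tables above list the same terms in a different order.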