Apache Spark - find the most common words in a dataset


My dataset looks like this:

+----------+-----------------------------------------------------------------------------+
|prediction|text                                                                         |
+----------+-----------------------------------------------------------------------------+
|1         |this is a important sentence important sentence                              |
|2         |this is a simple sentence, simple sentence and important sentence]           |
+----------+-----------------------------------------------------------------------------+
I want to find the N most common words in each row (e.g. the top 2 words). For example:

  • the most common words in the first row are "important" and "sentence"
  • the most common words in the second row are "simple" and "sentence"
Finally, the new dataset I create should look like this:

+----------+-----------------------------------------------------------------------------+------------------+
|prediction|text                                                                         |top terms         |
+----------+-----------------------------------------------------------------------------+------------------+
|1         |this is a important sentence important sentence                              |important,sentence|
|2         |this is a simple sentence, simple sentence and important sentence]           |simple,sentence   |
+----------+-----------------------------------------------------------------------------+------------------+
Please give me a piece of code that solves this. I work in Java, but the language you use doesn't matter, as I will convert it.


Thanks

Thanks, but I need the words for each prediction shown in the same row, as in the example in my post. @merchantappsep I added some changes:
// regex-replace all non-alphanumeric characters, split on " " and explode into one row per word
// count each word, grouping by 'prediction and the word
// add a descending row number over the word counts within each 'prediction
// filter/select only the top 2 words
// collect the top terms into an array using collect_list
// concatenate the array values with array_join to return a string

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

// the exploded column is aliased "word" rather than "val", because `val` is a
// Scala keyword and the symbol literal 'val does not compile
df.select('*, explode(split(regexp_replace('text, "[^a-zA-Z0-9 -]", ""), " ")).as("word"))
  .groupBy('prediction, 'word, 'text)
  .agg(count('word).as("word_cnt"))
  .select('*, row_number().over(Window.partitionBy("prediction").orderBy('word_cnt.desc)).as("row_number"))
  .where('row_number <= 2)
  .select('prediction, 'text, 'word)
  .groupBy('prediction, 'text).agg(collect_list('word).as("topterms"))
  .select('prediction, 'text, array_join('topterms, ",").as("top_terms"))
  .show(false)
+----------+------------------------------------------------------------------+------------------+
|prediction|text                                                              |top_terms         |
+----------+------------------------------------------------------------------+------------------+
|1         |this is a important sentence important sentence                   |sentence,important|
|2         |this is a simple sentence, simple sentence and important sentence]|sentence,simple   |
+----------+------------------------------------------------------------------+------------------+
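
For reference, here is a minimal end-to-end sketch of the same pipeline, wrapped in a function that takes the top-N cutoff as a parameter and run against the sample data from the question. The names TopTermsExample and topTerms, and the local SparkSession setup, are my own assumptions rather than part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

object TopTermsExample {
  // same steps as the answer above, with the top-2 cutoff generalized to top-N
  def topTerms(df: DataFrame, n: Int): DataFrame = {
    import df.sparkSession.implicits._
    df.select('*, explode(split(regexp_replace('text, "[^a-zA-Z0-9 -]", ""), " ")).as("word"))
      .groupBy('prediction, 'word, 'text)
      .agg(count('word).as("word_cnt"))
      .select('*, row_number().over(Window.partitionBy("prediction").orderBy('word_cnt.desc)).as("row_number"))
      .where('row_number <= n)
      .groupBy('prediction, 'text)
      .agg(array_join(collect_list('word), ",").as("top_terms"))
  }

  def main(args: Array[String]): Unit = {
    // local session for testing only (an assumption); reuse your existing SparkSession in real code
    val spark = SparkSession.builder().master("local[*]").appName("top-terms").getOrCreate()
    import spark.implicits._

    // sample data from the question
    val df = Seq(
      (1, "this is a important sentence important sentence"),
      (2, "this is a simple sentence, simple sentence and important sentence]")
    ).toDF("prediction", "text")

    topTerms(df, 2).show(false)
    spark.stop()
  }
}

One caveat: row_number breaks ties in the word counts arbitrarily, and collect_list gives no ordering guarantee, which is why the output above shows sentence,important rather than the important,sentence order sketched in the question.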