Apache Spark - find the most common words in a dataset


My dataset looks like this:

+----------+-----------------------------------------------------------------------------+
|prediction|text                                                                         |
+----------+-----------------------------------------------------------------------------+
|1         |this is a important sentence important sentence                              |
|2         |this is a simple sentence, simple sentence and important sentence]           |
+----------+-----------------------------------------------------------------------------+
I want to find the N most common words in each row (e.g. the top 2 words). For example:

  • the most common words in the first row are "important" and "sentence"
  • the most common words in the second row are "simple" and "sentence"
Finally, the new dataset I create should look like this:

+----------+-----------------------------------------------------------------------------+------------------+
|prediction|text                                                                         |top terms         |
+----------+-----------------------------------------------------------------------------+------------------+
|1         |this is a important sentence important sentence                              |important,sentence|
|2         |this is a simple sentence, simple sentence and important sentence]           |simple,sentence   |
+----------+-----------------------------------------------------------------------------+------------------+
Please give me a piece of code that solves this. I work in Java, but the language you use doesn't matter, as I will convert it.


Thanks

Thanks, but I need the words for each prediction shown in the same row, as in the example in my post. @merchantappsep I added some changes:
// regex-replace all non-alphanumeric characters, split on " " and explode into one row per word
// count each word, grouping by 'prediction and the word
// add a descending row number over the word counts within each 'prediction
// filter/select only the top 2 words
// collect the top terms into an array using collect_list
// concatenate the array values with array_join to return a string

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

// the exploded column is aliased "word" rather than "val", because `val` is a
// Scala keyword and the symbol literal 'val does not compile
df.select('*, explode(split(regexp_replace('text, "[^a-zA-Z0-9 -]", ""), " ")).as("word"))
  .groupBy('prediction, 'word, 'text)
  .agg(count('word).as("word_cnt"))
  .select('*, row_number().over(Window.partitionBy("prediction").orderBy('word_cnt.desc)).as("row_number"))
  .where('row_number <= 2)
  .select('prediction, 'text, 'word)
  .groupBy('prediction, 'text).agg(collect_list('word).as("topterms"))
  .select('prediction, 'text, array_join('topterms, ",").as("top_terms"))
  .show(false)
+----------+------------------------------------------------------------------+------------------+
|prediction|text                                                              |top_terms         |
+----------+------------------------------------------------------------------+------------------+
|1         |this is a important sentence important sentence                   |sentence,important|
|2         |this is a simple sentence, simple sentence and important sentence]|sentence,simple   |
+----------+------------------------------------------------------------------+------------------+
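
For reference, here is a minimal end-to-end sketch of the same pipeline, wrapped in a function that takes the top-N cutoff as a parameter and run against the sample data from the question. The names TopTermsExample and topTerms, and the local SparkSession setup, are my own assumptions rather than part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

object TopTermsExample {
  // same steps as the answer above, with the top-2 cutoff generalized to top-N
  def topTerms(df: DataFrame, n: Int): DataFrame = {
    import df.sparkSession.implicits._
    df.select('*, explode(split(regexp_replace('text, "[^a-zA-Z0-9 -]", ""), " ")).as("word"))
      .groupBy('prediction, 'word, 'text)
      .agg(count('word).as("word_cnt"))
      .select('*, row_number().over(Window.partitionBy("prediction").orderBy('word_cnt.desc)).as("row_number"))
      .where('row_number <= n)
      .groupBy('prediction, 'text)
      .agg(array_join(collect_list('word), ",").as("top_terms"))
  }

  def main(args: Array[String]): Unit = {
    // local session for testing only (an assumption); reuse your existing SparkSession in real code
    val spark = SparkSession.builder().master("local[*]").appName("top-terms").getOrCreate()
    import spark.implicits._

    // sample data from the question
    val df = Seq(
      (1, "this is a important sentence important sentence"),
      (2, "this is a simple sentence, simple sentence and important sentence]")
    ).toDF("prediction", "text")

    topTerms(df, 2).show(false)
    spark.stop()
  }
}

One caveat: row_number breaks ties in the word counts arbitrarily, and collect_list gives no ordering guarantee, which is why the output above shows sentence,important rather than the important,sentence order sketched in the question.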