Scala: fitting a Pipeline and processing the data

scala apache-spark

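I have a file that contains text. What I want to do is use a pipeline to tokenize the text, remove the stop words, and produce 2-grams.

What I have done so far: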

Step 1: Read the file

val data = sparkSession.read.text("data.txt").toDF("text")

Step 2: Build the pipeline

val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")

val pipeline = new Pipeline().setStages(Array(pipe1, pipe2, pipe3))
val model = pipeline.fit(data)

I know that pipeline.fit(data) produces a PipelineModel, but I don't know how to use a PipelineModel.

Any help would be greatly appreciated.

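When you run val model = pipeline.fit(data), every Estimator stage (i.e. a machine learning task such as classification, regression, or clustering) is fit to the data, and each one produces a Transformer stage. Your pipeline contains only Transformer stages, because all it does is create features.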

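To execute your model (which now consists only of Transformer stages), you need to run val results = model.transform(data). This executes each Transformer stage against your DataFrame. At the end of the model.transform(data) call you will therefore have a DataFrame made up of the original lines, the Tokenizer output, the StopWordsRemover output, and finally the NGram results.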
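To make the shape of that result concrete, here is a minimal sketch. It assumes a local SparkSession and a single made-up input line in place of data.txt; the names spark and sample are illustrative and do not come from the question.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{NGram, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: a local session, convenient for trying this out on one machine.
val spark = SparkSession.builder.appName("TransformSketch").master("local[*]").getOrCreate()

// Assumption: one made-up line stands in for the contents of data.txt.
val sample = spark.createDataFrame(Seq(Tuple1("The Quangle Wangle sat on the Crumpetty Tree"))).toDF("text")

val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")

val model = new Pipeline().setStages(Array(pipe1, pipe2, pipe3)).fit(sample)

// transform() appends one output column per stage to the input DataFrame.
val results = model.transform(sample)
results.printSchema()
results.select("text", "words", "filtered", "ngrams").show(false)
```

Each stage only adds its output column, so the original text column is still available next to the intermediate ones.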

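Once the features have been created, the top 5 n-grams can be found with a Spark SQL query. First explode the ngrams column, then group by n-gram and count, order by the count column in descending order, and call show(5). Alternatively, you can use a LIMIT 5 clause instead of show(5).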
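The two variants of that query might look roughly like the sketch below. It is not part of the original answer: the tiny results DataFrame is made-up data standing in for the output of model.transform(data), and the view name results and the column name ngram are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, desc, explode}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder.appName("TopNGrams").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up stand-in for model.transform(data): one array of 2-grams per row.
val results = Seq(
  Seq("quangle wangle", "wangle said"),
  Seq("quangle wangle", "wangle quee"),
  Seq("quangle wangle")
).toDF("ngrams")

// DataFrame API: explode, count per n-gram, sort descending, keep 5 rows.
val top5 = results
  .select(explode(col("ngrams")).as("ngram"))
  .groupBy("ngram")
  .agg(count("*").as("ngramCount"))
  .orderBy(desc("ngramCount"))
  .limit(5)

// The same query in Spark SQL, using a LIMIT 5 clause.
results.createOrReplaceTempView("results")
val top5Sql = spark.sql(
  """SELECT ngram, COUNT(*) AS ngramCount
    |FROM results LATERAL VIEW explode(ngrams) t AS ngram
    |GROUP BY ngram
    |ORDER BY ngramCount DESC
    |LIMIT 5""".stripMargin)

top5.show(false)
top5Sql.show(false)
```

Unlike show(5), which only prints, limit(5) and LIMIT 5 return a DataFrame that can be processed further or written out.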

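As an aside, you should probably rename your object to something that is not a standard class name. Otherwise you will get an ambiguous scope error.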

Code:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{NGram, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NGramPipeline {
    def main(): Unit = {
        val sparkSession = SparkSession.builder.appName("NGram Pipeline").getOrCreate()

        // Read the file as one row per line, in a single column named "text"
        val data = sparkSession.read.text("quangle.txt").toDF("text")

        // Feature stages: tokenize, drop stop words, build 2-grams
        val pipe1 = new Tokenizer().setInputCol("text").setOutputCol("words")
        val pipe2 = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
        val pipe3 = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")

        val pipeline = new Pipeline().setStages(Array(pipe1, pipe2, pipe3))
        val model = pipeline.fit(data)

        // Run every stage; this adds the words, filtered and ngrams columns
        val results = model.transform(data)

        // One row per n-gram, then count each n-gram and sort by frequency, descending.
        // col("ngrams") is used instead of $"ngrams" so no implicits import is needed.
        val explodedNGrams = results.withColumn("explNGrams", explode(col("ngrams")))
        explodedNGrams.groupBy("explNGrams").agg(count("*") as "ngramCount").orderBy(desc("ngramCount")).show(10, false)
    }
}
NGramPipeline.main()

Output:

+-----------------+----------+
|explNGrams       |ngramCount|
+-----------------+----------+
|quangle wangle   |9         |
|wangle quee.     |4         |
|'mr. quangle     |3         |
|said, --         |2         |
|wangle said      |2         |
|crumpetty tree   |2         |
|crumpetty tree,  |2         |
|quangle wangle,  |2         |
|crumpetty tree,--|2         |
|blue babboon,    |2         |
+-----------------+----------+
only showing top 10 rows

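Note that some punctuation (commas, dashes, etc.) is causing near-duplicate rows, such as "crumpetty tree" and "crumpetty tree,". When building n-grams it is usually a good idea to filter the punctuation out first, which you can typically do with a regular expression.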
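One possible way to do that filtering, which the original answer does not show, is to replace Tokenizer with Spark ML's RegexTokenizer and split on runs of non-word characters. The sketch below assumes a local session and a made-up sample line; the pattern \W+ is only one choice and also splits on apostrophes inside words.

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder.appName("RegexTokenizerSketch").master("local[*]").getOrCreate()

// Made-up sample line containing the kind of punctuation seen in the output.
val sample = spark.createDataFrame(Seq(Tuple1("'Mr. Quangle Wangle Quee.'"))).toDF("text")

// Split on one or more non-word characters, so punctuation never ends up in a token.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")

val tokenized = regexTokenizer.transform(sample)
val words = tokenized.head().getSeq[String](1)
println(words.mkString(", "))
```

In the pipeline this stage would take the place of pipe1, with the StopWordsRemover and NGram stages left unchanged.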

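For specifics on what I mean by "Transformer", see the "How it works" section of the Pipeline documentation.

Thanks for your reply. Could you explain what I should do to get the 5 most frequent 2-grams? The code you added shows the first five 2-grams, not the five most frequent.

Sure, that is just a SQL query once the n-gram features have been created. It has been added to the example.

Thanks again for the update. Just one question/clarification: why does "quangle wangle" appear in the first two rows? Shouldn't they be merged? Correct me if I'm wrong.

These are n-gram pairs, so one pair is (quangle wangle, wangle quee) and the other is (quangle wangle, wangle said).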
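The point made in the last comment can be checked without Spark. The sketch below uses a hypothetical helper, ngrams, that builds n-grams with a sliding window over the tokens, which is the same idea the NGram stage applies to each row. Neighbouring windows share a word, so "wangle" legitimately shows up in two different 2-grams.

```scala
// Hypothetical helper (not a Spark API): n-grams as a sliding window over tokens.
def ngrams(tokens: Seq[String], n: Int): Seq[String] =
  tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList

val tokens = Seq("quangle", "wangle", "quee.")
val bigrams = ngrams(tokens, 2)

// Adjacent 2-grams overlap by one word, so they are distinct rows when counted.
println(bigrams.mkString(", "))  // prints "quangle wangle, wangle quee."
```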