Apache Spark: performing NGram on a Spark DataFrame


I am using Spark 2.3.1, and I have a Spark DataFrame like this:

+----------+
|    values|
+----------+
|embodiment|
|   present|
| invention|
|   include|
|   pairing|
|       two|
|  wireless|
|    device|
|   placing|
|     least|
|       one|
|       two|
+----------+
I want to apply the Spark ML n-gram feature like this:

bigram = NGram(n=2, inputCol="values", outputCol="bigrams")

bigramDataFrame = bigram.transform(tokenized_df)
The following error occurs at the line bigramDataFrame = bigram.transform(tokenized_df):

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
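The error says that NGram operates on an array of tokens per row, not on a bare string. As a plain-Python sketch (not the Spark API), the sliding window that NGram applies only makes sense over a list of tokens, which is why Spark enforces ArrayType(StringType):

```python
# Plain-Python sketch of the n-gram windowing NGram performs per row.
# Each row must hold a *list* of string tokens, not a single string.

def ngrams(tokens, n=2):
    """Return the n-grams (joined by spaces) over a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["embodiment", "present", "invention"]))
# ['embodiment present', 'present invention']
```

A bare string would only allow character-wise windows, so Spark rejects StringType input rather than silently producing character n-grams.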

So I changed the code:

df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))

bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")

bigramDataFrame = bigram.transform(df_new)

bigramDataFrame.show()
So I got my final DataFrame, as shown below:

+----------+------------+-------+
|    values|     testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]|     []|
|   present|   [present]|     []|
| invention| [invention]|     []|
|   include|   [include]|     []|
|   pairing|   [pairing]|     []|
|       two|       [two]|     []|
|  wireless|  [wireless]|     []|
|    device|    [device]|     []|
|   placing|   [placing]|     []|
|     least|     [least]|     []|
|       one|       [one]|     []|
|       two|       [two]|     []|
+----------+------------+-------+
Why are my bigram column values empty?

I want the output of my bigrams column to look like this:

+--------------------+
|bigrams             |
+--------------------+
|embodiment present  |
|present invention   |
|invention include   |
|include pairing     |
|pairing two         |
|two wireless        |
|wireless device     |
|device placing      |
|placing least       |
|least one           |
|one two             |
+--------------------+

The bigram column values are empty because there are no bigrams within each row of the "values" column.
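Sketched in plain Python (not the Spark API), a one-element token array cannot fit a window of size 2, so the result is an empty list, exactly like the empty bigrams column above:

```python
def ngrams(tokens, n=2):
    """Return the n-grams (joined by spaces) over a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Each row of the "testing" column holds exactly one token,
# so no window of size 2 fits and every row yields [].
print(ngrams(["embodiment"]))  # []
```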

If the values in the input DataFrame look like this:

+--------------------------------------------+
|values                                      |
+--------------------------------------------+
|embodiment present invention include pairing|
|two wireless device placing                 |
|least one two                               |
+--------------------------------------------+
then you can get bigram output as shown below:

+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|values                                      |testing                                           |ngrams                                                                     |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing                 |[two, wireless, device, placing]                  |[two wireless, wireless device, device placing]                            |
|least one two                               |[least, one, two]                                 |[least one, one two]                                                       |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
The Scala Spark code to do this is:

val df_new = df.withColumn("testing", split(df("values")," "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
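The per-row effect of the Scala code above (split each string on spaces, then form bigrams over the resulting token array) can be sketched in plain Python; the row data is taken from the example table:

```python
def ngrams(tokens, n=2):
    """Return the n-grams (joined by spaces) over a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

rows = [
    "embodiment present invention include pairing",
    "two wireless device placing",
    "least one two",
]

for row in rows:
    tokens = row.split(" ")       # what split(df("values"), " ") produces
    print(tokens, "->", ngrams(tokens))
```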
A bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words.

But in your input DataFrame each row contains only one token, so you cannot get any bigrams from it.

So, for your problem, you can do it like this:

Input: df1
+----------+
|values    |
+----------+
|embodiment|
|present   |
|invention |
|include   |
|pairing   |
|two       |
|wireless  |
|devic     |
|placing   |
|least     |
|one       |
|two       |
+----------+

Output: ngramDataFrameInRows
+------------------+
|ngrams            |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless devic    |
|devic placing     |
|placing least     |
|least one         |
|one two           |
+------------------+
 
Spark Scala code:

val df_new = df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
val ngramDataFrameInRows = ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))
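What this pipeline does, sketched step by step in plain Python (not the Spark API): collect_list gathers the whole column into one array in a single row, NGram turns that array into bigrams, and explode emits one bigram per row. Note that collect_list gives no ordering guarantee on a distributed DataFrame, so the bigrams are only meaningful if the column order is stable (e.g. on a single partition or after an explicit sort):

```python
def ngrams(tokens, n=2):
    """Return the n-grams (joined by spaces) over a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

values = ["embodiment", "present", "invention", "include", "pairing",
          "two", "wireless", "devic", "placing", "least", "one", "two"]

collected = values            # collect_list: whole column -> one array
bigrams = ngrams(collected)   # NGram over that single array
for g in bigrams:             # explode: one bigram per output row
    print(g)
```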

Do you want something like:
df.select(F.concat_ws(" ", F.col("values"), F.lead("values").over(Window.orderBy(F.lit(None))))).show()
? @anky Your suggestion is right. Could you explain it in an answer post, and please also suggest how to materialize trigrams or four-grams; I tried it myself but it did not work. Also, do you know why the PySpark ML n-gram feature does not work here (I ran the same code in an HDP sandbox with the same Spark version and it worked as expected)? By the way, I run Spark locally with the spark-submit command.
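Regarding the trigram/four-gram question in the comment above: NGram itself only needs a larger n (setN(3) or n=3 in PySpark). Sketched in plain Python, the same sliding window simply widens:

```python
def ngrams(tokens, n):
    """Return the n-grams (joined by spaces) over a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["least", "one", "two", "wireless"]
print(ngrams(tokens, 3))  # ['least one two', 'one two wireless']
print(ngrams(tokens, 4))  # ['least one two wireless']
```

A list of m tokens yields m - n + 1 n-grams, so each row must contain at least n tokens or the result for that row is empty.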