Scala: Spark MLlib Word2Vec

I am trying to run Spark MLlib's Word2Vec implementation, using Scala. My input to the model is an array of sequences of strings. It looks like this:

scala> f.take(5)
res11: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0_42)], [WrappedArray(big, baller, shoe, ?)], [WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, tribe, become, future, kal...

val v=f.map(l=>Seq(l.toString))
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List  ([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ....
As shown above, each sentence is in its own list.

scala> val model = word2vec.fit(v)
But the output of this model does not look right. When I save the model and read its Parquet file back (as a), I get the following:

   model.save(sc, "myModelPath")
   val a=sqlContext.read.parquet("myModelPath")
   a.show(20,false)
+--------------------------------------------------------------------+
|word                                                                |
+--------------------------------------------------------------------+
|[WrappedArray(coffee, machine)]                                     |
|[WrappedArray(good, experience)]                                    |
|[WrappedArray(love, room, !)]                                       |
|[WrappedArray(parking, .)]                                          |
|[WrappedArray(breakfast, great, !)]                                 |
|[WrappedArray(bed, comfortable, room, spacious, .)]                 |
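Instead of creating a vector for each word, this Word2Vec model creates a vector for each array of words.
I am not sure what the correct way to feed input to this model is, or how it splits sentences.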

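I bet that if you look at v.first you will see List([WrappedArray(0_42)]), and if you look at v.first.head you will see [WrappedArray(0_42)]. But v.first.head is a String, so what you are actually looking at is "[WrappedArray(0_42)]". There is no WrappedArray there, just a string. You probably called toString on a WrappedArray by accident (or fell victim to an implicit conversion to String). Word2Vec actually saw strings like "[WrappedArray(coffee, machine)]" in its input and built a model from those strings.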
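To make the pitfall concrete, here is a minimal sketch in plain Scala, with no Spark involved. The object and method names (ToStringPitfall, wrong, right) are made up for illustration, and a plain Seq stands in for the WrappedArray from the question.

```scala
object ToStringPitfall {
  // Mirrors the mapping from the question: the whole sentence is rendered
  // as one string, so the result contains exactly one "word".
  def wrong(sentence: Seq[String]): Seq[String] = Seq(sentence.toString)

  // Keeps the tokens as they are: one element per word.
  def right(sentence: Seq[String]): Seq[String] = sentence

  def main(args: Array[String]): Unit = {
    val sentence = Seq("coffee", "machine")
    println(wrong(sentence).size) // 1: a single string holding the whole sentence
    println(right(sentence).size) // 2: "coffee" and "machine"
  }
}
```

A model trained on the output of wrong can only learn one vector per whole-sentence string, which matches what the saved Parquet file shows.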

Update

If I have your types right, f is a DataFrame in which each Row contains a single field holding a Seq[String] (actually a WrappedArray).

So instead of

val v=f.map(l=>Seq(l.toString))

what you should do to extract that field is

val v = f.map(r => r.getSeq[String](0))

This produces a Dataset[Seq[String]], which should be suitable as input to Word2Vec.
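Yes, you are right, I called toString on the WrappedArray. I need to convert Array[org.apache.spark.sql.Row] to Seq[String]. I have edited the question to show you what I did. Can you tell me the correct way to convert this input?

I have updated my answer with a suggested approach.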