
Scala: splitting a sparse feature vector into separate columns

Tags: scala, apache-spark, apache-spark-sql, apache-spark-mllib, apache-spark-ml

In my Spark dataframe I have a column that contains the output of a CountVectorizer transformation - it is in sparse vector format. What I am trying to do is "explode" this column back into a dense vector and then into its individual components, so that it can be used for scoring by an external model.

I know there are 40 features in the column, so I tried the following example:

import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.Vector

// convert sparse vector to a dense vector, and then to array<double> 
val vecToSeq = udf((v: Vector) => v.toArray)

// Prepare a list of columns to create
val exprs = (0 until 39).map(i => $"_tmp".getItem(i).alias(s"exploded_col$i"))
testDF.select(vecToSeq($"features").alias("_tmp")).select(exprs:_*)
I then get an error caused by:

Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.sql.Row
I also tried converting to an mllib vector by changing the UDF to the following:

val vecToSeq = udf((v: Vector) =>  org.apache.spark.mllib.linalg.Vectors.fromML(v.toDense).toArray )
and I get a similar "cannot be cast to org.apache.spark.sql.Row" error. Can anyone tell me why this is not working? Is there an easier way to explode a sparse vector in a dataframe into separate columns? I have spent hours on this and cannot figure it out.

Edit: the schema shows the feature column as a vector:

  |-- features: vector (nullable = true)
The full error trace:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features)' due to data type mismatch: argument 1 requires vector type, however, 'features' is of vector type.;;
Project [UDF(features#325) AS _tmp#463]
. . . 
org.apache.spark.sql.cassandra.CassandraSourceRelation@47eae91d

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2872)
        at org.apache.spark.sql.Dataset.select(Dataset.scala:1153)
        at uk.nominet.renewals.prediction_test$.prediction_test(prediction_test.scala:292)
        at 

When dealing with cases like this, I often break the problem down step by step to find out where it comes from.

First, let's set up a dataframe:

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vector
val df = sc.parallelize(Seq((1L, Seq("word1", "word2")))).toDF("id", "words")
val countModel = new CountVectorizer().setInputCol("words").setOutputCol("feature").fit(df)
val testDF = countModel.transform(df)
testDF.show

+---+--------------+-------------------+
| id|         words|            feature|
+---+--------------+-------------------+
|  1|[word1, word2]|(2,[0,1],[1.0,1.0])|
+---+--------------+-------------------+
Now, what I want to select is, say, the first column of feature - that is, to extract the first coordinate of the feature vector.

This can be written as v(0). Now I want my dataframe to have a column containing v(0), where v is the content of the feature column. I can use a user-defined function for that:

val firstColumnExtractor = udf((v: Vector) => v(0))
and I try to add this column to my testDF:
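presumably something along these lines, mirroring the generalized withColumn call that appears further down:

testDF.withColumn("feature_0", firstColumnExtractor($"feature")).show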

Note that I could also do it this way (as far as I can tell, this is just a matter of style):

testDF.select(firstColumnExtractor($"feature").as("feature_0")).show
This works, but it would take a lot of repetition. Let's automate. First, I can generalize the extraction function to any index. Let's create a higher-order function (a function that creates functions):

def columnExtractor(idx: Int) = udf((v: Vector) => v(idx))
Now I can rewrite the previous example as:

testDF.withColumn("feature_0", columnExtractor(0)($"feature")).show
OK, and now I could do this:

testDF.withColumn("feature_0", columnExtractor(0)($"feature"))
      .withColumn("feature_1", columnExtractor(1)($"feature"))
This works for 1 column, but what about 39 dimensions? Well, let's automate some more. The above is really a fold over each dimension:

(0 to 39).foldLeft(testDF)((df, idx) => df.withColumn("feature_"+idx, columnExtractor(idx)($"feature")))
which is just another way of writing the same thing as a single select over multiple column expressions:

val featureCols = (0 to 1).map(idx => columnExtractor(idx)($"feature").as("feature_"+idx))
testDF.select((col("*") +: featureCols):_*).show
+---+--------------+-------------------+---------+---------+
| id|         words|            feature|feature_0|feature_1|
+---+--------------+-------------------+---------+---------+
|  1|[word1, word2]|(2,[0,1],[1.0,1.0])|      1.0|      1.0|
+---+--------------+-------------------+---------+---------+
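Since the stated goal is to score these columns with an external model, you will probably want only the exploded columns in the end; a small sketch of that variation (the name scoringDF is just illustrative):

// keep only the exploded feature columns, dropping id, words and the original vector
val scoringDF = testDF.select(featureCols: _*)
scoringDF.show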
Now, for performance reasons, you might want to convert the base vector to an array of coordinates (or to a DenseVector). Feel free to do so. I suspect a DenseVector and an array will be very close in performance, so I would write it this way:

testDF.withColumn("feature_0", columnExtractor(0)($"feature"))
      .withColumn("feature_1", columnExtractor(1)($"feature"))
// A function to densify the feature vector
val toDense = udf((v:Vector) => v.toDense)
// Replace testDF's feature column with its dense equivalent
val denseDF = testDF.withColumn("feature", toDense($"feature"))
// Work on denseDF as we did on testDF 
denseDF.select((col("*") +: featureCols):_*).show
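Putting the pieces above together, a minimal end-to-end sketch (assuming a spark-shell style session where spark and sc are in scope, and taking the dimensionality from the fitted model's vocabulary instead of hard-coding 40):

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// toy data and a fitted CountVectorizer, as above
val df = sc.parallelize(Seq((1L, Seq("word1", "word2")))).toDF("id", "words")
val countModel = new CountVectorizer().setInputCol("words").setOutputCol("feature").fit(df)
val testDF = countModel.transform(df)

// one extraction UDF per index, produced by a higher-order function
def columnExtractor(idx: Int) = udf((v: Vector) => v(idx))

// the vector dimensionality equals the fitted vocabulary size
val numFeatures = countModel.vocabulary.length
val featureCols = (0 until numFeatures).map(idx => columnExtractor(idx)($"feature").as("feature_" + idx))
testDF.select((col("*") +: featureCols): _*).show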

There seems to be a problem with your import statements. As you noticed, CountVectorizer works with the ml package Vector, so all of your Vector imports should use that package as well. Make sure you do not have any imports from the older mllib package. These include:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.DenseVector
Some methods only exist in the mllib package, so in cases where you really do need that type of vector you can rename it on import (since the names are the same as for the ml vectors). For example:

import org.apache.spark.mllib.linalg.{Vector => mllibVector}
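For illustration, the two vector families can also be converted into one another explicitly when an mllib-only API is involved; a small sketch (the helper names toOld and toNew are just illustrative):

import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vector => MLLibVector, Vectors => MLLibVectors}

// ml -> mllib, for APIs that still expect the old type
def toOld(v: MLVector): MLLibVector = MLLibVectors.fromML(v)
// mllib -> ml, e.g. before handing the vector back to a DataFrame column
def toNew(v: MLLibVector): MLVector = v.asML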
After fixing all the imports, your code should run. Test:

val df = Seq((1L, Seq("word1", "word2", "word3")), (2L, Seq("word2", "word4"))).toDF("id", "words")
val countVec = new CountVectorizer().setInputCol("words").setOutputCol("features")
val testDF = countVec.fit(df).transform(df)
which will give a test dataframe that looks like this:

+---+--------------------+--------------------+
| id|               words|            features|
+---+--------------------+--------------------+
|  1|[word1, word2, wo...|(4,[0,2,3],[1.0,1...|
|  2|      [word2, word4]| (4,[0,1],[1.0,1.0])|
+---+--------------------+--------------------+
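For reference, the snippet above assumes the corrected imports (and the implicits for toDF and the $"..." syntax) are already in scope; a minimal sketch of that set-up:

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._   // for toDF and the $"..." column syntax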
Now to give each index its own column:

val vecToSeq = udf((v: Vector) => v.toArray)

val exprs = (0 until 4).map(i => $"features".getItem(i).alias(s"exploded_col$i"))
val df2 = testDF.withColumn("features", vecToSeq($"features")).select(exprs:_*)
The resulting dataframe:

+-------------+-------------+-------------+-------------+
|exploded_col0|exploded_col1|exploded_col2|exploded_col3|
+-------------+-------------+-------------+-------------+
|          1.0|          0.0|          1.0|          1.0|
|          1.0|          1.0|          0.0|          0.0|
+-------------+-------------+-------------+-------------+
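As a side note, on Spark 3.0 or later the built-in org.apache.spark.ml.functions.vector_to_array can replace the hand-written UDF entirely; a minimal sketch of that variant:

import org.apache.spark.ml.functions.vector_to_array

// convert the ml vector column to array<double> without a custom UDF
val arrDF = testDF.withColumn("features", vector_to_array($"features"))
val df2 = arrDF.select((0 until 4).map(i => $"features".getItem(i).alias(s"exploded_col$i")): _*)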

Comments:

Can you show the schema? What is categorization_split?

Added the schema. Categorization_split_vec is the actual name of the features column; I renamed it for simplicity, although not everywhere - fixed now.

This is a wonderful answer - it really clarifies the syntax. I can follow your example using the sample DF and it seems to work. But when I use my own DF, even just running your first UDF, I get the error: Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.sql.Row - I keep running into the same kind of error whatever approach I try, and it has me stuck because I have no idea how to deal with it!

Could you print the dataframe's schema object, e.g. testDF.schema("features"), and add it to the question for future reference? Seeing that you are also using the Cassandra connector, I suspect different kinds of Vector objects are involved.

The schema is a plain |-- features: vector (nullable = true), and my df's df.schema("features") output is StructField(features, org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7, true) - exactly the same as for testDF. Which makes it all the stranger that this code works on one and not the other.

Thanks for this. It should work, and it is exactly what I have been trying. When I run it on testDF it works fine, but on my real DF, which has an identical SparseVector column created by CountVectorizer, I still get the error: Failed to execute user defined function (... : vector => array), caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.sql.Row. df.schema("features") is the same StructField(features, org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7, true) for both DFs, so I don't know why. Also: all my imports are fine - the error seems to come from the initial vecToSeq function and a data format mismatch.

@renegademonkey: What does the real dataframe look like? Can you add it to the question?

It is too big - it contains 40 features, including integers, timestamps, doubles, strings and arrays - but that shouldn't be relevant? If I filter the whole DF down to just the features column and try to run vecToSeq on that column, I get the same error.

@renegademonkey: then a sample of the features column would be enough. If you take 5 rows of that column and run vecToSeq on them, does it still give the error?