Python: Add/subtract two PySpark CountVectorizer sparse vector columns

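I want to take the difference between pairs of documents transformed by CountVectorizer. In other words, I want the difference between two columns of sparse vectors. I apply the same transformer to df['doc1'] and df['doc2'], so the two vectors in df['X1'] - df['X2'] will always have the same dimension.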
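To illustrate why a single shared vocabulary keeps the two vectors aligned, here is a minimal pure-Python sketch that does not use Spark. The helpers `build_vocab` and `count_vector` are hypothetical names introduced for illustration, and the alphabetical index order is an assumption that will generally differ from the order CountVectorizer picks.

```python
from collections import Counter

def build_vocab(*token_lists):
    """Map every distinct token across all lists to a fixed index (hypothetical helper)."""
    tokens = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vector(tokens, vocab):
    """Return (size, {index: count}) for one document against the shared vocabulary."""
    counts = Counter(tokens)
    return len(vocab), {vocab[t]: float(c) for t, c in counts.items() if t in vocab}

doc1 = "homer likes donuts".split(" ")
doc2 = "donuts taste delicious".split(" ")

# One vocabulary built from both documents, then reused for each of them.
vocab = build_vocab(doc1, doc2)
x1 = count_vector(doc1, vocab)
x2 = count_vector(doc2, vocab)

print(x1)
print(x2)
```

Because both vectors are indexed against the same `vocab`, their sizes are equal and index `i` refers to the same word in each, which is what makes an element-wise difference meaningful.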

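    from pyspark.ml.feature import RegexTokenizer, CountVectorizer
    from pyspark.ml import Pipeline
    from pyspark.sql.functions import col

    df = spark.createDataFrame(
        [("homer likes donuts".split(" "), "donuts taste delicious".split(" "), 0),
         ("five by five boss".split(" "), "five is a number".split(" "), 1)],
        ["words1", "words2", "label"])
    display(df)

    # Stack both word columns so a single vocabulary covers both of them
    union_words = df.select(col('words1').alias('words')) \
                    .union(df.select(col('words2').alias('words')))

    cv = CountVectorizer() \
          .setInputCol('words') \
          .fit(union_words)

    df = cv.setInputCol('words1') \
           .setOutputCol('X1') \
           .transform(df)

    df = cv.setInputCol('words2') \
           .setOutputCol('X2') \
           .transform(df)

    display(df)

I can't add the columns directly: I get a type mismatch error saying that a numeric or calendar interval type is required. I also tried the solution from @zero323, but it fails with an assertion error on isinstance(v1, SparseVector).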

    df.withColumn("result", col("X1") + col("X2"))
    df.withColumn("result", add(col("X1"), col("X2")))

In sparse vector format, I expect the result to be:

[0,11,[2,4,8,9],[1,-1,-1,1]]
[0,11,[0,3,5,6,7,10],[1,1,-1,-1,-1,1]]

The function needs to be converted into a udf whose return type is VectorUDT. By combining the solution and ...

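    import numpy as np

    from pyspark.sql.functions import udf, col
    from pyspark.ml.linalg import SparseVector, Vectors, VectorUDT
    from pyspark.ml.feature import RegexTokenizer, CountVectorizer
    from pyspark.ml import Pipeline

    df = spark.createDataFrame(
        data=[("homer likes donuts".split(" "), ["donuts", "taste", "delicious"], 0),
              (["five", "by", "five", "boss"], ["five", "is", "a", "number"], 1)],
        schema=["words1", "words2", "label"])

    # Stack both word columns so a single vocabulary covers both of them
    union_words = df.select(col('words1').alias('words')) \
                    .union(df.select(col('words2').alias('words')))

    cv = CountVectorizer() \
          .setInputCol('words') \
          .fit(union_words)

    df = cv.setInputCol('words1') \
           .setOutputCol('X1') \
           .transform(df)

    df = cv.setInputCol('words2') \
           .setOutputCol('X2') \
           .transform(df)

    @udf(VectorUDT())
    def minus(v1, v2):
        # sparse vector will become dense
        assert isinstance(v1, SparseVector) and isinstance(v2, SparseVector)
        assert v1.size == v2.size
        # Compute the union of indices
        indices = set(v1.indices).union(set(v2.indices))
        # Not particularly efficient, but we are limited by SPARK-10973
        # Create index: value dicts
        v1d = dict(zip(v1.indices, v1.values))
        v2d = dict(zip(v2.indices, v2.values))
        zero = np.float64(0)
        # Create dictionary index: (v1[index] - v2[index]), dropping zero entries
        values = {i: v1d.get(i, zero) - v2d.get(i, zero)
                  for i in indices
                  if v1d.get(i, zero) - v2d.get(i, zero) != zero}

        return Vectors.sparse(v1.size, values)

    df = df.withColumn('X_DIFF', minus('X1', 'X2'))
    display(df)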
+----------------------+--------------------------+-----+---------------------------+--------------------------------+-----------------------------------------------+
|words1                |words2                    |label|X1                         |X2                              |X_DIFF                                         |
+----------------------+--------------------------+-----+---------------------------+--------------------------------+-----------------------------------------------+
|[homer, likes, donuts]|[donuts, taste, delicious]|0    |(11,[1,4,10],[1.0,1.0,1.0])|(11,[1,3,5],[1.0,1.0,1.0])      |(11,[3,4,5,10],[-1.0,1.0,-1.0,1.0])            |
|[five, by, five, boss]|[five, is, a, number]     |1    |(11,[0,6,9],[2.0,1.0,1.0]) |(11,[0,2,7,8],[1.0,1.0,1.0,1.0])|(11,[0,2,6,7,8,9],[1.0,-1.0,1.0,-1.0,-1.0,1.0])|
+----------------------+--------------------------+-----+---------------------------+--------------------------------+-----------------------------------------------+
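
The X_DIFF values in the table can be checked by hand. Below is a pure-Python sketch of the same index-union subtraction with no Spark dependency. `sparse_minus` is a hypothetical helper, and representing a vector as a (size, indices, values) tuple is an assumption made only for illustration.

```python
def sparse_minus(v1, v2):
    """Subtract two sparse vectors given as (size, indices, values) tuples."""
    size1, idx1, val1 = v1
    size2, idx2, val2 = v2
    if size1 != size2:
        raise ValueError("vectors must have the same size")
    d1 = dict(zip(idx1, val1))
    d2 = dict(zip(idx2, val2))
    out_idx, out_val = [], []
    # Walk the union of indices in order and keep only non-zero differences.
    for i in sorted(set(d1) | set(d2)):
        diff = d1.get(i, 0.0) - d2.get(i, 0.0)
        if diff != 0.0:
            out_idx.append(i)
            out_val.append(diff)
    return size1, out_idx, out_val

# X1 and X2 copied from the two rows of the table above.
row0 = sparse_minus((11, [1, 4, 10], [1.0, 1.0, 1.0]),
                    (11, [1, 3, 5], [1.0, 1.0, 1.0]))
row1 = sparse_minus((11, [0, 6, 9], [2.0, 1.0, 1.0]),
                    (11, [0, 2, 7, 8], [1.0, 1.0, 1.0, 1.0]))
print(row0)
print(row1)
```

Index 1 ("donuts" in the first row) appears in both vectors with the same count, so its difference is zero and it is dropped from the sparse result, which matches the X_DIFF column.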