Python: computing the mean and max with VectorAssembler


I'm working with a DataFrame like this:

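from pyspark.ml.linalg import Vectors  # VectorAssembler produces pyspark.ml vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf  # needed for the udf(...) call further down
from pyspark.sql.types import *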

schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)
])

df = sqlContext.createDataFrame(
    data=[(0, 5, 5, 4, 0),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, 4, 1, 0),
          (4, 2, 1, 30, 10),
          (5, 0, 0, 0, 0)],
    schema=schema)

I need to compute the mean and the maximum of each row over the columns "m_ant21", "m_ant22", "m_ant23" and "m_ant24".

I'm trying to use VectorAssembler:

assembler = VectorAssembler(
    inputCols=["m_ant21", "m_ant22", "m_ant23","m_ant24"],
    outputCol="muestra")
output = assembler.transform(df)
output.show()

Now I have created a function to compute the mean, but the input variable is a "DenseVector" called "dv":

The same for the max:

def mi_Max(dv):
    return float(max(dv))

udf_max = udf(mi_Max, DoubleType())
output2 = output.withColumn("maxVec", udf_max(output.muestra))
output2.show()

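The problem is that both output1.show() and output2.show() fail with an error. It simply doesn't work, and I don't know what is going on in the code. What am I doing wrong?

Please help me.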

I tried this, have a look:

from pyspark.sql import functions as F

df.show()
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
|       0|      5|      5|      4|      0|
|       1|     23|     13|     17|     99|
|       2|      0|      0|      0|      1|
|       3|      0|      4|      1|      0|
|       4|      2|      1|     30|     10|
|       5|      0|      0|      0|      0|
+--------+-------+-------+-------+-------+

df1 = df.withColumn('mean',sum(df[c] for c in df.columns[1:])/len(df.columns[1:]))
df1 = df1.withColumn('max',F.greatest(*[F.coalesce(df[c],F.lit(0)) for c in df.columns[1:]]))

df1.show()

+--------+-------+-------+-------+-------+-----+---+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24| mean|max|
+--------+-------+-------+-------+-------+-----+---+
|       0|      5|      5|      4|      0|  3.5|  5|
|       1|     23|     13|     17|     99| 38.0| 99|
|       2|      0|      0|      0|      1| 0.25|  1|
|       3|      0|      4|      1|      0| 1.25|  4|
|       4|      2|      1|     30|     10|10.75| 30|
|       5|      0|      0|      0|      0|  0.0|  0|
+--------+-------+-------+-------+-------+-----+---+
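Why the built-in sum works on columns here: a Spark Column overloads the arithmetic operators, so sum() (which starts from 0 and applies + repeatedly) builds one column expression instead of adding numbers. The sketch below imitates that with a hypothetical Expr class (not part of PySpark; it only records the expression as text), so the mechanism can be seen without a Spark session.

```python
class Expr:
    """Hypothetical stand-in for a Spark Column: operators build expression text."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Expr(f"({self.text} + {_text(other)})")

    def __radd__(self, other):
        # Reached for `0 + Expr`, which is the first step of the built-in sum().
        return Expr(f"({_text(other)} + {self.text})")

    def __truediv__(self, other):
        return Expr(f"({self.text} / {_text(other)})")


def _text(value):
    """Render either an Expr or a plain Python literal."""
    return value.text if isinstance(value, Expr) else repr(value)


value_columns = ["m_ant21", "m_ant22", "m_ant23", "m_ant24"]

# Same shape as sum(df[c] for c in df.columns[1:]) / len(df.columns[1:])
mean_expr = sum(Expr(c) for c in value_columns) / len(value_columns)
print(mean_expr.text)
# (((((0 + m_ant21) + m_ant22) + m_ant23) + m_ant24) / 4)
```

This relies on sum being Python's built-in. If the name has been rebound to the Spark aggregate (for example through from pyspark.sql.functions import *), the same line would no longer behave this way, which is one reason to import the module under an alias such as F.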

It can be done with the DenseVector, but in the RDD way:

output2 = output.rdd.map(lambda x: (x.ClientId, 
                                   x.m_ant21, 
                                   x.m_ant22,
                                   x.m_ant23,
                                   x.m_ant24,
                                   x.muestra, 
                                   float(max(x.muestra))))
output2 = spark.createDataFrame(output2)
output2.show()

Which gives:

+---+---+---+---+---+--------------------+----+
| _1| _2| _3| _4| _5|                  _6|  _7|
+---+---+---+---+---+--------------------+----+
|  0|  5|  5|  4|  0|   [5.0,5.0,4.0,0.0]| 5.0|
|  1| 23| 13| 17| 99|[23.0,13.0,17.0,9...|99.0|
|  2|  0|  0|  0|  1|       (4,[3],[1.0])| 1.0|
|  3|  0|  4|  1|  0|   [0.0,4.0,1.0,0.0]| 4.0|
|  4|  2|  1| 30| 10| [2.0,1.0,30.0,10.0]|30.0|
|  5|  0|  0|  0|  0|           (4,[],[])| 0.0|
+---+---+---+---+---+--------------------+----+

All that is left now is to rename the columns, for example with the withColumnRenamed function. The mean case works the same way.

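The same can be done with a SparseVector, but in that case you need to access its values attribute (x.muestra.values instead of x.muestra).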

This approach works better when df has many columns and the max cannot be computed before the VectorAssembler stage.
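One caveat with reading values from a SparseVector: it holds only the stored entries, so the implicit zeros are missing from it. Below is a minimal sketch of the difference in plain Python, assuming a sparse vector is described by the (size, indices, values) triple that show() prints. sparse_max and stored_only_max are hypothetical helper names, not PySpark functions.

```python
def stored_only_max(values):
    """Max over the stored entries only; raises ValueError if nothing is stored."""
    return max(values)


def sparse_max(size, indices, values):
    """Max of a sparse vector described by (size, indices, values).

    indices is kept only to mirror the printed form; the max does not need it.
    """
    if size == 0:
        raise ValueError("an empty vector has no maximum")
    if len(values) < size:
        # At least one position is an implicit 0.0, so 0.0 competes for the max.
        return max([0.0, *values])
    return max(values)


print(sparse_max(4, [3], [1.0]))   # row printed as (4,[3],[1.0])
print(sparse_max(4, [], []))       # row printed as (4,[],[])
print(sparse_max(4, [1], [-2.0]))  # implicit zeros beat the stored -2.0
print(stored_only_max([-2.0]))     # stored entries alone would report -2.0
```

With non-negative data like the sample rows the two agree whenever something is stored, but an all-zero row has no stored entries at all, so it needs its own handling.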

I found a solution to this problem:

import pyspark.sql.functions  as f
import pyspark.sql.types as t

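# float() turns the NumPy scalar into a plain Python float before Spark serializes it
min_of_vector = f.udf(lambda vec: float(vec.toArray().min()), t.DoubleType())

max_of_vector = f.udf(lambda vec: float(vec.toArray().max()), t.DoubleType())

mean_of_vector = f.udf(lambda vec: float(vec.toArray().mean()), t.DoubleType())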

final = output.withColumn('min', min_of_vector('muestra')) \
        .withColumn('max', max_of_vector('muestra')) \
        .withColumn('mean', mean_of_vector('muestra'))

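Do you need the mean and max for each row, or for each column?

Yes. I need two new columns holding the values for each row.

Then why do you need VectorAssembler?

I have another approach, but it is not efficient because I have a lot of data!

For example, for the max you can get the maximum of each row's vector through the rdd: output.rdd.map(lambda x: np.max(x.muestra.toArray())), then join it with the output DataFrame converted to an rdd, where np is the NumPy alias. Have you tried that?

Just one question: with F.greatest(*[F.coalesce, how do I use these functions?

I have edited the answer to import the functions package.

It works fine, and it's clear. I wanted to do it with VectorAssembler, but this is a good solution. Thanks!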
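On the F.greatest / F.coalesce question in the comments: as far as I know, greatest skips nulls and returns null only when every argument is null, while + returns null as soon as one operand is null. So the mean in that answer would come out null for a row containing a null, and coalescing to 0 inside greatest turns an all-negative row with a null into a max of 0. The plain-Python model below uses None for null; sql_add, sql_greatest, coalesce and row_mean are hypothetical stand-ins written for this sketch, not Spark calls.

```python
from functools import reduce


def sql_add(a, b):
    """Model of `a + b` on nullable values: None if either side is None."""
    return None if a is None or b is None else a + b


def sql_greatest(*values):
    """Model of greatest(): ignore None, return None only if all are None."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def coalesce(value, default):
    """Model of coalesce(value, default) with a single fallback."""
    return default if value is None else value


def row_mean(row):
    """Sum with null propagation, then divide by the number of columns."""
    total = reduce(sql_add, row, 0)
    return None if total is None else total / len(row)


row = [5, None, 4, 0]
negatives = [-3, None, -1, -7]

print(row_mean(row))                                       # None
print(row_mean([coalesce(v, 0) for v in row]))             # 2.25
print(sql_greatest(*negatives))                            # -1
print(sql_greatest(*[coalesce(v, 0) for v in negatives]))  # 0
```

Whether a null should count as 0 depends on the data. The sample DataFrame has no nulls, so both variants give the same result there.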

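SparseVector variant of the RDD approach, reading the stored entries through the values attribute:

output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    # values holds only the stored entries;
                                    # default covers an all-zero row such as (4,[],[])
                                    float(max(x.muestra.values, default=0.0))))
output2 = spark.createDataFrame(output2)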