Apache Spark / PySpark - argmax of a row (of doubles) within one column
Tags: apache-spark, pyspark, apache-spark-sql, rdd

I have the following situation:
+--------------------+
| p|
+--------------------+
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
+--------------------+
This is the list of Row objects:
[Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06])]
I am trying to map this column into a new column called maxClass that returns np.argmax(row[0]) for every row. Below is my best attempt, but I cannot get the syntax right with this package:
import numpy as np

def f(row):
    return np.argmax(np.array(row.p))[0]

results = probs.rdd.map(lambda x: f(x))
results
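One likely problem with the attempt above: np.argmax already returns a plain integer index, so indexing its result with [0] raises an error. A minimal numpy-only sketch (outside Spark) on one of the probability vectors from the question:

```python
import numpy as np

# One row's probability vector, taken from the question's data.
p = [0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06,
     1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06,
     2.2234697862933896e-06, 3.006502154124559e-06]

idx = np.argmax(np.array(p))  # a 0-d integer, not an array
print(int(idx))  # prints 0: the first entry is the largest
```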
For completeness, as pault suggested, here is a solution that uses neither a UDF nor numpy, relying instead on array_position and array_max:

import pyspark.sql.functions as f

df = spark.createDataFrame([
    ([0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
    ([0.9999841641213134, 0.99999, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
    ([0.9999841641213135, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
]).toDF("p")

df.select(
    f.expr("array_position(cast(p as array<decimal(16,16)>), cast(array_max(p) as decimal(16,16))) - 1").alias("max_indx")
).show()

+--------+
|max_indx|
+--------+
|       0|
|       1|
|       0|
+--------+
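Note that Spark's array_position is 1-based, which is why the SQL expression subtracts 1. The same logic can be mimicked in plain Python (a rough stand-in for the expression, not Spark itself):

```python
def max_index(p):
    # array_position is 1-based: position = p.index(max(p)) + 1
    position = p.index(max(p)) + 1
    return position - 1  # subtract 1 to get a 0-based index, like the answer does

rows = [
    [0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06],
    [0.9999841641213134, 0.99999, 1.3699249952858219e-06],
]
print([max_index(p) for p in rows])  # prints [0, 1]
```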
Does removing the [0] from the f function work for you? It is hard to tell without knowing exactly what output you want.

Sorry, that was vague. I just want the index associated with the maximum value. I am getting closer, but still no success:

import pyspark.sql.functions as f
from pyspark.sql.functions import udf

def custom_function(row):
    return np.array(row['p']).argmax()

udf_custom_function = udf(custom_function)
new = probs.withColumn('p_max', udf_custom_function

Do not add [solved] to your question. Instead, if you think your solution could be useful to others, post it as an answer; otherwise delete the question.

Also, if you know the size of the array column, you could also do it that way, since this looks like a multiclass classification problem.

This is very close to what I want, but I need the index associated with the maximum, not the value itself. Since these are probabilities associated with classes, I have a lookup table mapping each index to the categorical class it represents.
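The asker's end goal, mapping the argmax index through an index-to-class lookup table, can be sketched in plain numpy. The class labels here are made up for illustration; the question does not give the actual lookup table:

```python
import numpy as np

# Hypothetical lookup table: index -> class label (not from the question).
class_lookup = {0: "class_a", 1: "class_b", 2: "class_c"}

p = [0.12, 0.85, 0.03]     # toy probability vector
idx = int(np.argmax(p))    # index of the most probable class
print(class_lookup[idx])   # prints class_b
```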