Apache Spark / PySpark - argmax of a row (of doubles) within one column
Tags: apache-spark, pyspark, apache-spark-sql, rdd

I have the following situation:
+--------------------+
| p|
+--------------------+
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
+--------------------+
This is the list of Row objects:
[Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06])]
I am trying to map this column into a new column called maxClass that returns np.argmax(row[0]) for every row. Below is my best attempt, but I cannot get the syntax right with this package:
import numpy as np

def f(row):
    return np.argmax(np.array(row.p))[0]

results = probs.rdd.map(lambda x: f(x))
results
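One likely problem with the attempt above: np.argmax already returns a plain integer index, so indexing its result with [0] raises an error. A minimal numpy-only sketch (outside Spark) on one of the probability vectors from the question:

```python
import numpy as np

# One row's probability vector, taken from the question's data.
p = [0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06,
     1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06,
     2.2234697862933896e-06, 3.006502154124559e-06]

idx = np.argmax(np.array(p))  # a 0-d integer, not an array
print(int(idx))  # prints 0: the first entry is the largest
```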
For completeness, as pault suggested, here is a solution that uses neither a UDF nor numpy, relying instead on array_position and array_max:

import pyspark.sql.functions as f

df = spark.createDataFrame([
    ([0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
    ([0.9999841641213134, 0.99999, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
    ([0.9999841641213135, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
]).toDF("p")

df.select(
    f.expr("array_position(cast(p as array<decimal(16,16)>), cast(array_max(p) as decimal(16,16))) - 1").alias("max_indx")
).show()

+--------+
|max_indx|
+--------+
|       0|
|       1|
|       0|
+--------+
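Note that Spark's array_position is 1-based, which is why the SQL expression subtracts 1. The same logic can be mimicked in plain Python (a rough stand-in for the expression, not Spark itself):

```python
def max_index(p):
    # array_position is 1-based: position = p.index(max(p)) + 1
    position = p.index(max(p)) + 1
    return position - 1  # subtract 1 to get a 0-based index, like the answer does

rows = [
    [0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06],
    [0.9999841641213134, 0.99999, 1.3699249952858219e-06],
]
print([max_index(p) for p in rows])  # prints [0, 1]
```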
Does removing the [0] from the f function work for you? It is hard to tell without knowing exactly what output you want.

Sorry, that was vague. I just want the index associated with the maximum value. I am getting closer, but still no success:

import pyspark.sql.functions as f
from pyspark.sql.functions import udf

def custom_function(row):
    return np.array(row['p']).argmax()

udf_custom_function = udf(custom_function)
new = probs.withColumn('p_max', udf_custom_function

Do not add [solved] to your question. Instead, if you think your solution could be useful to others, post it as an answer; otherwise delete the question.

Also, if you know the size of the array column, you could also do it that way, since this looks like a multiclass classification problem.

This is very close to what I want, but I need the index associated with the maximum, not the value itself. Since these are probabilities associated with classes, I have a lookup table mapping each index to the categorical class it represents.
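The asker's end goal, mapping the argmax index through an index-to-class lookup table, can be sketched in plain numpy. The class labels here are made up for illustration; the question does not give the actual lookup table:

```python
import numpy as np

# Hypothetical lookup table: index -> class label (not from the question).
class_lookup = {0: "class_a", 1: "class_b", 2: "class_c"}

p = [0.12, 0.85, 0.03]     # toy probability vector
idx = int(np.argmax(p))    # index of the most probable class
print(class_lookup[idx])   # prints class_b
```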