
Apache Spark: pyspark - argmax of a row (of doubles) within 1 column


I have the following DataFrame:

+--------------------+
|                   p|
+--------------------+
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
|[0.99998416412131...|
+--------------------+
As a list of Row objects it looks like this:

[Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
 Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
 Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
 Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06]),
 Row(p=[0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06])]
I am trying to map this column into a new column named maxClass that returns the np.argmax of p for every row. Below is my best attempt, but I can't get the syntax of this package right:

import numpy as np

def f(row):
    # np.argmax already returns a scalar index, so indexing it with [0] raises an error
    return int(np.argmax(np.array(row.p)))

results = probs.rdd.map(lambda x: f(x))
results.collect()  # `results` alone only displays the RDD; collect() materializes the indices
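For reference, a minimal sketch of how the UDF route could be wired up end to end, assuming probs is the DataFrame above with an array&lt;double&gt; column p; the names argmax_udf and maxClass are illustrative, not from the original thread:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# np.argmax returns a scalar position; cast it to a plain Python int so
# Spark can serialize it as an IntegerType value.
argmax_udf = udf(lambda p: int(np.argmax(p)), IntegerType())

# Attach the index of the largest probability as a new column.
probs_with_class = probs.withColumn("maxClass", argmax_udf("p"))
probs_with_class.show()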

For completeness, as pault suggested, here is a solution that uses neither a UDF nor numpy, relying on array_position and array_max instead:

import pyspark.sql.functions as f

df = spark.createDataFrame([
    ([0.9999841641213133, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
    ([0.9999841641213134, 0.99999, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
    ([0.9999841641213135, 5.975696995141415e-06, 1.3699249952858219e-06, 1.4817184271708493e-06, 2.9022272149130313e-07, 1.4883436072406822e-06, 2.2234697862933896e-06, 3.006502154124559e-06],),
]).toDF("p")

df.select(
    f.expr("array_position(cast(p as array<decimal(16,16)>), cast(array_max(p) as decimal(16,16))) - 1").alias("max_indx")
).show()

+--------+
|max_indx|
+--------+
|       0|
|       1|
|       0|
+--------+
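Two notes on the above: array_position is 1-based, hence the trailing - 1, and both array_max and array_position require Spark 2.4+. Also, if p were actually an ML Vector column rather than array&lt;double&gt; (typical for the probability output of a Spark ML classifier, which the comments below hint at), it would have to be converted to an array first. A hedged sketch, assuming Spark 3.0+ and a DataFrame probs with a Vector column p:

import pyspark.sql.functions as f
from pyspark.ml.functions import vector_to_array  # available since Spark 3.0

# Hypothetical: convert the Vector column to array<double>, then apply the
# same array_position/array_max trick.
probs_arr = probs.withColumn("p_arr", vector_to_array("p"))
probs_arr.select(
    f.expr("array_position(p_arr, array_max(p_arr)) - 1").alias("max_indx")
).show()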
Does removing the [0] from the f function work for you? It's hard to tell without knowing exactly what output you want.

Sorry, that was too vague. I just want the index associated with the maximum value.

I'm getting closer, but still no luck: import pyspark.sql.functions as f; from pyspark.sql.functions import udf; def custom_function(row): return np.array(row['p']).argmax(); udf_custom_function = udf(custom_function); new = probs.withColumn('p_max', udf_custom_function

Don't add [SOLVED] to your question. Instead, if you think your solution would be useful to others, post it as an answer; otherwise, delete the question.

Also, if you know the size of the array column you could do it that way, since this looks like a multiclass classification problem.

This is very close to what I want, but I need the index associated with the maximum, not the value itself. Since these are probabilities associated with classes, I have a lookup table that maps each index to the categorical variable it corresponds to.
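Since the last comment mentions a lookup table from index to class, here is one illustrative way to wire that in; the lookup entries and the column name className are invented for this sketch, not taken from the thread. Because array_max returns an element of the array itself, plain array_position finds it by exact equality:

import pyspark.sql.functions as f

# Hypothetical index-to-class lookup; the real labels would come from the
# asker's own mapping table.
lookup = spark.createDataFrame(
    [(0, "class_0"), (1, "class_1"), (2, "class_2")],
    ["max_indx", "className"],
)

indexed = df.select(
    f.expr("array_position(p, array_max(p)) - 1").alias("max_indx")
)
indexed.join(lookup, on="max_indx", how="left").show()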