
Python: how to get the split values instead of the bucket indices from pyspark Bucketizer

Tags: python, apache-spark, pyspark

When using Bucketizer in pyspark I am trying to get the split values, but the current result contains the bucket indices:

from pyspark.ml.feature import Bucketizer

# assumes an active SparkSession available as `spark` (e.g. the pyspark shell)
data = [(0, -1.0), (1, 0.0), (2, 0.5), (3, 1.0), (4, 10.0), (5, 25.0), (6, 100.0), (7, 300.0), (8, float("nan"))]
df = spark.createDataFrame(data, ["id", "value"])
splits = [-float("inf"), 0, 0.001, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]
result_bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="result").setHandleInvalid("keep").transform(df)
result_bucketizer.show()
The result is:

+---+-----+------+
| id|value|result|
+---+-----+------+
|  0| -1.0|   0.0|
|  1|  0.0|   1.0|
|  2|  0.5|   2.0|
|  3|  1.0|   3.0|
|  4| 10.0|   5.0|
|  5| 25.0|   6.0|
|  6|100.0|  14.0|
|  7|300.0|  14.0|
|  8|  NaN|  15.0|
+---+-----+------+
I want the result to be:

+---+-----+------+
| id|value|result|
+---+-----+------+
|  0| -1.0|  -inf|
|  1|  0.0|   0.0|
|  2|  0.5| 0.001|
|  3|  1.0|   1.0|
|  4| 10.0|  10.0|
|  5| 25.0|  20.0|
|  6|100.0| 100.0|
|  7|300.0| 100.0|
|  8|  NaN|   NaN|
+---+-----+------+ 

This is how I did it.

First, I created the DataFrame:

from pyspark.ml.feature import Bucketizer

data = [(0, -1.0), (1, 0.0), (2, 0.5), (3, 1.0), (4, 10.0), (5, 25.0), (6, 100.0), (7, 300.0), (8, float("nan"))]
df = spark.createDataFrame(data, ["id", "value"])
splits = [-float("inf"), 0, 0.001, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]
# dictionary mapping each bucket index to the left edge of its split: {index: split value}
splits_dict = {i: splits[i] for i in range(len(splits))}
Then I created the bucketizer as a separate variable:

# create the bucketizer
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="result")
# bucketized dataframe; handleInvalid='skip' drops the NaN row, which is why id 8 is missing from the output below
bucketed = bucketizer.setHandleInvalid('skip').transform(df)
To get the labels, I applied the replace function with the dict defined earlier:

bucketed = bucketed.replace(to_replace=splits_dict, subset=['result'])
bucketed.show()
Output:

+---+-----+---------+
| id|value|   result|
+---+-----+---------+
|  0| -1.0|-Infinity|
|  1|  0.0|      0.0|
|  2|  0.5|    0.001|
|  3|  1.0|      1.0|
|  4| 10.0|     10.0|
|  5| 25.0|     20.0|
|  6|100.0|    100.0|
|  7|300.0|    100.0|
+---+-----+---------+

Thank you, Daniel. This is indeed what I wanted, but I'm wondering whether there is a "Spark" way to do it - I need to bucketize multiple columns, each with different splits, and I'm looking for an efficient way to do that without using many UDFs.

I edited my answer. I realized a UDF was overkill just for mapping indices to labels, so I defined a dictionary and used the replace function to get the labels. I think this is a better approach.
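For the follow-up question about bucketizing several columns with different splits without many UDFs, here is a minimal sketch, assuming Spark 3.0 or later, where Bucketizer also accepts inputCols, outputCols and splitsArray so a single transformer handles all columns at once. The column names and split lists below are hypothetical, and the index-to-split mapping reuses the replace trick from the answer above.

from pyspark.ml.feature import Bucketizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [(0, -1.0, 5.0), (1, 0.5, 15.0), (2, 25.0, 45.0)]
df = spark.createDataFrame(data, ["id", "value_a", "value_b"])

# one list of splits per input column
splits_a = [-float("inf"), 0, 1, 10, 100, float("inf")]
splits_b = [-float("inf"), 10, 20, 50, float("inf")]

# a single Bucketizer over both columns (Spark 3.0+)
bucketizer = Bucketizer(
    splitsArray=[splits_a, splits_b],
    inputCols=["value_a", "value_b"],
    outputCols=["bucket_a", "bucket_b"],
    handleInvalid="skip",
)
bucketed = bucketizer.transform(df)

# map each bucket index back to the left edge of its split,
# one replace call per output column
for col, col_splits in [("bucket_a", splits_a), ("bucket_b", splits_b)]:
    bucketed = bucketed.replace(
        to_replace={i: s for i, s in enumerate(col_splits)}, subset=[col]
    )

bucketed.show()

The replace calls are plain column-value substitutions, so everything stays in native Spark expressions and no UDF is needed per column.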