Python PySpark: assign an index to each unique value
I need to assign an index to each unique value. StringIndexer is not suitable because it orders values by their frequency. How can I write a function that produces output like this:
     activity_start activity_end  activity_start_code  activity_end_code
0           Stage_0      Stage_3                    0                  0
1           Stage_3      Stage_5                    1                  1
2           Stage_5      Stage_2                    2                  2
3           Stage_2      Stage_7                    3                  3
4           Stage_7          end                    4                  4
5           Stage_0      Stage_2                    0                  2
6           Stage_2      Stage_4                    3                  5
7           Stage_4      Stage_3                    5                  0
8           Stage_3      Stage_8                    1                  6
9           Stage_8          end                    6                  4
43          Stage_0      Stage_2                    0                  2
44          Stage_2      Stage_5                    3                  1
45          Stage_5      Stage_7                    2                  3
46          Stage_7          end                    4                  4
457         Stage_2      Stage_3                    3                  0
458         Stage_3      Stage_8                    1                  6
459         Stage_8          end                    6                  4
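For context on why StringIndexer does not fit: by default it orders labels by frequency (stringOrderType="frequencyDesc"), so the most frequent value gets index 0.0 regardless of where it first appears. Here is a minimal sketch of that behaviour (mine, not from the original post; it assumes a DataFrame df with the columns above already exists):

from pyspark.ml.feature import StringIndexer

# Default ordering is by descending frequency, not by first appearance.
indexer = StringIndexer(
    inputCol="activity_start",
    outputCol="activity_start_code",
    stringOrderType="frequencyDesc",  # the default: most frequent value -> 0.0
)
indexer.fit(df).transform(df).show()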
This can be done by dropping down to the RDD level. If you know the ids of the unique values in advance, the process can be simplified further. Here is some example code:
import random
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# A small random sample of (activity_start, activity_end) pairs.
rdd = sc.parallelize([(f"Stage{random.randint(0, 5)}", f"Stage{random.randint(0, 5)}") for _ in range(20)])
schema = StructType([
    StructField("activity_start", StringType(), False),
    StructField("activity_end", StringType(), False)
])
df = spark.createDataFrame(rdd, schema)
df.show()
+--------------+------------+
|activity_start|activity_end|
+--------------+------------+
| Stage1| Stage0|
| Stage1| Stage5|
| Stage2| Stage5|
| Stage5| Stage2|
| Stage2| Stage0|
| Stage0| Stage3|
| Stage1| Stage5|
| Stage1| Stage0|
| Stage4| Stage0|
| Stage4| Stage0|
| Stage5| Stage5|
| Stage1| Stage3|
| Stage3| Stage3|
| Stage5| Stage1|
| Stage2| Stage5|
| Stage2| Stage5|
| Stage3| Stage3|
| Stage5| Stage5|
| Stage3| Stage3|
| Stage3| Stage3|
+--------------+------------+
# Derive a code for each distinct start/end value from the digit in its name
# (the end codes are shifted by 1 in this example).
activity_start = (
    df.select("activity_start")
    .distinct()
    .rdd
    .map(lambda x: (x.activity_start, int(x.activity_start[5:][0])))
)
activity_end = (
    df.select("activity_end")
    .distinct()
    .rdd
    .map(lambda x: (x.activity_end, int(x.activity_end[5:][0]) + 1))
)
schema_start = StructType([
    StructField("activity_start", StringType(), False),
    StructField("start_code", IntegerType(), False)
])
schema_end = StructType([
    StructField("activity_end", StringType(), False),
    StructField("end_code", IntegerType(), False)
])
start_df = spark.createDataFrame(activity_start, schema_start)
end_df = spark.createDataFrame(activity_end, schema_end)
# Attach the codes back to the original DataFrame.
df.join(start_df, ["activity_start"], "left").join(end_df, ["activity_end"], "left").show()
+------------+--------------+----------+--------+
|activity_end|activity_start|start_code|end_code|
+------------+--------------+----------+--------+
| Stage5| Stage5| 5| 6|
| Stage5| Stage5| 5| 6|
| Stage5| Stage2| 2| 6|
| Stage5| Stage2| 2| 6|
| Stage5| Stage2| 2| 6|
| Stage5| Stage1| 1| 6|
| Stage5| Stage1| 1| 6|
| Stage3| Stage3| 3| 4|
| Stage3| Stage3| 3| 4|
| Stage3| Stage3| 3| 4|
| Stage3| Stage3| 3| 4|
| Stage3| Stage1| 1| 4|
| Stage3| Stage0| 0| 4|
| Stage2| Stage5| 5| 3|
| Stage1| Stage5| 5| 2|
| Stage0| Stage2| 2| 1|
| Stage0| Stage4| 4| 1|
| Stage0| Stage4| 4| 1|
| Stage0| Stage1| 1| 1|
| Stage0| Stage1| 1| 1|
+------------+--------------+----------+--------+
However, you still need the indexes to run from 0 to n in the order in which the elements first appear.
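If the codes really must run from 0 to n by order of first appearance, one possible approach (a sketch of mine, not part of the answer above; first_appearance_codes is a hypothetical helper, and it assumes the DataFrame's current row order defines "first appearance") is to number each column's distinct values with zipWithIndex:

from pyspark.sql import functions as F

def first_appearance_codes(df, col_name, code_name):
    # Tag each row with a monotonically increasing id so "first appearance"
    # can be recovered from the DataFrame's current row order.
    ordered = df.withColumn("_row_id", F.monotonically_increasing_id())
    # Earliest row id per distinct value, sorted, then numbered 0..n.
    return (
        ordered.groupBy(col_name)
        .agg(F.min("_row_id").alias("_first_seen"))
        .orderBy("_first_seen")
        .rdd.map(lambda r: r[0])
        .zipWithIndex()                 # (value, position by first appearance)
        .toDF([col_name, code_name])
    )

start_codes = first_appearance_codes(df, "activity_start", "activity_start_code")
end_codes = first_appearance_codes(df, "activity_end", "activity_end_code")
df.join(start_codes, ["activity_start"], "left") \
  .join(end_codes, ["activity_end"], "left") \
  .show()

The joins at the end mirror the joins in the answer above; the only difference is that the codes come from first-appearance order rather than from the digit embedded in the stage name.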