Python PySpark: assign an index to each unique value
I need to assign an index to each unique value. StringIndexer is not suitable because it orders values by their frequency. How can I write a function that produces output like this:
     activity_start activity_end  activity_start_code  activity_end_code
0           Stage_0      Stage_3                    0                  0
1           Stage_3      Stage_5                    1                  1
2           Stage_5      Stage_2                    2                  2
3           Stage_2      Stage_7                    3                  3
4           Stage_7          end                    4                  4
5           Stage_0      Stage_2                    0                  2
6           Stage_2      Stage_4                    3                  5
7           Stage_4      Stage_3                    5                  0
8           Stage_3      Stage_8                    1                  6
9           Stage_8          end                    6                  4
43          Stage_0      Stage_2                    0                  2
44          Stage_2      Stage_5                    3                  1
45          Stage_5      Stage_7                    2                  3
46          Stage_7          end                    4                  4
457         Stage_2      Stage_3                    3                  0
458         Stage_3      Stage_8                    1                  6
459         Stage_8          end                    6                  4
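For context on why StringIndexer does not fit: by default it orders labels by frequency (stringOrderType="frequencyDesc"), so the most frequent value gets index 0.0 regardless of where it first appears. Here is a minimal sketch of that behaviour (mine, not from the original post; it assumes a DataFrame df with the columns above already exists):

from pyspark.ml.feature import StringIndexer

# Default ordering is by descending frequency, not by first appearance.
indexer = StringIndexer(
    inputCol="activity_start",
    outputCol="activity_start_code",
    stringOrderType="frequencyDesc",  # the default: most frequent value -> 0.0
)
indexer.fit(df).transform(df).show()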
This can be done by dropping down to the RDD level. If you know the ids of the unique values in advance, the process can be simplified further. Here is some example code:
import random
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# A small random sample of (activity_start, activity_end) pairs.
rdd = sc.parallelize([(f"Stage{random.randint(0, 5)}", f"Stage{random.randint(0, 5)}") for _ in range(20)])
schema = StructType([
    StructField("activity_start", StringType(), False),
    StructField("activity_end", StringType(), False)
])
df = spark.createDataFrame(rdd, schema)
df.show()
+--------------+------------+
|activity_start|activity_end|
+--------------+------------+
| Stage1| Stage0|
| Stage1| Stage5|
| Stage2| Stage5|
| Stage5| Stage2|
| Stage2| Stage0|
| Stage0| Stage3|
| Stage1| Stage5|
| Stage1| Stage0|
| Stage4| Stage0|
| Stage4| Stage0|
| Stage5| Stage5|
| Stage1| Stage3|
| Stage3| Stage3|
| Stage5| Stage1|
| Stage2| Stage5|
| Stage2| Stage5|
| Stage3| Stage3|
| Stage5| Stage5|
| Stage3| Stage3|
| Stage3| Stage3|
+--------------+------------+
# Derive a code for each distinct start/end value from the digit in its name
# (the end codes are shifted by 1 in this example).
activity_start = (
    df.select("activity_start")
    .distinct()
    .rdd
    .map(lambda x: (x.activity_start, int(x.activity_start[5:][0])))
)
activity_end = (
    df.select("activity_end")
    .distinct()
    .rdd
    .map(lambda x: (x.activity_end, int(x.activity_end[5:][0]) + 1))
)
schema_start = StructType([
    StructField("activity_start", StringType(), False),
    StructField("start_code", IntegerType(), False)
])
schema_end = StructType([
    StructField("activity_end", StringType(), False),
    StructField("end_code", IntegerType(), False)
])
start_df = spark.createDataFrame(activity_start, schema_start)
end_df = spark.createDataFrame(activity_end, schema_end)
# Attach the codes back to the original DataFrame.
df.join(start_df, ["activity_start"], "left").join(end_df, ["activity_end"], "left").show()
+------------+--------------+----------+--------+
|activity_end|activity_start|start_code|end_code|
+------------+--------------+----------+--------+
| Stage5| Stage5| 5| 6|
| Stage5| Stage5| 5| 6|
| Stage5| Stage2| 2| 6|
| Stage5| Stage2| 2| 6|
| Stage5| Stage2| 2| 6|
| Stage5| Stage1| 1| 6|
| Stage5| Stage1| 1| 6|
| Stage3| Stage3| 3| 4|
| Stage3| Stage3| 3| 4|
| Stage3| Stage3| 3| 4|
| Stage3| Stage3| 3| 4|
| Stage3| Stage1| 1| 4|
| Stage3| Stage0| 0| 4|
| Stage2| Stage5| 5| 3|
| Stage1| Stage5| 5| 2|
| Stage0| Stage2| 2| 1|
| Stage0| Stage4| 4| 1|
| Stage0| Stage4| 4| 1|
| Stage0| Stage1| 1| 1|
| Stage0| Stage1| 1| 1|
+------------+--------------+----------+--------+
However, you still need the indexes to run from 0 to n in the order in which the elements first appear.
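If the codes really must run from 0 to n by order of first appearance, one possible approach (a sketch of mine, not part of the answer above; first_appearance_codes is a hypothetical helper, and it assumes the DataFrame's current row order defines "first appearance") is to number each column's distinct values with zipWithIndex:

from pyspark.sql import functions as F

def first_appearance_codes(df, col_name, code_name):
    # Tag each row with a monotonically increasing id so "first appearance"
    # can be recovered from the DataFrame's current row order.
    ordered = df.withColumn("_row_id", F.monotonically_increasing_id())
    # Earliest row id per distinct value, sorted, then numbered 0..n.
    return (
        ordered.groupBy(col_name)
        .agg(F.min("_row_id").alias("_first_seen"))
        .orderBy("_first_seen")
        .rdd.map(lambda r: r[0])
        .zipWithIndex()                 # (value, position by first appearance)
        .toDF([col_name, code_name])
    )

start_codes = first_appearance_codes(df, "activity_start", "activity_start_code")
end_codes = first_appearance_codes(df, "activity_end", "activity_end_code")
df.join(start_codes, ["activity_start"], "left") \
  .join(end_codes, ["activity_end"], "left") \
  .show()

The joins at the end mirror the joins in the answer above; the only difference is that the codes come from first-appearance order rather than from the digit embedded in the stage name.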