How to create an integer index for rows?

python apache-spark pyspark apache-spark-sql

I have a DataFrame:

+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+

I need the distinct item_code values, with an index assigned to each item_code, like this:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0938|    0|
|    I0009|    1|
|    I1097|    2|
|    C0723|    3|
|    I0010|    4|
|    C0117|    5|
|    I1067|    6|
|    C1083|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+

I don't want to use monotonically_increasing_id, because it returns a bigint.

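monotonically_increasing_id only guarantees that the numbers are increasing; it does not guarantee the starting number or that the numbers are consecutive. If you want to be sure to get 0, 1, 2, 3, ... you can use the RDD function zipWithIndex().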
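To make the non-consecutive numbering concrete: the Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below only mimics that layout; it is not Spark code, and the function name `monotonic_id` is made up for illustration.

```python
def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented bit layout: partition ID in the upper bits,
    per-partition record number in the lower 33 bits (illustrative only)."""
    return (partition_id << 33) + row_in_partition


# Two partitions with three rows each: increasing, but with a large gap
# at the partition boundary.
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

Under this layout the IDs are consecutive only when all rows sit in a single partition.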

Since I'm not very familiar with Spark in combination with Python, the example below uses Scala, but converting it should be easy.

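import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Sample data: the item_code column from the question, duplicates included
val df = Seq("I0938","I0009","I1097","C0723","I0010","I0010",
    "C0117","C0117","I0009","I0009","I0010","I1067",
    "I1067","C1083","B0250","C1346")
  .toDF("item_code")

// Deduplicate, unwrap each Row to its String, then pair it with a 0-based index.
// zipWithIndex produces Long indices.
val df2 = df.distinct.rdd
  .map { case Row(item: String) => item }
  .zipWithIndex()
  .toDF("item_code", "numId")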

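This gives you the desired consecutive numbering (distinct does not preserve the original row order, so which item gets which index may differ):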

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+
