Python: How do I create an integer row index?
I have a DataFrame:
+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+
I need the distinct item_code values, with an index assigned to each item_code, like this:
+---------+------+
|item_code| numId|
+---------+------+
|    I0938|     0|
|    I0009|     1|
|    I1097|     2|
|    C0723|     3|
|    I0010|     4|
|    C0117|     5|
|    I1067|     6|
|    C1083|     7|
|    B0250|     8|
|    C1346|     9|
+---------+------+
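I am not using monotonically_increasing_id because it returns a bigint.
monotonically_increasing_id only guarantees that the numbers are increasing; it does not guarantee the starting number or consecutive numbering. If you want to be sure of getting 0, 1, 2, 3, ... you can use the RDD function zipWithIndex().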
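To make the non-consecutive numbering concrete, here is a plain-Python sketch (no Spark required) of the bit layout that the Spark documentation describes for the current implementation: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The function name simulated_monotonic_id and the three sample partitions are made up for illustration; this is only a model of the documented layout, not Spark's actual code.

```python
def simulated_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Model of the documented layout: partition ID in the high bits,
    per-partition record number in the low 33 bits."""
    return (partition_id << 33) + row_in_partition


# Hypothetical split of five rows across three partitions.
partitions = [["I0938", "I0009"], ["I1097"], ["C0723", "I0010"]]

ids = [
    simulated_monotonic_id(partition_id, row_number)
    for partition_id, rows in enumerate(partitions)
    for row_number, _ in enumerate(rows)
]

print(ids)  # [0, 1, 8589934592, 17179869184, 17179869185]
```

The ids are increasing and unique, but they jump by 2^33 at every partition boundary, which is why they cannot serve as a 0, 1, 2, ... index.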
Since I am not very familiar with using Spark from Python, the example below uses Scala, but converting it should be easy.
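import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Sample data: the item_code column from the question
val df = Seq("I0938", "I0009", "I1097", "C0723", "I0010", "I0010",
             "C0117", "C0117", "I0009", "I0009", "I0010", "I1067",
             "I1067", "C1083", "B0250", "C1346")
  .toDF("item_code")

// Deduplicate, then pair each value with a consecutive 0-based index
val df2 = df.distinct.rdd
  .map { case Row(item: String) => item }
  .zipWithIndex()
  .toDF("item_code", "numId")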
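This gives you the desired result, with consecutive ids starting at 0 (distinct does not preserve the original row order, so the items may come out in a different order than in the question):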
+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+