
Python: How to create integer-indexed rows?


I have a dataframe:

+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+
I need the distinct item_code values, and to assign an index to each item_code, like this:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0938|    0|
|    I0009|    1|
|    I1097|    2|
|    C0723|    3|
|    I0010|    4|
|    C0117|    5|
|    I1067|    6|
|    C1083|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+

I am not using monotonically_increasing_id, because it returns a bigint.
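
For reference, a rough sketch of the monotonically_increasing_id approach I am trying to avoid (assuming the dataframe above is named df):

from pyspark.sql import functions as F

indexed = (df.select("item_code").distinct()
             .withColumn("numId", F.monotonically_increasing_id()))
# numId is a bigint; the values are increasing but generally not consecutive
# (the partition id is encoded in the upper bits), so this does not give 0, 1, 2, ...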

Using monotonically_increasing_id only guarantees that the numbers are increasing; it makes no guarantee about the starting value or about consecutive numbering. If you want to be sure to get 0, 1, 2, 3, ... you can use the RDD function zipWithIndex().

Since I am not too familiar with Spark together with Python, the example below uses Scala, but it should be easy to convert.

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq("I0938","I0009","I1097","C0723","I0010","I0010",
    "C0117","C0117","I0009","I0009","I0010","I1067",
    "I1067","C1083","B0250","C1346")
  .toDF("item_code")

// distinct item codes -> RDD of strings -> (code, index) pairs -> DataFrame
val df2 = df.distinct.rdd
  .map{case Row(item: String) => item}
  .zipWithIndex()
  .toDF("item_code", "numId")
This will give you the desired result:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+
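
For the pyspark tag, here is a rough Python translation of the Scala snippet above (a sketch only; it assumes the same sample data and an active SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(c,) for c in ["I0938", "I0009", "I1097", "C0723", "I0010", "I0010",
                    "C0117", "C0117", "I0009", "I0009", "I0010", "I1067",
                    "I1067", "C1083", "B0250", "C1346"]],
    ["item_code"])

# distinct item codes -> RDD of strings -> (item_code, index) pairs -> DataFrame
df2 = (df.distinct().rdd
         .map(lambda row: row.item_code)
         .zipWithIndex()
         .toDF(["item_code", "numId"]))

df2.show()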
