Apache Spark: assign a number to each word in a collect() result


apache-spark pyspark


I collected the data of a DataFrame column in Spark:

temp = df.select('item_code').collect()

Result: 

[Row(item_code=u'I0938'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0010'),
 Row(item_code=u'I0010'),
 Row(item_code=u'C0723'),
 Row(item_code=u'I1097'),
 Row(item_code=u'C0117'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0010'),
 Row(item_code=u'I0009'),
 Row(item_code=u'C0117'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0596')]

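Now I want to assign a number to each word, and a repeated word must get the same number every time. I am using Spark RDDs, not pandas.

Please help me solve this problem.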

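You can create a new DataFrame that holds only the distinct values. This has to start from the DataFrame itself (df), not from the list returned by collect():

val data = df.select("item_code").distinct()

Now you can give each distinct value a unique ID:

import org.apache.spark.sql.functions._

// monotonically_increasing_id() replaces monotonicallyIncreasingId, which is deprecated since Spark 2.0.
// The generated IDs are unique 64-bit values, but they are not guaranteed to be consecutive.
val dataWithId = data.withColumn("uniqueID", monotonically_increasing_id())

Now you can join this new DataFrame with the original one and select the unique ID:

val tempWithId = df.join(dataWithId, "item_code").select("item_code", "uniqueID")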
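For a small result that has already been collected to the driver, the same three steps (distinct values, a number per value, a join back to the rows) can be sketched in plain Python. This is only an illustration under assumptions: the namedtuple Row is a stand-in for pyspark.sql.Row so the snippet runs without Spark, and assign_ids is a hypothetical helper, not a Spark function.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption), so the sketch runs without Spark.
Row = namedtuple("Row", ["item_code"])

# A few rows shaped like the collect() output in the question.
temp = [Row("I0938"), Row("I0009"), Row("I0010"), Row("I0010"),
        Row("C0723"), Row("I0009")]


def assign_ids(rows):
    """Give each distinct item_code a number, in order of first appearance."""
    ids = {}
    for row in rows:
        # setdefault stores a new number only the first time a code is seen.
        ids.setdefault(row.item_code, len(ids))
    return ids


code_to_id = assign_ids(temp)

# "Join" the numbers back onto the original rows.
numbered = [(row.item_code, code_to_id[row.item_code]) for row in temp]
print(numbered)
```

Unlike the DataFrame approach, this runs entirely on the driver, so it is only reasonable when the collected list fits comfortably in memory.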

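The Scala code above is only meant as a pointer; PySpark should offer similar functions.

Now, can you help me assign a number of type int rather than bigint?

@phongngguyen: Take a look at the answer here: . You have to cast that column to IntegerType.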
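On the int versus bigint point, it may help to see why the generated IDs are 64-bit in the first place. The Spark documentation describes the layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below only mimics that documented layout; monotonic_id and fits_in_int32 are hypothetical helpers, and the two-partition setup is an assumed example.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def monotonic_id(partition_index, row_in_partition):
    """Mimic the documented layout: partition in the high bits, row number in the low 33 bits."""
    return (partition_index << 33) + row_in_partition


def fits_in_int32(value):
    """True if the value can be stored in a 32-bit signed integer without loss."""
    return INT32_MIN <= value <= INT32_MAX


# Two hypothetical partitions with three rows each.
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)
print([fits_in_int32(i) for i in ids])
```

Under this layout, every ID from a partition other than the first is too large for a 32-bit integer, so casting to IntegerType is only safe when the IDs are known to be small. If consecutive small numbers are needed, one option to consider is the RDD method zipWithIndex applied to the distinct values.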