PySpark DataFrame: how to label-encode a string array column with `StringIndexer`
As we know, StringIndexer can do the equivalent of LabelEncoder() on a string column, but doing the same on a column of string arrays is not straightforward.
# input
df.show()
+--------------------------------------+
| tags|
+--------------------------------------+
| [industry, display, Merchants]|
| [smart, swallow, game, Experience]|
| [social, picture, social]|
| [default, game, us, adventure]|
| [financial management, loan, product]|
| [system, profile, optimization]|
...
# After applying LabelEncoder() to the `tags` column
...
+--------------------------------------+
| tags|
+--------------------------------------+
| [0, 1, 2]|
| [3, 4, 4, 5]|
| [6, 7, 6]|
| [8, 4, 9, 10]|
| [11, 12, 13]|
| [14, 15, 16]|
The code below is Scala; the Python version would be very similar:
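import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.{collect_list, explode, monotonically_increasing_id}
import spark.implicits._ // assumes a SparkSession named `spark`; enables the 'col syntax

// add a unique id to each row, then explode the array into one row per tag
val df2 = df.withColumn("id", monotonically_increasing_id()).select('id, explode('tags).as("tag"))
val indexer = new StringIndexer()
  .setInputCol("tag")
  .setOutputCol("tagIndex")
val indexed = indexer.fit(df2).transform(df2)
// in the final step, group by id to collect the indices back into one array per row
// (note: collect_list does not guarantee the original element order after the shuffle)
val dfFinal = indexed.groupBy('id).agg(collect_list('tagIndex))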