Most efficient way to do a custom one-hot encoding on a PySpark DataFrame?
Tags: python, apache-spark, pyspark, apache-spark-sql, pyspark-sql

Let's say we have this PySpark DataFrame:
+----+-------------+
| id | string_data |
+----+-------------+
| 1 | "test" |
+----+-------------+
| 2 | null |
+----+-------------+
| 3 | "9" |
+----+-------------+
| 4 | "deleted__" |
+----+-------------+
I would like to perform some operation on it that results in this DataFrame:
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1  | "test"      | 0                   | 0                       | 1                                     | 0                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2  | null        | 1                   | 0                       | 0                                     | 0                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3  | "9"         | 0                   | 1                       | 0                                     | 0                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4  | "deleted__" | 0                   | 0                       | 0                                     | 1                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
Each of the new columns holds a 1 or a 0 depending on the truth value. I have currently implemented this with a custom UDF that checks the value of the string_data column, but it is incredibly slow. I have also tried a UDF that does not create new columns but instead overwrites the original one with an encoded vector [1, 0, 0, ...]. That is also too slow, because we have to apply this to millions of rows and thousands of columns.
Is there a better way to do this? I understand that UDFs are not the most efficient way to solve things in PySpark, but I can't seem to find any built-in PySpark functions that fit.
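For illustration, the UDF-based approach described above looks roughly like this sketch; the exact checks (e.g. using isdigit for the number test) are my assumptions, not the actual code:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# One Python UDF per flag column; every row value is shipped from the JVM to a
# Python worker and back, which is what makes this approach so slow at scale.
is_null_udf = F.udf(lambda s: 1 if s is None else 0, IntegerType())
is_number_udf = F.udf(lambda s: 1 if s is not None and s.isdigit() else 0, IntegerType())
contains_test_udf = F.udf(lambda s: 1 if s is not None and 'test' in s else 0, IntegerType())

df = (df
    .withColumn('is_string_data_null', is_null_udf('string_data'))
    .withColumn('is_string_data_a_number', is_number_udf('string_data'))
    .withColumn('does_string_data_contain_keyword_test', contains_test_udf('string_data'))
    # ... and similarly for is_string_data_normal
)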
Any thoughts would be appreciated!

Answer:

Edit: sorry, I couldn't see the full expected output from my phone, so my previous answer was very incomplete.

Anyway, your operation has to be done in two steps, starting from this DataFrame:

df.show()
+---+-----------+
| id|string_data|
+---+-----------+
|  1|       test|
|  2|       null|
|  3|          9|
|  4|  deleted__|
+---+-----------+
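If you want to try this out locally, one way to build that sample DataFrame (this setup is my addition, not part of the original answer) is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'test'), (2, None), (3, '9'), (4, 'deleted__')],
    ['id', 'string_data'],
)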
Create the boolean fields based on conditions on the string_data field (coalesce, lit and col come from pyspark.sql.functions):

>>> from pyspark.sql.functions import coalesce, lit, col
>>> df = (df
    .withColumn('is_string_data_null', df.string_data.isNull())
    .withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
    .withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
    .withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
|  1|       test|              false|                  false|                                 true|           false|
|  2|       null|               true|                  false|                                false|           false|
|  3|          9|              false|                   true|                                false|           false|
|  4|  deleted__|              false|                  false|                                false|            true|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
Then cast the boolean fields to integer to get the 1/0 encoding:

>>> df = (df
    .withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
    .withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
    .withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
    .withColumn('is_string_normal', df.is_string_normal.cast('integer'))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
|  1|       test|                  0|                      0|                                    1|               0|
|  2|       null|                  1|                      0|                                    0|               0|
|  3|          9|                  0|                      1|                                    0|               0|
|  4|  deleted__|                  0|                      0|                                    0|               1|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
This should be far more performant than a UDF, as all the operations are done by Spark itself, so there is no context switch from Spark to Python.
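As a variation (my own sketch, not part of the original answer), the two steps can be collapsed into one by casting the boolean expressions directly; this also shows one way such flag columns could be generated in bulk when there are many of them:

from pyspark.sql.functions import coalesce, col, lit

# Boolean expressions for each flag; casting them to integer directly gives the 1/0 encoding.
flags = {
    'is_string_data_null': col('string_data').isNull(),
    'is_string_data_a_number': col('string_data').cast('integer').isNotNull(),
    'does_string_data_contain_keyword_test': coalesce(col('string_data'), lit('')).contains('test'),
}
flags['is_string_normal'] = ~(flags['is_string_data_null']
                              | flags['is_string_data_a_number']
                              | flags['does_string_data_contain_keyword_test'])

df = df.select('id', 'string_data',
               *[expr.cast('integer').alias(name) for name, expr in flags.items()])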