Python: most efficient way to perform custom one-hot encoding on a PySpark dataframe?


Let's say we have this PySpark dataframe:

+----+-------------+
| id | string_data |
+----+-------------+
| 1  | "test"      |
+----+-------------+
| 2  | null        |
+----+-------------+
| 3  | "9"         |
+----+-------------+
| 4  | "deleted__" |
+----+-------------+
I would like to perform some operation on it that results in this dataframe:

+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1  | "test"      | 0                   | 0                       | 1                                     | 0                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2  | null        | 1                   | 0                       | 0                                     | 0                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3  | "9"         | 0                   | 1                       | 0                                     | 0                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4  | "deleted__" | 0                   | 0                       | 0                                     | 1                     |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
where each new column gets a 1 or a 0 depending on the truth value. I currently have this implemented with a custom UDF that checks the value of the string_data column, but it is extremely slow. I also tried implementing a UDF that does not create new columns but instead overwrites the original one with an encoding vector [1, 0, 0, ...]. That is also far too slow, because we have to apply it to millions of rows and thousands of columns.
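For context, here is a minimal sketch of the kind of per-column UDF approach described above; the helper name and the exact check are illustrative assumptions, not the actual code:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Hypothetical reconstruction of the slow approach: one Python UDF per flag
    # column. Every row is shipped from the JVM to a Python worker and back,
    # which is what makes this so slow at millions of rows.
    is_a_number = udf(lambda s: 1 if s is not None and s.isdigit() else 0, IntegerType())

    df = df.withColumn('is_string_data_a_number', is_a_number(df.string_data))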

Is there a better way to do this? I know UDFs are not the most efficient way to solve things in PySpark, but I can't seem to find any built-in PySpark functions that fit.


Any ideas would be greatly appreciated!

EDIT: Sorry, I couldn't see the full expected output from my phone, so my previous answer was very incomplete.

Anyway, your operation has to be done in two steps, starting from this dataframe:

>>> df.show()
+---+-----------+
| id|string_data|
+---+-----------+
|  1|       test|
|  2|       null|
|  3|          9|
|  4|  deleted__|
+---+-----------+
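If you want to reproduce this locally, one way to build the sample dataframe (assuming an active SparkSession named spark) is:

    # Build the sample dataframe; the None becomes a null string_data value.
    df = spark.createDataFrame(
        [(1, 'test'), (2, None), (3, '9'), (4, 'deleted__')],
        ['id', 'string_data'],
    )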
  • Create boolean fields based on conditions on the string_data field:
    >>> from pyspark.sql.functions import coalesce, col, lit
    >>> df = (df
        .withColumn('is_string_data_null', df.string_data.isNull())
        .withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
        .withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
        .withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
        )
    >>> df.show()
    +---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
    | id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
    +---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
    |  1|       test|              false|                  false|                                 true|           false|
    |  2|       null|               true|                  false|                                false|           false|
    |  3|          9|              false|                   true|                                false|           false|
    |  4|  deleted__|              false|                  false|                                false|            true|
    +---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
    
  • Now that we have the columns, we can cast them to integers:
    >>> df = (df
        .withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
        .withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
        .withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
        .withColumn('is_string_normal', df.is_string_normal.cast('integer'))
        )
    >>> df.show()
    +---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
    | id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
    +---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
    |  1|       test|                  0|                      0|                                    1|               0|
    |  2|       null|                  1|                      0|                                    0|               0|
    |  3|          9|                  0|                      1|                                    0|               0|
    |  4|  deleted__|                  0|                      0|                                    0|               1|
    +---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
    
This should perform much better than a UDF, because all of the operations are done by Spark itself, so there is no context switching from Spark to Python.
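If you are generating thousands of such flag columns, the two steps can also be collapsed into a single select by casting the boolean expressions to integers directly. A minimal sketch under that assumption, reusing the column names from above:

    from pyspark.sql.functions import coalesce, col, lit

    # Build each flag expression once, then cast everything to integer in one pass.
    is_null = col('string_data').isNull()
    is_number = col('string_data').cast('integer').isNotNull()
    has_test = coalesce(col('string_data'), lit('')).contains('test')

    df = df.select(
        'id',
        'string_data',
        is_null.cast('integer').alias('is_string_data_null'),
        is_number.cast('integer').alias('is_string_data_a_number'),
        has_test.cast('integer').alias('does_string_data_contain_keyword_test'),
        (~(is_null | is_number | has_test)).cast('integer').alias('is_string_normal'),
    )

Because these are plain column expressions, Spark's optimizer can evaluate all of them in a single projection, which scales much better than chaining withColumn calls one by one.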