Apache Spark: optimizing a Spark DataFrame operation
I have a Spark (version 2.4) DataFrame with the following schema:
+----------+
| ColumnA |
+----------+
| 1000@Cat |
| 1001@Dog |
| 1000@Cat |
| 1001@Dog |
| 1001@Dog |
+----------+
I use the code below to conditionally apply a regular expression that strips the number prefixed to each string:
dataset.withColumn("ColumnA",when(regexp_extract(dataset.col("ColumnA"), "\\@(.*)", 1)
.equalTo(""), dataset.col("ColumnA"))
.otherwise(regexp_extract(dataset.col("ColumnA"), "\\@(.*)", 1)));
This produces a DataFrame in the following format:
+---------+
| ColumnA |
+---------+
| Cat |
| Dog |
| Cat |
| Dog |
| Dog |
+---------+
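For reference, the capture group used above behaves like this in plain Scala (the JVM regex engine is the same one Spark's regexp_extract uses); this is a standalone sketch, not Spark code:

```scala
// Plain-Scala sketch of what regexp_extract(col, "\\@(.*)", 1) returns.
// Spark's regexp_extract yields "" when the pattern does not match,
// which is exactly what the when(...equalTo("")) guard tests for.
val pattern = "\\@(.*)".r

def extract(s: String): String =
  pattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")

println(extract("1000@Cat")) // Cat
println(extract("Cat"))      // "" -> the guard keeps the original value
```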
This runs correctly and produces the desired output.

However, the regexp_extract operation is applied twice: once to check whether the extracted string is empty, and, if it is not, again to actually extract the value from the column.

Is there any optimization that could be applied to this code to make it perform better?

Use the split function instead of regexp_extract.
Please check the code and execution times below:
scala> df.show(false)
+--------+
|columna |
+--------+
|1000@Cat|
|1001@Dog|
|1000@Cat|
|1001@Dog|
|1001@Dog|
+--------+
scala> spark.time(df.withColumn("parsed",split($"columna","@")(1)).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000@Cat|Cat |
|1001@Dog|Dog |
|1000@Cat|Cat |
|1001@Dog|Dog |
|1001@Dog|Dog |
+--------+------+
Time taken: 14 ms
scala> spark.time { df.withColumn("ColumnA",when(regexp_extract($"columna", "\\@(.*)", 1).equalTo(""), $"columna").otherwise(regexp_extract($"columna", "\\@(.*)", 1))).show(false) }
+-------+
|ColumnA|
+-------+
|Cat |
|Dog |
|Cat |
|Dog |
|Dog |
+-------+
Time taken: 22 ms
Including a contains check for "@" in the column:
scala> spark.time(df.withColumn("parsed",when($"columna".contains("@"), lit(split($"columna","@")(1))).otherwise("")).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000@Cat|Cat |
|1001@Dog|Dog |
|1000@Cat|Cat |
|1001@Dog|Dog |
|1001@Dog|Dog |
+--------+------+
Time taken: 14 ms
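As a side note (a sketch of my own, not benchmarked above): the contains guard matters because values without an "@" would otherwise have no second split element. In Spark, split(col, "@")(1) returns null for such rows; in plain Scala the same guard looks like this:

```scala
// Why the contains("@") guard matters (assumption: values without "@"
// can occur in the data). Plain Scala's split on "Cat" yields a
// one-element array, so indexing (1) without the guard would throw.
def parse(s: String): String =
  if (s.contains("@")) s.split("@")(1) else ""

println(parse("1000@Cat")) // Cat
println(parse("Cat"))      // ""
```

A single-expression alternative worth trying (an assumption, not timed here) is regexp_replace($"columna", "^.*@", ""), which strips the prefix in one pass and leaves values without an "@" untouched.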