Scala: using rlike with a regex column in Spark 1.5.1
I want to filter a dataframe by applying the regex value stored in one of its columns to the value in another column.
Example:
Id Column1 RegexColumm
1 Abc A.*
2 Def B.*
3 Ghi G.*
Filtering the dataframe using RegexColumm should return the rows with ids 1 and 3.
Is there a way to do this in Spark 1.5.1? I don't want to use a UDF, as that may cause scalability issues; I'm looking for a Spark-native API.

You can convert the df to an rdd, then iterate over the rows with .map, match the regex, and keep only the matching rows, without using any UDF.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 2| Def| B.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))
//convert df to rdd and match the regex using .map
val rdd = df.rdd.map(row => {
  val regex = row.getAs[String]("regexCol")
  val bool_col = row.getAs[String]("column1").matches(regex)
  Row.fromSeq(row.toSeq ++ Array(bool_col))
})
//convert rdd back to a dataframe, keep rows where bool_col is true, then drop the helper column
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
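One subtlety worth keeping in mind when substituting matches for rlike: Java's String.matches anchors the pattern to the entire string, while SQL's rlike performs an unanchored substring search. A quick plain-Scala check (no Spark required) illustrates the difference:

```scala
import java.util.regex.Pattern

object MatchesVsRlike extends App {
  // matches succeeds only when the pattern covers the whole string
  println("Abc".matches("A.*"))  // true: "A.*" spans all of "Abc"
  println("Abc".matches("b"))    // false: matches is anchored, "b" is only a substring

  // rlike-style substring semantics correspond to an unanchored find
  println(Pattern.compile("b").matcher("Abc").find())  // true
}
```

For patterns like A.* this makes no practical difference, but a bare substring pattern such as "b" would behave differently under the two semantics.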
Update:
We can use .mapPartitions() instead of .map, so the matching function runs once per partition rather than once per row.
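As a rough plain-Scala sketch of the idea (hypothetical tuple data standing in for the rdd; grouped() mimics partitions, and the iterator-to-iterator function is the shape you would pass to rdd.mapPartitions):

```scala
object MapPartitionsSketch extends App {
  // Hypothetical rows: (id, column1, regexCol)
  val rows = Seq((1, "Abc", "A.*"), (2, "Def", "B.*"), (3, "Ghi", "G.*"))

  // Iterator-to-iterator filter, invoked once per partition rather than per row
  def keepMatches(part: Iterator[(Int, String, String)]): Iterator[(Int, String, String)] =
    part.filter { case (_, value, regex) => value.matches(regex) }

  // grouped(2) stands in for two partitions of the rdd
  val kept = rows.grouped(2).flatMap(p => keepMatches(p.iterator)).toList
  println(kept.map(_._1))  // List(1, 3)
}
```

Any per-partition setup (for example, caching compiled patterns in a map) can then be done once inside the function instead of once per row.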
You can use it as shown above; I think this is what you're looking for. Do let me know whether it helps.
But iterating over every row is less efficient and not very scalable; I'm looking for a native Spark API, similar to what exists in Spark > 2.
@SanjanaS, check my updated answer.
@Shu: I think row.toSeq :+ bool_col would be slightly better performance-wise :)
@SanjanaS, did this answer help you solve the issue? If so, please accept the answer to close the resolved thread!
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]
scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]
scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
| 1| Abc| A.*|
| 3| Ghi| G.*|
+---+-------+-----------+
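Under the hood this typed filter is just a per-row matches call; stripped of Spark, the same predicate applied to hypothetical tuples (id, Column1, RegexColumm) behaves like:

```scala
object RowFilterDemo extends App {
  val data = Seq((1, "Abc", "A.*"), (2, "Def", "B.*"), (3, "Ghi", "G.*"))

  // Keep rows whose Column1 fully matches the regex stored in RegexColumm
  val required = data.filter { case (_, c1, re) => c1.matches(re) }

  println(required.map(_._1))  // List(1, 3)
}
```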