
Scala: using rlike with a regex column in Spark 1.5.1


I want to filter a dataframe by applying the regex value stored in one column to the value of another column.

Example:
Id Column1 RegexColumm
1  Abc     A.*
2  Def     B.*
3  Ghi     G.*
Filtering the dataframe using RegexColumm should return the rows with ids 1 and 3.


Is there a way to do this in Spark 1.5.1? I don't want to use a UDF, since that may cause scalability issues; I'm looking for a native Spark API.
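For reference, the row-wise check the answers below build on is just ordinary JVM regex matching. A minimal plain-Scala sketch (no Spark, data mirroring the example above) — note that String.matches is an anchored match, i.e. the pattern must cover the whole string:

```scala
// Plain-Scala sketch of the per-row check, mirroring the example data above.
// String.matches performs an anchored (whole-string) regex match.
val rows = Seq((1, "Abc", "A.*"), (2, "Def", "B.*"), (3, "Ghi", "G.*"))

val matchingIds = rows.collect {
  case (id, value, regex) if value.matches(regex) => id
}

println(matchingIds)  // List(1, 3)
```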

You can convert the df to an rdd, then iterate over the rows and match the regex, filtering out only the matching data, without using any UDF.

Example:

import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

 df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//|  1|    Abc|     A.*|
//|  2|    Def|     B.*|
//|  3|    Ghi|     G.*|
//+---+-------+--------+

//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))

//convert df to rdd and match the regex using .map
val rdd = df.rdd.map(row => {
  val regex = row.getAs[String]("regexCol")
  val bool_col = row.getAs[String]("column1").matches(regex)
  Row.fromSeq(row.toSeq ++ Array(bool_col))
})

//convert rdd back to a dataframe, keep only rows where bool_col is true, then drop bool_col
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)

//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//|  1|    Abc|     A.*|
//|  3|    Ghi|     G.*|
//+---+-------+--------+

Update:

We can use .mapPartitions() instead of .map here.

You can use it as shown above; I think that is what you are looking for. Please let me know whether it helps you.
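The .mapPartitions() suggestion can be sketched in plain Scala by simulating one partition's Iterator (no Spark here; processPartition and the tuple layout are illustrative, not from the thread). The point of working per partition is that a compiled Pattern can be cached and reused across rows instead of being recompiled for every row:

```scala
import java.util.regex.Pattern
import scala.collection.mutable

// One partition's rows as (id, value, regex) tuples.
// The cache compiles each distinct regex once per partition, not once per row.
def processPartition(rows: Iterator[(Int, String, String)]): Iterator[(Int, String, String)] = {
  val cache = mutable.HashMap.empty[String, Pattern]
  rows.filter { case (_, value, regex) =>
    val p = cache.getOrElseUpdate(regex, Pattern.compile(regex))
    p.matcher(value).matches()
  }
}

val part = Iterator((1, "Abc", "A.*"), (2, "Def", "B.*"), (3, "Ghi", "G.*"))
println(processPartition(part).map(_._1).toList)  // List(1, 3)
```

In Spark the same function body would go inside df.rdd.mapPartitions { iter => ... }.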

But iterating over every row is less efficient and not very scalable. I'm looking for a native Spark API, similar to what's available in Spark > 2. – SanjanaS
@SanjanaS, check my updated answer.
@Shu: I think row.toSeq :+ bool_col would be slightly better performance-wise :)
@SanjanaS, did this answer help you solve the issue? If so, please accept the answer to close the resolved thread!
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]

scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]

scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
|  1|    Abc|        A.*|
|  3|    Ghi|        G.*|
+---+-------+-----------+
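One caveat not raised in the thread: Spark's rlike does an unanchored search (it succeeds if the pattern matches anywhere in the string), whereas the String.matches calls used in both answers are anchored to the whole string, so the two can disagree when a pattern only matches a substring:

```scala
import java.util.regex.Pattern

val value = "xAbc"

// matches: anchored, the pattern must cover the entire string.
println(value.matches("A.*"))                         // false

// rlike-style semantics: find a match anywhere in the string.
println(Pattern.compile("A.*").matcher(value).find()) // true
```

For the example data in this question the distinction doesn't matter, but it's worth checking which behavior you actually want before swapping one for the other.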