Spark UDF in Java/Scala to extract relevant data
I have a dataframe with a column that needs cleaning. I am looking for a regex pattern that can be applied in a Spark UDF in Java/Scala and that extracts the valid content from the string. A sample input row of the userId column, as in the dataframe below:
[[105286112,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090439,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818926,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]
+--------------------+--------------------+
| dt_geo_cat_brand| userId |
+--------------------+--------------------+
|2017-10-30_17-18 ...|[[133207500,2017-...|
|2017-10-19_21-22 ...|[[194112773,2017-...|
|2017-10-29_17-18 ...|[[274188233,2017-...|
|2017-10-29_14-16 ...|[[86281353,2017-1...|
|2017-10-01_09-10 ...|[[92478766,2017-1...|
|2017-10-09_17-18 ...|[[156663365,2017-...|
|2017-10-06_17-18 ...|[[111869972,2017-...|
|2017-10-13_09-10 ...|[[64404465,2017-1...|
|2017-10-13_07-08 ...|[[146355663,2017-...|
|2017-10-22_21-22 ...|[[54096488,2017-1...|
+--------------------+--------------------+
root
|-- dt_geo_cat_brand: string (nullable = true)
|-- userId: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Expected transformation of the column named userId: a string like the one below:
105286112|115090439|29818926
I need the logic/approach to modify the userId column so that a UDF produces exactly this. Can it be done with a regex or some other approach?
Desired output:
+--------------------+--------------------+
| dt_geo_cat_brand| userId |
+--------------------+--------------------+
|2017-10-30_17-18 ...|133207500,1993333444|
|2017-10-19_21-22 ...|122122212,3432323333|
|2017-10-29_17-18 ...|274188233,8869696966|
|2017-10-29_14-16 ...|862813534,444344444,43444343434|
|2017-10-01_09-10 ...|92478766,880342342,4243244432,5554335535|
+--------------------+--------------------+
and so on…
Wrote a UDF using the regex below; it extracts the required fields:
import ss.implicits._
val df = ss.read.csv(path).as("")
df.show()
val reg = "\\[\\[(\\d*).*\\],\\s*\\[(\\d*).*\\],\\s*\\[(\\d*).*" // regex which can extract the required data
val input = "[[105286112,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090439,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818926,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]" // input string
val mat = reg.r.findAllIn(input) // extracting the data
println(mat)
while (mat.hasNext) {
  mat.next()
  println(mat.group(1) + "|" + mat.group(2) + "|" + mat.group(3)) // prints the 3 extracted fields
}
Output:
105286112|115090439|29818926
Using a UDF:
import ss.implicits._
import org.apache.spark.sql.functions.udf

val reg = "\\[\\[(\\d*).*\\],\\s*\\[(\\d*).*\\],\\s*\\[(\\d*).*"
def reg_func = { (s: String) =>
  val mat = reg.r.findAllIn(s)
  var out = ""
  while (mat.hasNext) {
    mat.next()
    out = mat.group(1) + "|" + mat.group(2) + "|" + mat.group(3)
  }
  out
}
val reg_udf = udf(reg_func)
val df = ss.read.text(path)
  .withColumn("Extracted_fields", reg_udf($"value"))
df.show(false)
Input: created a few sample records
[[105286112,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090439,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818926,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]
[[105286113,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090440,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818927,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]
Output:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+
|value |Extracted_fields |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+
|[[105286112,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090439,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818926,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]|105286112|115090439|29818926|
|[[105286113,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090440,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818927,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]|105286113|115090440|29818927|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+
You don't need a regex to solve this. The data is formatted as an array of structs and, looking at the schema, what you want is the _1 string of each struct. This can be solved with a UDF that extracts the values and then turns everything into a single string with mkString("|") to get the expected output:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val extract_id = udf((arr: Seq[Row]) => {
  arr.map(_.getAs[String](0)).mkString("|")
})
df.withColumn("userId", extract_id($"userId"))
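The per-element step inside that UDF is plain Scala collections code; here is a standalone sketch (no Spark needed) with tuples standing in for the Spark Rows, using the sample values from the question:

```scala
// Tuples stand in for the struct elements; _1 is the id field, as in the schema.
val row = Seq(
  ("105286112", "2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX"),
  ("115090439", "2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX"),
  ("29818926",  "2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX")
)
// Take the first element of each pair and join with "|".
val joined = row.map(_._1).mkString("|")
println(joined)  // 105286112|115090439|29818926
```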
Addition #1 based on the comments:
If you want to save the result in csv files partitioned on dt_geo_cat_brand (with every value on its own row), you can do it as follows. First, return a list from the udf instead of a string, and use explode:
val extract_id = udf((arr: Seq[Row]) => {
  arr.map(_.getAs[String](0))
})
val df2 = df.withColumn("userId", explode(extract_id($"userId")))
Then use partitionBy("dt_geo_cat_brand") when saving. This will create a folder structure based on the values in the dt_geo_cat_brand column. Depending on the partitioning, the number of csv files in each folder can vary, but each folder will only contain values belonging to a single dt_geo_cat_brand value (if you want a single file per folder and have enough memory, use repartition(1) before saving).
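The save step described above can be sketched as follows; df2 is the exploded dataframe from the snippet above, and outputPath is a placeholder for your target location (requires a running Spark session):

```scala
// Writes one folder per distinct dt_geo_cat_brand value, e.g.
// outputPath/dt_geo_cat_brand=2017-10-30_17-18 .../part-*.csv
// repartition(1) is optional and yields a single csv file per folder.
df2.repartition(1)
  .write
  .partitionBy("dt_geo_cat_brand")
  .csv(outputPath)
```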
Addition #2 based on the comments:
To save as separate files without using partitionBy, you can do the following (the partitionBy approach is still recommended). First, find all distinct values in dt_geo_cat_brand:
val vals = df.select("dt_geo_cat_brand").distinct().as[String].collect()
For each value, filter the dataframe and save it (here using the exploded df2 dataframe from addition #1):
vals.foreach { v =>
  df2.filter($"dt_geo_cat_brand" === v)
    .write
    .csv(s"$baseOutputBucketPath=$v/")
}
Alternatively, if you used the udf, do not use the exploded dataframe but split on "|" instead:
vals.foreach { v =>
  df.filter($"dt_geo_cat_brand" === v)
    .select(split($"userId", "\\|").as("userId"))
    .write
    .csv(s"$baseOutputBucketPath=$v/")
}
Comments:
- But why do you want to use a regex to extract data from the dataframe?
- I need the extracted (numeric) values to generate bitmaps at a later point in the model. The data looks like this because I used Cassandra to group the data by key and combine the values belonging to each key.
- Hi, is it possible to extract more than 3 fields? Currently you use mat.group(1) + "|" + mat.group(2) + "|" + mat.group(3); can we make it dynamic instead of hardcoding?
- Hi @Shaido, help! The udf produces output like ABCD,2323|4343434|644646|54545|4756456 and EFGH,456464564|432444|4244554|525454, and I want to store the ids based on the first column (ABCD/EFGH): the ABCD partition should have a single csv file with the ids separated by newlines. I tried: import sparkSession.implicits._; dataframe.collect.foreach(t => { val dt_geo_cat_brand = t.dt_geo_cat_brand; val mbid = t.mbid.split("\\|").toList.toDF("mbid"); mbid.repartition(1).write.csv(s"$baseOutputBucketPath=$dt_geo_cat_brand/") }) but it fails due to memory issues. How can this be parallelized?
- Hi @CodeReaper, since this was a bit too long to answer in a comment, I made an addition to the answer above. Hope it helps :)
- If I use partitionBy, will it cause a lot of shuffle? The reason I originally concatenated the ids for each row was to avoid a partitionBy on the first column. When I first loaded the whole data and partitioned on the first column, writing the partitions took very long, so I used Cassandra to hand me the ids already grouped. I want to avoid shuffling the data.
- @CodeReaper: partitionBy should not cause any data shuffle as long as you save the data on a distributed file system that the nodes belong to (e.g. HDFS). However, repartition will, and collect will put all the data on the driver node. This answer explains it a bit more clearly:
- If we use explode, each row is technically exploded into as many rows as there are numeric values; e.g. can ABCD,444577788899 be split into three rows? Then when we use partitionBy, doesn't it have to fetch the values corresponding to each key from all the executors? Also, I have already processed the data with your last udf; I am looking for a way to write the data out in parallel after that stage.
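One of the comments above asks whether the three hard-coded mat.group(...) calls can be made dynamic. They can: instead of one pattern with a fixed number of groups, match every "[<digits>," element and join however many ids turn up. A sketch (my generalization, not from the original answer):

```scala
// Matches each "[<digits>," element; works for any number of ids per row.
val idPattern = "\\[(\\d+),".r

def extractAllIds(s: String): String =
  idPattern.findAllMatchIn(s).map(_.group(1)).mkString("|")

val input = "[[105286112,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [115090439,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX], [29818926,2017-11-19_14-16 >> ABCDE >> GrocersRetail >> XXX]]"
println(extractAllIds(input))  // 105286112|115090439|29818926
```

Wrapped with udf(extractAllIds _), this drops straight into the withColumn call shown earlier.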