Scala Spark: splitting a column into an array and computing aggregates
I have a dataframe in the following format:
COL_1 COL_2 COL_3
----- ----- -----
TEXT1 TEXT2 ["www.a.com/abc", "www.b.com/dgh", "www.c.com/axy", "www.a.com/xyz"]
TEXT3 TEXT4 ["www.a.com/abc", "www.d.com/dgh", "www.a.com/axy", "www.f.com/xyz", "www.f.com/xyz", "www.a.com/xyz"]
TEXT5 TEXT6 ["www.v.com/abc", "www.c.com/axy"]
All columns are strings. What I want to do in Spark:
- Split column 3 into the individual URLs
- Extract the domain, then compute what percentage of the URLs in that row come from the domain "a.com" (a starting-point sketch for this step follows the question)
- If the "a.com" percentage exceeds some threshold for that row (e.g., 50% of the URLs in the row are from a.com), map each URL path into its own row alongside columns 1 and 2:
COL_1 COL_2 COL_MAP_REDUCED
----- ----- -----
TEXT1 TEXT2 abc
TEXT1 TEXT2 xyz
TEXT3 TEXT4 xyz
TEXT3 TEXT4 axy
I'm not looking for someone to solve this for me; I'm looking for pointers on how to get started, since my Google-fu is failing me.
Thanks in advance.
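Since these URLs carry no scheme, the host is simply everything before the first "/", so a split plus endsWith is enough for the domain step. A minimal sketch to get started (the sample rows and the urls/withDomain names are made up for illustration; splitting COL_3 itself is covered in the answers below):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

// One URL per row; getting to this shape is shown in the answers below.
val urls = Seq(
  ("TEXT1", "TEXT2", "www.a.com/abc"),
  ("TEXT1", "TEXT2", "www.b.com/dgh")
).toDF("COL_1", "COL_2", "COL_3")

// No scheme in these URLs, so the host is everything before the first "/".
val withDomain = urls
  .withColumn("domain", split(col("COL_3"), "/")(0))        // e.g. "www.a.com"
  .withColumn("is_a_com", col("domain").endsWith("a.com"))  // input for the percentage step

withDomain.show()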
import spark.implicits._   // assumes a SparkSession named spark is in scope

val df = // your dataframe creation code

// flatMap lets one input row produce several output rows
val res = df.flatMap { r =>
  val col1 = r.getAs[String]("col_1")
  val col2 = r.getAs[String]("col_2")
  val col3 = r.getAs[String]("col_3")
  // operate on col3 as you wish; suppose yourFunc returns e.g. Seq("x", "y")
  val col4 = yourFunc(col3)
  // emit one tuple per element of col4
  col4.map(element => (col1, col2, col3, element))
}.toDF("col_1", "col_2", "col_3", "col_4")
The new dataframe res will contain all the existing columns plus the new computed result. If you store this dataframe, a lot of data gets duplicated. You could instead store the result as an array, along the lines of (col1: String, col2: String, col3: String, col4: Seq[String]). If you then need each value of col4 repeated on its own row, you can use the explode function to expand each row's col4 into one row of the overall dataframe. The syntax is df.withColumn("col4", explode(col("col4"))).show().
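A minimal sketch of that explode call (assuming a SparkSession named spark is in scope; the packed frame is made up for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

// A frame holding an array column; explode emits one output row per element.
val packed = Seq(("TEXT1", "TEXT2", Seq("x", "y"))).toDF("col_1", "col_2", "col4")
packed.withColumn("col4", explode(col("col4"))).show()
// +-----+-----+----+
// |col_1|col_2|col4|
// +-----+-----+----+
// |TEXT1|TEXT2|   x|
// |TEXT1|TEXT2|   y|
// +-----+-----+----+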
Split column 3 into separate URLs.
Extract the domain name, then calculate the percentage of the URLs in that row that come from the domain "a.com".
If the "a.com" percentage exceeds some threshold for that row (e.g., 50% of the URLs in that row are from a.com), I want to map each URL path to a separate row (with columns 1 and 2).
You can get the percentage by taking the grouped counts of col_1 and col_2 from dataframe df1 and comparing them against the same grouped counts from dataframe df2 to produce your desired output.
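A hedged sketch of that count-and-compare step, using df1 (all exploded URLs) and df2 (only the a.com URLs) as built below; total, aCom, and pct are illustrative names:

import org.apache.spark.sql.functions._

// Total URLs per (COL_1, COL_2) from the fully exploded frame df1,
// and a.com-only counts from the filtered frame df2.
val total = df1.groupBy("COL_1", "COL_2").agg(count("*").as("total_cnt"))
val aCom  = df2.groupBy("COL_1", "COL_2").agg(count("*").as("a_cnt"))

// Left join so groups with zero a.com URLs survive with a_cnt = 0.
val pct = total
  .join(aCom, Seq("COL_1", "COL_2"), "left")
  .na.fill(0, Seq("a_cnt"))
  .withColumn("pct_a_com", col("a_cnt") * 100.0 / col("total_cnt"))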
Thanks for the question. How about outputting a separate row for each col4 value (COL_MAP_REDUCED in my example)? I've updated the answer; if this works, please upvote it.
scala> val df1 = df.withColumn("COL_3", regexp_replace(col("COL_3"), "[\\] \" \\[]", ""))
     |            .withColumn("COL_3", explode(split(col("COL_3"), ",")))
scala> df1.show(false)
+-----+-----+-------------+
|COL_1|COL_2|COL_3 |
+-----+-----+-------------+
|TEXT1|TEXT2|www.a.com/abc|
|TEXT1|TEXT2|www.b.com/dgh|
|TEXT1|TEXT2|www.c.com/axy|
|TEXT1|TEXT2|www.a.com/xyz|
|TEXT3|TEXT4|www.a.com/abc|
|TEXT3|TEXT4|www.d.com/dgh|
|TEXT3|TEXT4|www.a.com/axy|
|TEXT3|TEXT4|www.f.com/xyz|
|TEXT3|TEXT4|www.f.com/xyz|
|TEXT3|TEXT4|www.a.com/xyz|
|TEXT5|TEXT6|www.v.com/abc|
|TEXT5|TEXT6|www.c.com/axy|
+-----+-----+-------------+
scala> val df2 = df1.filter(col("COL_3").like("%a.com%"))
scala> df2.show
+-----+-----+-------------+
|COL_1|COL_2| COL_3|
+-----+-----+-------------+
|TEXT1|TEXT2|www.a.com/abc|
|TEXT1|TEXT2|www.a.com/xyz|
|TEXT3|TEXT4|www.a.com/abc|
|TEXT3|TEXT4|www.a.com/axy|
|TEXT3|TEXT4|www.a.com/xyz|
+-----+-----+-------------+
scala> df2.withColumn("COL_MAP_REDUCED", split(col("COL_3"),"/")(1)).drop("COL_3").show
+-----+-----+---------------+
|COL_1|COL_2|COL_MAP_REDUCED|
+-----+-----+---------------+
|TEXT1|TEXT2| abc|
|TEXT1|TEXT2| xyz|
|TEXT3|TEXT4| abc|
|TEXT3|TEXT4| axy|
|TEXT3|TEXT4| xyz|
+-----+-----+---------------+
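The one piece the answer above leaves out is the threshold check itself. A hedged sketch on top of pct from the earlier snippet; note that in the sample data both qualifying rows sit at exactly 50% a.com (2 of 4 and 3 of 6), so >= is used here:

// Keep only the (COL_1, COL_2) groups meeting the a.com cutoff,
// then reduce the surviving a.com URLs to their path segment.
val qualified = pct.filter(col("pct_a_com") >= 50).select("COL_1", "COL_2")

val result = df2
  .join(qualified, Seq("COL_1", "COL_2"))
  .withColumn("COL_MAP_REDUCED", split(col("COL_3"), "/")(1))
  .drop("COL_3")

result.show()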