
Scala Spark: split a column into an array and compute aggregates

I have a dataframe in the following format:

COL_1    COL_2    COL_3
-----    -----    -----
TEXT1     TEXT2    ["www.a.com/abc", "www.b.com/dgh", "www.c.com/axy", "www.a.com/xyz"]
TEXT3     TEXT4    ["www.a.com/abc", "www.d.com/dgh", "www.a.com/axy", "www.f.com/xyz", "www.f.com/xyz", "www.a.com/xyz"]
TEXT5     TEXT6    ["www.v.com/abc", "www.c.com/axy"]
All columns are strings. What I want to do in Spark:

  • Split column 3 into the individual URLs
  • Extract the domain name, then compute the percentage of URLs in the row that come from the domain "a.com"
  • If the percentage for "a.com" exceeds some threshold for that row (e.g. 50% of the URLs in the row are from a.com), emit each URL path as a separate row together with columns 1 and 2
For the example above, the output would look something like this:

COL_1    COL_2    COL_MAP_REDUCED
-----    -----    -----
TEXT1     TEXT2    abc
TEXT1     TEXT2    xyz
TEXT3     TEXT4    xyz
TEXT3     TEXT4    axy
I'm not looking for someone to solve this for me; I'm looking for pointers on how to get started, since my Google-fu is failing me.

Thanks, everyone.
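For reference, the per-row logic described above can be sketched in plain Scala before wiring it into Spark. The threshold, the naive substring domain check, and the helper name are assumptions for illustration, not part of the question:

```scala
// Sketch of the per-row logic, outside Spark. Assumptions: col3 is the
// JSON-ish string from the question, the threshold is 50%, and a URL
// "comes from a.com" if it contains that substring (naive check).
def pathsIfDominant(col3: String,
                    domain: String = "a.com",
                    threshold: Double = 0.5): Seq[String] = {
  // Strip brackets, quotes, and spaces, then split into individual URLs.
  val urls = col3.replaceAll("[\\[\\]\" ]", "").split(",").filter(_.nonEmpty).toSeq
  val matching = urls.filter(_.contains(domain))
  if (urls.nonEmpty && matching.size.toDouble / urls.size >= threshold)
    matching.map(_.split("/").last) // keep only the path after the domain
  else
    Seq.empty
}
```

On the first sample row this yields `Seq("abc", "xyz")`; inside Spark the same function could be wrapped in a UDF or used in a Dataset `flatMap`.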

    val df = // your dataframe creation code
    val res = df.flatMap { r =>
      val col1 = r.getAs[String]("col_1")
      val col2 = r.getAs[String]("col_2")
      val col3 = r.getAs[String]("col_3")

      // operate on col3 as you wish
      val col4 = yourFunc(col3) // e.g. returns something like Seq("x", "y")

      // if col4 is a Seq or Array, flatMap flattens it so that each
      // element becomes its own output row, with col1..col3 duplicated
      col4.map(element => (col1, col2, col3, element))
    }
The new dataframe res will contain all the existing columns plus the newly computed results.

If you store this dataframe, a lot of data will be duplicated. You can also store the result as an array, something like (col1: String, col2: String, col3: String, col4: Seq[String]). If you need each value of col4 repeated on a separate row, you can use the explode function to expand each row's col4 into one row of the overall dataframe. The syntax is df.withColumn("col4", explode(col("col4"))).show()

Split column 3 into the individual URLs

Extract the domain name, then compute the percentage of URLs in the row that come from the domain "a.com"

If the percentage for "a.com" exceeds some threshold for that row (e.g. 50% of the URLs in that row are from a.com), map each URL path to a separate row together with columns 1 and 2

You can get the percentage by taking the group counts of col_1, col_2 from dataframe df1 and comparing them with the same counts from dataframe df2.
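The group-count idea can be checked on plain collections first; in Spark the two counts would come from groupBy("COL_1", "COL_2").count() on df1 and on df2, joined on the keys. The data below is the first sample row, already flattened:

```scala
// Exploded rows for one (COL_1, COL_2) group, as in df1
// (key shortened to COL_1 for the sketch).
val df1Rows = Seq(
  ("TEXT1", "www.a.com/abc"), ("TEXT1", "www.b.com/dgh"),
  ("TEXT1", "www.c.com/axy"), ("TEXT1", "www.a.com/xyz"))
// Total URLs per key (df1 counts) and a.com URLs per key (df2 counts).
val total   = df1Rows.groupBy(_._1).map { case (k, v) => k -> v.size }
val matched = df1Rows.filter(_._2.contains("a.com"))
                     .groupBy(_._1).map { case (k, v) => k -> v.size }
// Share of a.com URLs per key; 0.5 for TEXT1 with this data.
val pct = total.map { case (k, n) => k -> matched.getOrElse(k, 0).toDouble / n }
```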

For your desired output:

scala> df2.withColumn("COL_MAP_REDUCED", split(col("COL_3"),"/")(1)).drop("COL_3").show
+-----+-----+---------------+
|COL_1|COL_2|COL_MAP_REDUCED|
+-----+-----+---------------+
|TEXT1|TEXT2|            abc|
|TEXT1|TEXT2|            xyz|
|TEXT3|TEXT4|            abc|
|TEXT3|TEXT4|            axy|
|TEXT3|TEXT4|            xyz|
+-----+-----+---------------+

Comment: Thanks. How about a separate row for each col4 value (COL_MAP_REDUCED in my example)? Reply: I've updated the answer; if this works, please upvote it.
scala> val df1 = df.withColumn("COL_3", regexp_replace(col("COL_3"), "[\\] \" \\[]",""))
                   .withColumn("COL_3", explode(split(col("COL_3"), ","))) 

scala> df1.show(false)
+-----+-----+-------------+
|COL_1|COL_2|COL_3        |
+-----+-----+-------------+
|TEXT1|TEXT2|www.a.com/abc|
|TEXT1|TEXT2|www.b.com/dgh|
|TEXT1|TEXT2|www.c.com/axy|
|TEXT1|TEXT2|www.a.com/xyz|
|TEXT3|TEXT4|www.a.com/abc|
|TEXT3|TEXT4|www.d.com/dgh|
|TEXT3|TEXT4|www.a.com/axy|
|TEXT3|TEXT4|www.f.com/xyz|
|TEXT3|TEXT4|www.f.com/xyz|
|TEXT3|TEXT4|www.a.com/xyz|
|TEXT5|TEXT6|www.v.com/abc|
|TEXT5|TEXT6|www.c.com/axy|
+-----+-----+-------------+
scala> val df2 = df1.filter(col("COL_3").like("%a.com%"))

scala> df2.show
+-----+-----+-------------+
|COL_1|COL_2|        COL_3|
+-----+-----+-------------+
|TEXT1|TEXT2|www.a.com/abc|
|TEXT1|TEXT2|www.a.com/xyz|
|TEXT3|TEXT4|www.a.com/abc|
|TEXT3|TEXT4|www.a.com/axy|
|TEXT3|TEXT4|www.a.com/xyz|
+-----+-----+-------------+
Finally, extracting the path with split(col("COL_3"), "/")(1) and dropping COL_3 gives the desired output shown above.