
Scala: computing the rank of rows

Tags: scala, apache-spark, dataframe, hive, apache-spark-sql

I want to rank user IDs based on one field. For rows with the same value of that field, the rank should be the same. The data is in a Hive table.

e.g.


How can this be achieved?

You can use the rank window function via the DataFrame API:

import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window

val w = Window.orderBy($"value")

val df = sc.parallelize(Seq(
  ("a", 5), ("b", 10), ("c", 5), ("d", 6)
)).toDF("user", "value")

df.select($"user", rank.over(w).alias("rank")).show

// +----+----+
// |user|rank|
// +----+----+
// |   a|   1|
// |   c|   1|
// |   d|   3|
// |   b|   4|
// +----+----+
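As an aside, rank leaves gaps after ties (1, 1, 3, 4 above). If consecutive ranks are wanted instead, dense_rank is a drop-in replacement; a minimal sketch reusing the same window w (assuming Spark 1.6+, where the function is exposed as dense_rank; on 1.4/1.5 it is denseRank):

import org.apache.spark.sql.functions.dense_rank

// ties still share a rank, but no gap follows: a=1, c=1, d=2, b=3
df.select($"user", dense_rank.over(w).alias("rank")).show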
Or raw SQL:

df.registerTempTable("df")
sqlContext.sql("SELECT user, RANK() OVER (ORDER BY value) AS rank FROM df").show

// +----+----+
// |user|rank|
// +----+----+
// |   a|   1|
// |   c|   1|
// |   d|   3|
// |   b|   4|
// +----+----+
but it is extremely inefficient: without a partitionBy clause the window pulls all rows into a single partition (see the comments at the end).

You can also try the RDD API, although it is not exactly straightforward. First, let's convert the DataFrame to an RDD:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.RangePartitioner

// key each row by the ranking field
val rdd: RDD[(Int, String)] = df.select($"value", $"user")
  .map{ case Row(value: Int, user: String) => (value, user) }

// range-partition and sort by key, so every partition holds a contiguous, sorted slice of values
val partitioner = new RangePartitioner(rdd.partitions.size, rdd)
val sorted = rdd.repartitionAndSortWithinPartitions(partitioner)
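To see what the range partitioning buys us before moving on, a quick inspection sketch (my addition, on the toy data; empty partitions may show up as blank lines):

// each inner array is one partition, already sorted and covering a contiguous range of values
sorted.glom.collect.foreach(arr => println(arr.mkString(", ")))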
Next, we have to compute the rank for each partition:

def rank(iter: Iterator[(Int,String)]) = {
  // seed: (rank, previous value, label, offset); rank starts at -1 so the first row gets rank 0
  val zero = List((-1L, Integer.MIN_VALUE, "", 1L))

  def f(acc: List[(Long, Int, String, Long)], x: (Int, String)) =
    (acc.head, x) match {
      case (
        (prevRank: Long, prevValue: Int, _, offset: Long),
        (currValue: Int, label: String)) => {
        // equal values keep the previous rank; a new value jumps ahead by the number of tied rows
        val newRank = if (prevValue == currValue) prevRank else prevRank + offset
        val newOffset = if (prevValue == currValue) offset + 1L else 1L
        (newRank, currValue, label, newOffset) :: acc
      }
    }

  // fold over the sorted partition, drop the seed, and keep only (rank, label)
  iter.foldLeft(zero)(f).reverse.drop(1).map{ case (rank, _, label, _) =>
    (rank, label) }.toIterator
}


val partRanks = sorted.mapPartitions(rank)
and the offsets for each partition:

// cumulative row counts: partition i maps to the total number of rows in partitions 0..i
def getOffsets(sorted: RDD[(Int, String)]) = sorted
  .mapPartitionsWithIndex((i: Int, iter: Iterator[(Int, String)]) =>
    Iterator((i, iter.size)))
  .collect
  .foldLeft(List((-1, 0)))((acc: List[(Int, Int)], x: (Int, Int)) =>
    (x._1, x._2 + acc.head._2) :: acc)
  .toMap

val offsets = sc.broadcast(getOffsets(sorted))
And the final ranks:

// shift each partition's local ranks by the number of rows in all earlier partitions
def adjust(i: Int, iter: Iterator[(Long, String)]) =
  iter.map{ case (rank, label) => (rank + offsets.value(i - 1).toLong, label) }

val ranks = partRanks
  .mapPartitionsWithIndex(adjust)
  .map{ case (i, label) => (1 + i, label) }  // local ranks are 0-based, so add 1
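To sanity-check the RDD version against the window-function output above, a minimal usage sketch on the same toy df (collect is fine here only because the data set is tiny):

ranks.collect.foreach(println)

// prints (rank, user) pairs, e.g.:
// (1,a)
// (1,c)
// (3,d)
// (4,b)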

I think this is a great answer, but could we explain in more detail why the DataFrame API is inefficient here?

@BlueSky, because without partitionBy the Window definition will drag everything into a single partition. With today's Dataset API you could rewrite the RDD version.

Adding to @zero323's answer: even with partitionBy it can be inefficient. For example, in some kinds of transactional data it is common for a few customers to hold the vast majority of transactions; I have come across bank data where a single customer effectively held 45% of all the bank's transactions, because the bank was a market maker and (in the data) was its own customer.
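One reading of the "rewrite with the Dataset API" remark above, as a rough sketch of my own (not from the answer): assuming Spark 2.3+ with a SparkSession named spark, repartitionByRange plays the role of the RangePartitioner, and the rank helper defined earlier is reused; the offset adjustment step would still be needed exactly as in the RDD version.

import spark.implicits._   // assumes a SparkSession named spark

// typed view of the same two columns
case class UserValue(user: String, value: Int)
val ds = df.as[UserValue]

// range-partition and sort by value, mirroring the RangePartitioner step
// (number of partitions left to Spark's defaults)
val sortedDs = ds
  .repartitionByRange($"value")
  .sortWithinPartitions($"value")

// per-partition local ranks, reusing the rank helper defined above
val partRanksDs = sortedDs.mapPartitions(iter => rank(iter.map(r => (r.value, r.user))))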