Apache Spark DataFrame zipWithIndex


I am trying to solve the age-old problem of adding a sequence number to a dataset. I am working with DataFrames, and there appears to be no DataFrame equivalent of RDD.zipWithIndex. On the other hand, the following works more or less the way I want it to:

val origDF = sqlContext.load(...)    

val seqDF= sqlContext.createDataFrame(
    origDF.rdd.zipWithIndex.map(ln => Row.fromSeq(Seq(ln._2) ++ ln._1.toSeq)),
    StructType(Array(StructField("seq", LongType, false)) ++ origDF.schema.fields)
)
In my actual application, origDF won't be loaded directly from a file. It will be created by joining 2-3 other DataFrames together and will contain upwards of 100 million rows.

Is there a better way to do this? What can I do to optimize it?

The following is posted on behalf of David Griffin (edited out of the question).

The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in front or in back:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.Row


def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq())
          ++ ln._1.toSeq ++
        (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    StructType(
      (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]()) 
        ++ df.schema.fields ++ 
      (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
    )
  ) 
}
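
For reference, a minimal usage sketch of the method above (the DataFrame and column names are just illustrative, reusing the question's origDF):

    // Hypothetical call: prepend a "seq" index column starting at 0
    val withSeq = dfZipWithIndex(origDF, offset = 0, colName = "seq", inFront = true)
    withSeq.show(5)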

As of Spark 1.5, Window expressions were added to Spark. Instead of having to convert the DataFrame to an RDD, you can now use the row_number window function (org.apache.spark.sql.functions.row_number over an org.apache.spark.sql.expressions.Window). Note that I found the dfZipWithIndex algorithm above to perform significantly faster than the algorithm below. But I am posting it because:

  • Someone else may be tempted to try this
  • Maybe someone can optimize the expression below
  • At any rate, here is what works for me:

    import org.apache.spark.sql.expressions._
    import org.apache.spark.sql.functions.{lit, row_number}
    
    df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
    
Note that I use lit(1) for both the partitioning and the ordering. This makes everything end up in a single partition, and it seems to preserve the original ordering of the DataFrame, but I suppose it is also what slows it way down.

I tested it on a 4-column DataFrame with 7,000,000 rows, and the speed difference between it and the dfZipWithIndex above is significant (like I said, the RDD functions are much, much faster).

PySpark version:

    from pyspark.sql.types import LongType, StructField, StructType
    
    def dfZipWithIndex (df, offset=1, colName="rowId"):
        '''
            Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe
            and preserves a schema
    
            :param df: source dataframe
            :param offset: adjustment to zipWithIndex()'s index
            :param colName: name of the index column
        '''
    
        new_schema = StructType(
                        [StructField(colName,LongType(),True)]        # new added field in front
                        + df.schema.fields                            # previous schema
                    )
    
        zipped_rdd = df.rdd.zipWithIndex()
    
        new_rdd = zipped_rdd.map(lambda (row,rowId): ([rowId +offset] + list(row)))
    
        return spark.createDataFrame(new_rdd, new_schema)
    

I also created a JIRA to add this functionality to Spark natively.

Since Spark 1.6 there is a function called monotonically_increasing_id(). It generates a new column with a unique 64-bit monotonic index for each row. But the index is not consecutive: each partition starts a new range, so we must calculate the offset of each partition before using it. In order to provide an "RDD-free" solution, I ended up with some collect(), but it only collects the offsets, one value per partition, so it does not cause an OOM.
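
To make that concrete, here is a small illustrative sketch (not part of the original answer; the local[2] session and the row count are assumptions) showing that monotonically_increasing_id() is unique but not consecutive across partitions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{monotonically_increasing_id, spark_partition_id}

    // Minimal sketch, assuming a throwaway local session just for the demonstration
    val demoSpark = SparkSession.builder.master("local[2]").appName("mono-id-demo").getOrCreate()

    demoSpark.range(0, 6)
      .repartition(2)
      .withColumn("partition_id", spark_partition_id())
      .withColumn("inc_id", monotonically_increasing_id())
      .show()
    // Each partition starts its own id range (the partition id is encoded in the upper
    // bits of the 64-bit value), so the ids are unique but leave large gaps between partitions.

The function below closes those gaps by adding a per-partition offset to "inc_id":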

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.LongType

    def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
        val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", monotonically_increasing_id())
    
        val partitionOffsets = dfWithPartitionId
            .groupBy("partition_id")
            .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
            .orderBy("partition_id")
            .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt" )
            .collect()
            .map(_.getLong(0))
            .toArray
            
         dfWithPartitionId
            .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
            .withColumn(indexName, col("partition_offset") + col("inc_id"))
            .drop("partition_id", "partition_offset", "inc_id")
    }
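
A possible call site for the function above (the DataFrame name df is just an example):

    // Hypothetical usage: append a gap-free "index" column starting at 1
    val indexed = zipWithIndex(df, offset = 1, indexName = "index")
    indexed.select("index").show(5)
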
This solution doesn't repack the original rows and doesn't repartition the original huge DataFrame, so it is quite fast in the real world: 200 GB of CSV data (43 million rows with 150 columns) is read, indexed and packed to parquet within 2 minutes on 240 cores.
After testing my solution, I ran this one and it was slower by 20 seconds.
You may or may not want to use dfWithPartitionId.cache(), depending on the task.
@Evgeny, this is interesting. Note that there is a bug when you have empty partitions (the array is missing the indexes of those partitions, at least in Spark 1.6), so I converted the array into a Map (partitionId -> offset).

Also, I stripped down the source of monotonically_increasing_id so that "inc_id" starts from 0 in each partition.

Here is the updated version:

    import org.apache.spark.sql.catalyst.expressions.LeafExpression
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.LongType
    import org.apache.spark.sql.catalyst.expressions.Nondeterministic
    import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode
    import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext
    import org.apache.spark.sql.types.DataType
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.expressions.Window
    
    case class PartitionMonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {
    
      /**
       * From org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
       *
       * Record ID within each partition. By being transient, count's value is reset to 0 every time
       * we serialize and deserialize and initialize it.
       */
      @transient private[this] var count: Long = _
    
      override protected def initInternal(): Unit = {
        count = 1L // notice this starts at 1, not 0 as in org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
      }
    
      override def nullable: Boolean = false
    
      override def dataType: DataType = LongType
    
      override protected def evalInternal(input: InternalRow): Long = {
        val currentCount = count
        count += 1
        currentCount
      }
    
      override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
        val countTerm = ctx.freshName("count")
        ctx.addMutableState(ctx.JAVA_LONG, countTerm, s"$countTerm = 1L;")
        ev.isNull = "false"
        s"""
          final ${ctx.javaType(dataType)} ${ev.value} = $countTerm;
          $countTerm++;
        """
      }
    }
    
    object DataframeUtils {
      def zipWithIndex(df: DataFrame, offset: Long = 0, indexName: String = "index") = {
        // from https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex)
        val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", new Column(PartitionMonotonicallyIncreasingID()))
    
        // collect each partition size, create the offset pages
        val partitionOffsets: Map[Int, Long] = dfWithPartitionId
          .groupBy("partition_id")
          .agg(max("inc_id") as "cnt") // in each partition, count(inc_id) is equal to max(inc_id) (I don't know which one would be faster)
          .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") + lit(offset) as "cnt")
          .collect()
          .map(r => (r.getInt(0) -> r.getLong(1)))
          .toMap
    
        def partition_offset(partitionId: Int): Long = partitionOffsets(partitionId)
        val partition_offset_udf = udf(partition_offset _)
        // and re-number the index
        dfWithPartitionId
          .withColumn("partition_offset", partition_offset_udf(col("partition_id")))
          .withColumn(indexName, col("partition_offset") + col("inc_id"))
          .drop("partition_id")
          .drop("partition_offset")
          .drop("inc_id")
      }
    }
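
A brief usage sketch for this updated version (again, the DataFrame name is illustrative):

    // Hypothetical call to the helper object defined above
    val indexedDf = DataframeUtils.zipWithIndex(df, offset = 0, indexName = "row_idx")
    indexedDf.printSchema()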
    

Spark Java API version:

I have implemented @Evgeny's solution for performing zipWithIndex on DataFrames in Java and want to share the code.

It also contains the improvements @fylb offered in his solution. For Spark 2.4, I can confirm that execution fails when the entries returned by spark_partition_id() do not start at 0 or do not increase sequentially. Since that function is non-deterministic, one of those cases is quite likely to occur; one example is triggered by increasing the partition count.

The Java implementation looks as follows:

    public static Dataset<Row> zipWithIndex(Dataset<Row> df, Long offset, String indexName) {
            Dataset<Row> dfWithPartitionId = df
                    .withColumn("partition_id", spark_partition_id())
                    .withColumn("inc_id", monotonically_increasing_id());
    
            Object partitionOffsetsObject = dfWithPartitionId
                    .groupBy("partition_id")
                    .agg(count(lit(1)).alias("cnt"), first("inc_id").alias("inc_id"))
                    .orderBy("partition_id")
                    .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")).minus(col("cnt")).minus(col("inc_id")).plus(lit(offset).alias("cnt")))
                    .collect();
            Row[] partitionOffsetsArray = ((Row[]) partitionOffsetsObject);
            Map<Integer, Long> partitionOffsets = new HashMap<>();
            for (int i = 0; i < partitionOffsetsArray.length; i++) {
                partitionOffsets.put(partitionOffsetsArray[i].getInt(0), partitionOffsetsArray[i].getLong(1));
            }
    
            // The UDF1 cast disambiguates the overloaded functions.udf(...) for the Java compiler
            UserDefinedFunction getPartitionOffset = udf(
                    (UDF1<Integer, Long>) partitionId -> partitionOffsets.get(partitionId), DataTypes.LongType
            );
    
            return dfWithPartitionId
                    .withColumn("partition_offset", getPartitionOffset.apply(col("partition_id")))
                    .withColumn(indexName, col("partition_offset").plus(col("inc_id")))
                    .drop("partition_id", "partition_offset", "inc_id");
        }
    
I have adjusted @Tagar's version to run on Python 3.7 and would like to share it:

    from pyspark.sql.types import LongType, StructField, StructType

    def dfZipWithIndex (df, offset=1, colName="rowId"):
        '''
            Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe
            and preserves a schema

            :param df: source dataframe
            :param offset: adjustment to zipWithIndex()'s index
            :param colName: name of the index column
        '''

        new_schema = StructType(
                        [StructField(colName,LongType(),True)]        # new added field in front
                        + df.schema.fields                            # previous schema
                    )

        zipped_rdd = df.rdd.zipWithIndex()

        new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))      # use this for python 3+, tuple gets passed as single argument so using args and [] notation to read elements within args
        return spark.createDataFrame(new_rdd, new_schema)
    

Here is my proposal, the advantages of which are:

  • It does not involve any serialization/deserialization of our DataFrame's internal rows
  • Its logic is minimal and relies only on RDD.zipWithIndex

Its main downsides are:

  • It cannot be used directly from the non-JVM APIs (PySpark, SparkR)
  • It has to live under the org.apache.spark.sql package

Imports:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spar