Apache Spark DataFrame zipWithIndex


I am trying to solve the age-old problem of adding a sequence number to a dataset. I am working with DataFrames, and there appears to be no DataFrame equivalent of RDD.zipWithIndex. On the other hand, the following works more or less the way I want it to:

val origDF = sqlContext.load(...)    

val seqDF= sqlContext.createDataFrame(
    origDF.rdd.zipWithIndex.map(ln => Row.fromSeq(Seq(ln._2) ++ ln._1.toSeq)),
    StructType(Array(StructField("seq", LongType, false)) ++ origDF.schema.fields)
)
In my actual application, origDF won't be loaded directly from a file. It will be created by joining 2-3 other DataFrames together and will contain upwards of 100 million rows.

Is there a better way to do this? What can I do to optimize it?

The following is posted on behalf of David Griffin (edited out of the question).

The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in front or in back:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.Row


def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq())
          ++ ln._1.toSeq ++
        (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    StructType(
      (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]()) 
        ++ df.schema.fields ++ 
      (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
    )
  ) 
}
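
For reference, a minimal usage sketch of the method above (the DataFrame and column names are just illustrative, reusing the question's origDF):

    // Hypothetical call: prepend a "seq" index column starting at 0
    val withSeq = dfZipWithIndex(origDF, offset = 0, colName = "seq", inFront = true)
    withSeq.show(5)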

As of Spark 1.5, Window expressions were added to Spark. Instead of having to convert the DataFrame to an RDD, you can now use the row_number window function (org.apache.spark.sql.functions.row_number over an org.apache.spark.sql.expressions.Window). Note that I found the dfZipWithIndex algorithm above to perform significantly faster than the algorithm below. But I am posting it because:

  • Someone else may be tempted to try this
  • Maybe someone can optimize the expression below
  • At any rate, here is what works for me:

    import org.apache.spark.sql.expressions._
    import org.apache.spark.sql.functions.{lit, row_number}
    
    df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
    
Note that I use lit(1) for both the partitioning and the ordering. This makes everything end up in a single partition, and it seems to preserve the original ordering of the DataFrame, but I suppose it is also what slows it way down.

I tested it on a 4-column DataFrame with 7,000,000 rows, and the speed difference between it and the dfZipWithIndex above is significant (like I said, the RDD functions are much, much faster).

PySpark version:

    from pyspark.sql.types import LongType, StructField, StructType
    
    def dfZipWithIndex (df, offset=1, colName="rowId"):
        '''
            Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe
            and preserves a schema
    
            :param df: source dataframe
            :param offset: adjustment to zipWithIndex()'s index
            :param colName: name of the index column
        '''
    
        new_schema = StructType(
                        [StructField(colName,LongType(),True)]        # new added field in front
                        + df.schema.fields                            # previous schema
                    )
    
        zipped_rdd = df.rdd.zipWithIndex()
    
        new_rdd = zipped_rdd.map(lambda (row,rowId): ([rowId +offset] + list(row)))
    
        return spark.createDataFrame(new_rdd, new_schema)
    

I also created a JIRA to add this functionality to Spark natively.

Since Spark 1.6 there is a function called monotonically_increasing_id(). It generates a new column with a unique 64-bit monotonic index for each row. But the index is not consecutive: each partition starts a new range, so we must calculate the offset of each partition before using it. In order to provide an "RDD-free" solution, I ended up with some collect(), but it only collects the offsets, one value per partition, so it does not cause an OOM.
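
To make that concrete, here is a small illustrative sketch (not part of the original answer; the local[2] session and the row count are assumptions) showing that monotonically_increasing_id() is unique but not consecutive across partitions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{monotonically_increasing_id, spark_partition_id}

    // Minimal sketch, assuming a throwaway local session just for the demonstration
    val demoSpark = SparkSession.builder.master("local[2]").appName("mono-id-demo").getOrCreate()

    demoSpark.range(0, 6)
      .repartition(2)
      .withColumn("partition_id", spark_partition_id())
      .withColumn("inc_id", monotonically_increasing_id())
      .show()
    // Each partition starts its own id range (the partition id is encoded in the upper
    // bits of the 64-bit value), so the ids are unique but leave large gaps between partitions.

The function below closes those gaps by adding a per-partition offset to "inc_id":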

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.LongType

    def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
        val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", monotonically_increasing_id())
    
        val partitionOffsets = dfWithPartitionId
            .groupBy("partition_id")
            .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
            .orderBy("partition_id")
            .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt" )
            .collect()
            .map(_.getLong(0))
            .toArray
            
         dfWithPartitionId
            .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
            .withColumn(indexName, col("partition_offset") + col("inc_id"))
            .drop("partition_id", "partition_offset", "inc_id")
    }
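
A possible call site for the function above (the DataFrame name df is just an example):

    // Hypothetical usage: append a gap-free "index" column starting at 1
    val indexed = zipWithIndex(df, offset = 1, indexName = "index")
    indexed.select("index").show(5)
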
This solution doesn't repack the original rows and doesn't repartition the original huge DataFrame, so it is quite fast in the real world: 200 GB of CSV data (43 million rows with 150 columns) is read, indexed and packed to parquet within 2 minutes on 240 cores.
After testing my solution, I ran this one and it was slower by 20 seconds.
You may or may not want to use dfWithPartitionId.cache(), depending on the task.
@Evgeny, this is interesting. Note that there is a bug when you have empty partitions (the array is missing the indexes of those partitions, at least in Spark 1.6), so I converted the array into a Map (partitionId -> offset).

Also, I stripped down the source of monotonically_increasing_id so that "inc_id" starts from 0 in each partition.

Here is the updated version:

    import org.apache.spark.sql.catalyst.expressions.LeafExpression
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.LongType
    import org.apache.spark.sql.catalyst.expressions.Nondeterministic
    import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode
    import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext
    import org.apache.spark.sql.types.DataType
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.expressions.Window
    
    case class PartitionMonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {
    
      /**
       * From org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
       *
       * Record ID within each partition. By being transient, count's value is reset to 0 every time
       * we serialize and deserialize and initialize it.
       */
      @transient private[this] var count: Long = _
    
      override protected def initInternal(): Unit = {
        count = 1L // notice this starts at 1, not 0 as in org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
      }
    
      override def nullable: Boolean = false
    
      override def dataType: DataType = LongType
    
      override protected def evalInternal(input: InternalRow): Long = {
        val currentCount = count
        count += 1
        currentCount
      }
    
      override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
        val countTerm = ctx.freshName("count")
        ctx.addMutableState(ctx.JAVA_LONG, countTerm, s"$countTerm = 1L;")
        ev.isNull = "false"
        s"""
          final ${ctx.javaType(dataType)} ${ev.value} = $countTerm;
          $countTerm++;
        """
      }
    }
    
    object DataframeUtils {
      def zipWithIndex(df: DataFrame, offset: Long = 0, indexName: String = "index") = {
        // from https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex)
        val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", new Column(PartitionMonotonicallyIncreasingID()))
    
        // collect each partition size, create the offset pages
        val partitionOffsets: Map[Int, Long] = dfWithPartitionId
          .groupBy("partition_id")
          .agg(max("inc_id") as "cnt") // in each partition, count(inc_id) is equal to max(inc_id) (I don't know which one would be faster)
          .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") + lit(offset) as "cnt")
          .collect()
          .map(r => (r.getInt(0) -> r.getLong(1)))
          .toMap
    
        def partition_offset(partitionId: Int): Long = partitionOffsets(partitionId)
        val partition_offset_udf = udf(partition_offset _)
        // and re-number the index
        dfWithPartitionId
          .withColumn("partition_offset", partition_offset_udf(col("partition_id")))
          .withColumn(indexName, col("partition_offset") + col("inc_id"))
          .drop("partition_id")
          .drop("partition_offset")
          .drop("inc_id")
      }
    }
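
A brief usage sketch for this updated version (again, the DataFrame name is illustrative):

    // Hypothetical call to the helper object defined above
    val indexedDf = DataframeUtils.zipWithIndex(df, offset = 0, indexName = "row_idx")
    indexedDf.printSchema()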
    

Spark Java API version:

I have implemented @Evgeny's solution for performing zipWithIndex on DataFrames in Java and want to share the code.

It also contains the improvements @fylb offered in his solution. For Spark 2.4, I can confirm that execution fails when the entries returned by spark_partition_id() do not start at 0 or do not increase sequentially. Since that function is non-deterministic, one of those cases is quite likely to occur; one example is triggered by increasing the partition count.

The Java implementation looks as follows:

    public static Dataset<Row> zipWithIndex(Dataset<Row> df, Long offset, String indexName) {
            Dataset<Row> dfWithPartitionId = df
                    .withColumn("partition_id", spark_partition_id())
                    .withColumn("inc_id", monotonically_increasing_id());
    
            Object partitionOffsetsObject = dfWithPartitionId
                    .groupBy("partition_id")
                    .agg(count(lit(1)).alias("cnt"), first("inc_id").alias("inc_id"))
                    .orderBy("partition_id")
                    .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")).minus(col("cnt")).minus(col("inc_id")).plus(lit(offset).alias("cnt")))
                    .collect();
            Row[] partitionOffsetsArray = ((Row[]) partitionOffsetsObject);
            Map<Integer, Long> partitionOffsets = new HashMap<>();
            for (int i = 0; i < partitionOffsetsArray.length; i++) {
                partitionOffsets.put(partitionOffsetsArray[i].getInt(0), partitionOffsetsArray[i].getLong(1));
            }
    
            // The UDF1 cast disambiguates the overloaded functions.udf(...) for the Java compiler
            UserDefinedFunction getPartitionOffset = udf(
                    (UDF1<Integer, Long>) partitionId -> partitionOffsets.get(partitionId), DataTypes.LongType
            );
    
            return dfWithPartitionId
                    .withColumn("partition_offset", getPartitionOffset.apply(col("partition_id")))
                    .withColumn(indexName, col("partition_offset").plus(col("inc_id")))
                    .drop("partition_id", "partition_offset", "inc_id");
        }
    
I have adjusted @Tagar's version to run on Python 3.7 and would like to share it:

    from pyspark.sql.types import LongType, StructField, StructType

    def dfZipWithIndex (df, offset=1, colName="rowId"):
        '''
            Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe
            and preserves a schema

            :param df: source dataframe
            :param offset: adjustment to zipWithIndex()'s index
            :param colName: name of the index column
        '''

        new_schema = StructType(
                        [StructField(colName,LongType(),True)]        # new added field in front
                        + df.schema.fields                            # previous schema
                    )

        zipped_rdd = df.rdd.zipWithIndex()

        new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))      # use this for python 3+, tuple gets passed as single argument so using args and [] notation to read elements within args
        return spark.createDataFrame(new_rdd, new_schema)
    

Here is my proposal, the advantages of which are:

  • It does not involve any serialization/deserialization of our DataFrame's internal rows
  • Its logic is minimal and relies only on RDD.zipWithIndex

Its main downsides are:

  • It cannot be used directly from the non-JVM APIs (PySpark, SparkR)
  • It has to live under the org.apache.spark.sql package

Imports:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spar