Apache Spark: randomly shuffling a column in a Spark RDD or DataFrame


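Is there any way I can shuffle a column of an RDD or DataFrame so that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task.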

While you cannot shuffle a single column directly, it is possible to permute the records in an RDD via `RandomRDDs`.

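A potential approach to permuting only a single column might be:

- Use `mapPartitions` to do some setup/teardown on each worker task.
- Pull all of the records into memory, i.e. `iterator.toList`. Make sure you have many (small) partitions of data to avoid an OOME.
- Using the `Row` object, write everything back out as in the original, except for the given column.
- Within `mapPartitions`, create an in-memory sorted list.
- For the desired column, put its values into a separate collection and randomly sample that collection to replace each record's entry.
- Return the resulting records from `mapPartitions`.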
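The per-partition steps above can be sketched in Python. This is only an illustration under stated assumptions: records are represented as plain dicts (for a DataFrame, something like `df.rdd.map(lambda r: r.asDict())` would produce them), the function name `permute_column_in_partition` is hypothetical, and the column is permuted with a shuffle rather than sampled. The function takes and returns an iterator, so it has the shape that `mapPartitions` expects.

```python
import random


def permute_column_in_partition(rows, column, rng=random):
    """Shuffle the values of one column among the records of one partition.

    `rows` is an iterator of dicts (an assumed record representation).
    All other columns keep their original values and order.
    """
    records = list(rows)                    # materialize the partition in memory
    values = [r[column] for r in records]   # set the target column's values aside
    rng.shuffle(values)                     # permute them in place
    # Re-emit every record unchanged except for the chosen column.
    return iter({**r, column: v} for r, v in zip(records, values))


# Stand-alone check on one simulated partition (no Spark needed).
partition = [
    {"name": "Max", "salary": 2001.21},
    {"name": "Zhang", "salary": 3111.32},
    {"name": "Bob", "salary": 1919.21},
]
shuffled = list(
    permute_column_in_partition(iter(partition), "salary", random.Random(0))
)

# With Spark this might be applied as (hypothetical usage):
#   rdd.mapPartitions(lambda it: permute_column_in_partition(it, "salary"))
```

Because each partition is permuted independently, values never move between partitions, and every partition must fit in memory on its worker.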

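You can add an extra, randomly generated column and then sort the records by that column. This way, you are randomly shuffling your designated column.

This way you do not need to hold all of the data in memory, which can easily cause an OOM. Spark takes care of the sorting and the memory limits by spilling to disk if necessary.

If you don't want the extra column, you can drop it after sorting.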
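The add-a-random-column, sort, drop sequence can be illustrated without Spark. The sketch below uses plain Python tuples as records, and the name `shuffle_by_random_key` is hypothetical. In PySpark the same idea would presumably be written as `df.withColumn("_rand", F.rand()).orderBy("_rand").drop("_rand")`, where `_rand` is an assumed column name.

```python
import random


def shuffle_by_random_key(records, seed=None):
    """Reorder records by attaching a random sort key, sorting, and dropping it."""
    rng = random.Random(seed)
    keyed = [(rng.random(), rec) for rec in records]  # add the random "column"
    keyed.sort(key=lambda pair: pair[0])              # sort by the random key only
    return [rec for _, rec in keyed]                  # drop the key again


rows = [("Max", 2001.21), ("Zhang", 3111.32), ("Bob", 1919.21), ("Paul", 3001.5)]
shuffled_rows = shuffle_by_random_key(rows, seed=42)
```

Each record moves as a unit here, so the name and salary of a given row stay together; only the order of the rows changes.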

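If you don't need a global shuffle across your data, you can shuffle within partitions using the `mapPartitions` method:

import scala.util.Random
rdd.mapPartitions(Random.shuffle(_))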

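For a `PairRDD` (an RDD of type `RDD[(K, V)]`), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value), the same can be done with `mapPartitions`.

The boolean flag passed as the last argument to `mapPartitions` indicates that the operation preserves partitioning (the keys are unchanged), so that downstream operations such as `reduceByKey` can be optimized (they avoid a shuffle).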
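A minimal sketch of such a per-partition value shuffle, written as a plain Python function in the shape PySpark's `mapPartitions` expects. The function name `shuffle_values_in_partition` is hypothetical, and this is an illustration rather than the code the answer originally showed.

```python
import random


def shuffle_values_in_partition(pairs, rng=random):
    """Keep the keys in place and reassign the values randomly within one partition."""
    items = list(pairs)              # materialize the partition
    if not items:
        return iter([])              # empty partition: nothing to shuffle
    keys, values = zip(*items)       # split into parallel key and value sequences
    values = list(values)
    rng.shuffle(values)              # permute only the values
    return iter(zip(keys, values))   # pair each original key with a shuffled value


# Stand-alone check on one simulated partition (no Spark needed).
partition = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
result = list(shuffle_values_in_partition(iter(partition), random.Random(1)))

# With PySpark this might be applied as (hypothetical usage):
#   pair_rdd.mapPartitions(shuffle_values_in_partition, preservesPartitioning=True)
```

Since the keys come out in the same order and on the same partition as they went in, passing `preservesPartitioning=True` is consistent with what the function does.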

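How about selecting the column to shuffle, ordering it with `orderBy(rand)`, and joining it back to the existing DataFrame by index?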

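import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumes an active SparkSession named `spark`, as in spark-shell
def addIndex(df: DataFrame) = spark.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)

case class Entry(name: String, salary: Double)

val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)

val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
// Shuffle only the salary column, then index the shuffled rows
val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand()))

// Join the shuffled column back onto the original rows by index
df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)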
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max  |2001.21|3001.5         |
|Zhang|3111.32|3111.32        |
|Paul |3001.5 |2001.21        |
|Bob  |1919.21|1919.21        |
+-----+-------+---------------+

If anyone is looking for a PySpark equivalent of Sascha Vetter's post, you can find it below:

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql.types import LongType, StructField, StructType

# assumes an active SparkSession `spark` and SparkContext `sc`, as in the pyspark shell

def add_index_to_row(row, index):
  # copy the row's fields and append its positional index
  row_dict = row.asDict()
  row_dict["index"] = index
  return Row(**row_dict)

def add_index_to_df(df):
  # zipWithIndex yields (row, index) pairs with consecutive indices across partitions
  df_with_index = df.rdd.zipWithIndex().map(lambda x: add_index_to_row(x[0], x[1]))
  # zipWithIndex produces 64-bit indices, hence LongType
  new_schema = StructType(df.schema.fields + [StructField("index", LongType(), True)])
  return spark.createDataFrame(df_with_index, new_schema)

def shuffle_single_column(df, column_name):
  df_cols = df.columns
  # select the desired column and shuffle it (i.e. order it by column with random numbers)
  shuffled_col = df.select(column_name).orderBy(F.rand())
  # add explicit index to the shuffled column
  shuffled_col_index = add_index_to_df(shuffled_col)
  # add explicit index to the original dataframe
  df_index = add_index_to_df(df)
  # drop the desired column from df, join it with the shuffled column on created index and finally drop the index column
  df_shuffled = df_index.drop(column_name).join(shuffled_col_index, "index").drop("index")
  # reorder columns so that the shuffled column comes back to its initial position instead of the last position
  df_shuffled = df_shuffled.select(df_cols)
  return df_shuffled

# initialize random array
z = np.random.randint(20, size=(10, 3)).tolist()
# create the pyspark dataframe
example_df = sc.parallelize(z).toDF(("a","b","c"))
# shuffle one column of the dataframe
example_df_shuffled = shuffle_single_column(df = example_df, column_name = "a")

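What do you mean by shuffling a column?

I want the entries in the column to appear in random order.

I just want to point this out for anyone who might make the same mistake I did: you cannot use `monotonically_increasing_id` here instead of the custom `addIndex`, because its IDs are assigned per partition, so the join will reduce your dataset. :)

This will shuffle the whole DataFrame. The aim is to shuffle one column and leave the rest in order.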
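To see why per-partition IDs can lose rows in the join, here is a small simulation in plain Python. It relies on the documented layout of `monotonically_increasing_id` (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The helper name `simulated_ids` and the partition sizes are made up for illustration.

```python
def simulated_ids(partition_sizes):
    """IDs laid out like monotonically_increasing_id: (partition << 33) + offset."""
    return [
        (partition << 33) + offset
        for partition, size in enumerate(partition_sizes)
        for offset in range(size)
    ]


# The same 4 rows, but the two DataFrames happen to be partitioned differently.
original_ids = simulated_ids([2, 2])  # assumed layout of the source DataFrame
shuffled_ids = simulated_ids([3, 1])  # assumed layout after the orderBy

# An inner join on the ID keeps only the IDs present on both sides.
matching = set(original_ids) & set(shuffled_ids)
print(len(matching), "of", len(original_ids), "rows survive the join")
```

With these layouts only three of the four IDs exist on both sides, so one row disappears. `zipWithIndex` avoids this because its indices are consecutive regardless of how the data is partitioned.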
