Apache Spark: stratified sampling in PySpark

apache-spark pyspark

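I have a Spark DataFrame with a column that contains many zeros and very few ones (only 0.01% of the values are ones).
I would like to take a random subsample, but a stratified one, so that it keeps the ratio of 1s to 0s in that column.
Is it possible to do this in PySpark?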

I am looking for a non-Scala solution, based on DataFrames rather than RDDs.

The solution I suggested in an earlier answer is straightforward to convert from Scala to Python (or even to Java). Nevertheless, I'll rewrite it in Python here. Let's first create a toy DataFrame:

    from pyspark.sql.functions import lit

    rows = [(2147481832, 23355149, 1), (2147481832, 973010692, 1),
            (2147481832, 2134870842, 1), (2147481832, 541023347, 1),
            (2147481832, 1682206630, 1), (2147481832, 1138211459, 1),
            (2147481832, 852202566, 1), (2147481832, 201375938, 1),
            (2147481832, 486538879, 1), (2147481832, 919187908, 1),
            (214748183, 919187908, 1), (214748183, 91187908, 1)]
    df = spark.createDataFrame(rows, ["x1", "x2", "x3"])
    df.show()
    # +----------+----------+---+
    # |        x1|        x2| x3|
    # +----------+----------+---+
    # |2147481832|  23355149|  1|
    # |2147481832| 973010692|  1|
    # |2147481832|2134870842|  1|
    # |2147481832| 541023347|  1|
    # |2147481832|1682206630|  1|
    # |2147481832|1138211459|  1|
    # |2147481832| 852202566|  1|
    # |2147481832| 201375938|  1|
    # |2147481832| 486538879|  1|
    # |2147481832| 919187908|  1|
    # | 214748183| 919187908|  1|
    # | 214748183|  91187908|  1|
    # +----------+----------+---+

As you can see, this DataFrame contains 12 elements:

    df.count()
    # 12

The distribution is as follows:

    df.groupBy("x1").count().show()
    # +----------+-----+
    # |        x1|count|
    # +----------+-----+
    # |2147481832|   10|
    # | 214748183|    2|
    # +----------+-----+

Now let's sample. First, we set the seed:

    seed = 12

Then we find the keys to sample by, assign each of them a fraction, and sample:

    fractions = df.select("x1").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
    print(fractions)
    # {2147481832: 0.8, 214748183: 0.8}
    sampled_df = df.stat.sampleBy("x1", fractions, seed)
    sampled_df.show()
    # +----------+---------+---+
    # |        x1|       x2| x3|
    # +----------+---------+---+
    # |2147481832| 23355149|  1|
    # |2147481832|973010692|  1|
    # |2147481832|541023347|  1|
    # |2147481832|852202566|  1|
    # |2147481832|201375938|  1|
    # |2147481832|486538879|  1|
    # |2147481832|919187908|  1|
    # | 214748183|919187908|  1|
    # | 214748183| 91187908|  1|
    # +----------+---------+---+

We can now check the content of the sample:

    sampled_df.count()
    # 9

    sampled_df.groupBy("x1").count().show()
    # +----------+-----+
    # |        x1|count|
    # +----------+-----+
    # |2147481832|    7|
    # | 214748183|    2|
    # +----------+-----+

Suppose you have the Titanic dataset in a DataFrame named "data" and you want to split it into a train set and a test set using stratified sampling based on the "Survived" target variable.

    # Check the initial distribution of 0s and 1s
    data.groupBy("Survived").count().show()
    # +--------+-----+
    # |Survived|count|
    # +--------+-----+
    # |       1|  342|
    # |       0|  549|
    # +--------+-----+

    # Take 70% of both the 0s and the 1s into the training set
    train = data.sampleBy("Survived", fractions={0: 0.7, 1: 0.7}, seed=10)

    # Subtract 'train' from the original 'data' to get the test set
    # (subtract() is a set difference, so duplicate rows are collapsed)
    test = data.subtract(train)

    # Check the distribution of 0s and 1s in the train and test sets after sampling
    train.groupBy("Survived").count().show()
    # +--------+-----+
    # |Survived|count|
    # +--------+-----+
    # |       1|  239|
    # |       0|  399|
    # +--------+-----+

    test.groupBy("Survived").count().show()
    # +--------+-----+
    # |Survived|count|
    # +--------+-----+
    # |       1|  103|
    # |       0|  150|
    # +--------+-----+

This can be accomplished quite easily with "randomSplit" and "union" in PySpark:

    # Read in the data
    df = spark.read.csv(file, header=True)

    # Split the DataFrame into the 0s and the 1s
    zeros = df.filter(df["Target"] == 0)
    ones = df.filter(df["Target"] == 1)

    # Split each class into training and testing parts
    train0, test0 = zeros.randomSplit([0.8, 0.2], seed=1234)
    train1, test1 = ones.randomSplit([0.8, 0.2], seed=1234)

    # Stack the parts back together
    train = train0.union(train1)
    test = test0.union(test1)

This builds on the answer by @eliasah and on another Stack Overflow answer (both are linked in the docstring). If you want to get back both a train set and a test set, you can use the following function:

    from pyspark.sql import functions as F

    def stratified_split_train_test(df, frac, label, join_on, seed=42):
        """Stratified split of a DataFrame into a train set and a test set.

        Inspired by:
        https://stackoverflow.com/a/47672336/1771155
        https://stackoverflow.com/a/39889263/1771155
        """
        # Use the same sampling fraction for every distinct label value
        fractions = df.select(label).distinct().withColumn("fraction", F.lit(frac)).rdd.collectAsMap()
        df_frac = df.stat.sampleBy(label, fractions, seed)
        # Rows whose key was not sampled form the remaining set
        df_remaining = df.join(df_frac, on=join_on, how="left_anti")
        return df_frac, df_remaining

To create a stratified train set and test set where 80% of the rows go to the train set:

    df_train, df_test = stratified_split_train_test(df=df, frac=0.8, label="y", join_on="unique_id")

@eliasah Is there a way to add 0.8 and 0.2 fractions? I would like to use 0.8 for the training set and the other 0.2 for the test set. I tried this approach to get the 0.8, but I am having difficulty getting the other 0.2 in Spark 1.6, where there is no subquery support, other than […] on the main DF and the sampled DF.

@EmmaNej You can always use […]

@eliasah Yes, but that takes a very long time, considering I have 20 million records and there is no unique key in the dataset.

@EmmaNej Then […]

@eliasah Unfortunately, Spark 1.6 does not support a left_anti join.

My dataset does not have a "unique_id" column. Is there a way to rewrite this function?
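
On the last question (splitting without a unique key), one idea that avoids the join entirely is to draw a single uniform random number per row and compare it with the fraction for that row's label: rows below the threshold go to train, the rest go to test. The sketch below shows the idea in plain Python rather than Spark so that it runs anywhere. The function name `stratified_bernoulli_split` and the toy data are hypothetical, and this is only a conceptual model of per-row sampling, not Spark's actual implementation. It also illustrates why the sample sizes in the answers above are approximate (7 of 10 rows rather than exactly 8).

```python
import random
from collections import Counter


def stratified_bernoulli_split(rows, label_of, fractions, seed=42):
    """Split rows into (train, test) using one uniform draw per row.

    A row goes to train when its draw is below the fraction configured for
    its label, otherwise it goes to test. No unique key is needed, because
    every row is assigned exactly once.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Assumption: labels missing from `fractions` get fraction 0.0,
        # so those rows always end up in the test set.
        fraction = fractions.get(label_of(row), 0.0)
        (train if rng.random() < fraction else test).append(row)
    return train, test


# Hypothetical toy data: 10,000 (id, label) rows, 1% of them labelled 1.
rows = [(i, 1 if i % 100 == 0 else 0) for i in range(10_000)]
train, test = stratified_bernoulli_split(rows, lambda r: r[1], {0: 0.8, 1: 0.8})

print(len(train) + len(test))                 # total rows across both sets
print(Counter(label for _, label in train))   # label counts in train
print(Counter(label for _, label in test))    # label counts in test
```

Because each row is an independent draw, the per-label counts only approximate the requested fractions, and the deviation is most visible for a rare label. In PySpark the same idea could presumably be expressed by adding a column from `pyspark.sql.functions.rand(seed)` and filtering on it twice, but that variant is untested here and would need the random column to be computed consistently for both filters (for example by persisting the intermediate DataFrame).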