PySpark sampleBy - get at least one sample per group
I am doing stratified sampling in PySpark. The goal is to obtain a validation sample that represents all groups equally, so I can manually validate the results. The problem is that the number of records per group varies widely (some groups have hundreds of records, while others have only a few, or even just one). With a low sampling fraction, groups with few records may have no cases selected at all. Even though it skews the representation percentage for those particular groups (which would end up much higher than for groups with more cases), I care more about having every group represented than about keeping a consistent representation percentage. Is there a way to force the sampler to randomly pick at least one sample from each group? Below is some code, and results that illustrate my problem:
# Calculate the percentage of samples to be taken for the validation
representative_sample = 383 # Sample size for a confidence interval of 95%
population = df.count()
fraction = representative_sample / population
# generate the dictionary with the groups and the fraction of samples to take - same for all groups
fractions = df.select('Group').distinct().rdd.map(lambda x: (x[0], fraction)).collectAsMap()
# Get the samples
sample = df.sampleBy('Group', fractions)
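As far as I understand it, `sampleBy` performs independent Bernoulli sampling: each row is kept with probability equal to its group's fraction, so a group with only a handful of rows is often missed entirely. A small pure-Python simulation of that behavior (the group sizes and fraction are made-up toy values mirroring the question):

```python
import random

random.seed(0)

# Toy population with very uneven group sizes, like in the question.
groups = {'big': 600, 'mid': 50, 'tiny': 4}
fraction = 0.04  # roughly 383 / 9883

# Bernoulli sampling, as sampleBy does: each row is kept independently
# with probability `fraction`. A 4-row group has only about a 15% chance
# of contributing even one row (1 - 0.96**4).
sample_counts = {g: sum(random.random() < fraction for _ in range(n))
                 for g, n in groups.items()}
print(sample_counts)
```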
###################################################
# Check results
print(f'Representative sample: {representative_sample}\tPopulation: {population}\tFraction: {fraction}')
print(f'Number of samples: {sample.count()}')
print(f'Number of distinct groups in the dataframe: {df.select("Group").distinct().count()}')
print(f'Number of distinct groups in the sample: {sample.select("Group").distinct().count()}')
sample_byEvent = sample.groupBy('Group').count().withColumnRenamed('count', 'SampledCases')
sample_byEvent = sample_byEvent.withColumn('sampledCasesOverSample', sample_byEvent['SampledCases']/representative_sample)
sample_byEvent = sample_byEvent.join(
    df.groupBy('Group').count().withColumnRenamed('count', 'TotalCases'),
    on='Group', how='outer')
sample_byEvent = sample_byEvent.withColumn('TotalCasesOverPop', sample_byEvent['TotalCases']/population)
sample_byEvent = sample_byEvent.withColumn('SampledCasesOverPop', sample_byEvent['SampledCases'] / sample_byEvent['TotalCases'])
##################################################
Here is the output of the preceding code:
Representative sample: 383 Population: 9883 Fraction: 0.038753414954973184
Number of samples: 378
Number of distinct groups in the dataframe: 23
Number of distinct groups in the sample: 20
+-------+------------+----------------------+----------+------------------+
|Group |SampledCases|sampledCasesOverSample|TotalCases|TotalCasesOverPop |
+-------+------------+----------------------+----------+------------------+
|Group1 |25 |0.0652 |611 |0.0618 |
|Group2 |2 |0.0052 |52 |0.0052 |
|Group3 |1 |0.0026 |4 |4.0473E-4 | <= What I would like to always happen
|Group4 |85 |0.2219 |2080 |0.2104 |
|Group5 |26 |0.0678 |632 |0.0639 |
|Group6 |5 |0.0130 |246 |0.0248 |
|Group7 |10 |0.0261 |184 |0.0186 |
|Group8 |null |null |1 |1.0118E-4 | <= problematic
|Group9 |null |null |22 |0.0022 | <= problematic
|Group10|null |null |25 |0.0025 | <= problematic
|Group11|1 |0.0026 |26 |0.0026 |
+-------+------------+----------------------+----------+------------------+
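To my knowledge `sampleBy` has no built-in option for a per-group minimum, but one workaround is to union its output with one randomly chosen row from each group that ended up empty. In PySpark this could be done with a Window partitioned by `Group` and ordered by `rand()`, keeping `row_number() == 1` for the missing groups, then `dropDuplicates`. A minimal pure-Python sketch of the idea, with hypothetical toy data:

```python
import random

random.seed(42)

# Toy data: (group, row_id) pairs with very uneven group sizes.
rows = [('A', i) for i in range(500)] + [('B', i) for i in range(20)] + [('C', 0)]
fraction = 0.04

# Step 1: plain per-row Bernoulli sampling, like sampleBy.
sample = [r for r in rows if random.random() < fraction]

# Step 2: for every group absent from the sample, add one random row
# from that group, so each group is represented at least once.
sampled_groups = {g for g, _ in sample}
for g in {g for g, _ in rows} - sampled_groups:
    sample.append(random.choice([r for r in rows if r[0] == g]))

print(sorted({g for g, _ in sample}))
```

The trade-off is exactly the one accepted in the question: a one-row group that gets its single row forced in is "over-represented" relative to the global fraction, but no group is left out.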