PySpark sampleBy - get at least one sample from every group


I am doing stratified sampling in PySpark. The goal is to get a validation sample with equal representation from all groups, so I can manually verify the results.

The problem I'm running into is that the number of records per group varies widely (some groups may have hundreds of records, while others may have only a few, or even just one). With a low sampling fraction, groups with few records may end up with no cases selected at all. Even though it would skew the representation percentage for those particular groups (which would end up much higher than for the larger groups), I value having every group represented over keeping a consistent representation percentage.

Is there a way to force the sampler to randomly select at least one sample from each group?

Here is some code, with its results, that illustrates my problem:

# Calculate the percentage of samples to be taken for the validation
representative_sample = 383     # Sample size for a confidence interval of 95%
population = df.count()
fraction = representative_sample / population

# generate the dictionary with the groups and the fraction of samples to take - same for all groups
fractions = df.select('Group').distinct().rdd.map(lambda x: (x[0], fraction)).collectAsMap() 

# Get the samples
sample = df.sampleBy('Group', fractions)

###################################################
# Check results
print(f'Representative sample: {representative_sample}\tPopulation: {population}\tFraction: {fraction}')
print(f'Number of samples: {sample.count()}')
print(f'Number of distinct groups in the dataframe: {df.select("Group").distinct().count()}')
print(f'Number of distinct groups in the sample: {sample.select("Group").distinct().count()}')

sample_byEvent = sample.groupBy('Group').count().withColumnRenamed('count', 'SampledCases')
sample_byEvent = sample_byEvent.withColumn('sampledCasesOverSample', sample_byEvent['SampledCases']/representative_sample)
sample_byEvent = sample_byEvent.join(df.groupBy('Group').count().withColumnRenamed('count', 'TotalCases').select('Group', 'TotalCases'),
                                     on='Group', how='outer')
sample_byEvent = sample_byEvent.withColumn('TotalCasesOverPop', sample_byEvent['TotalCases']/population)
sample_byEvent = sample_byEvent.withColumn('SampledCasesOverPop', sample_byEvent['SampledCases'] / sample_byEvent['TotalCases'])
##################################################
Here is the output of the previous code:

Representative sample: 383  Population: 9883    Fraction: 0.038753414954973184
Number of samples: 378
Number of distinct groups in the dataframe: 23    
Number of distinct groups in the sample: 20
+-------+------------+----------------------+----------+------------------+
|Group  |SampledCases|sampledCasesOverSample|TotalCases|TotalCasesOverPop |
+-------+------------+----------------------+----------+------------------+
|Group1 |25          |0.0652                |611       |0.0618            |
|Group2 |2           |0.0052                |52        |0.0052            |
|Group3 |1           |0.0026                |4         |4.0473E-4         | <= What I would like to always happen
|Group4 |85          |0.2219                |2080      |0.2104            |
|Group5 |26          |0.0678                |632       |0.0639            |
|Group6 |5           |0.0130                |246       |0.0248            |
|Group7 |10          |0.0261                |184       |0.0186            |
|Group8 |null        |null                  |1         |1.0118E-4         | <= problematic
|Group9 |null        |null                  |22        |0.0022            | <= problematic
|Group10|null        |null                  |25        |0.0025            | <= problematic
|Group11|1           |0.0026                |26        |0.0026            |
+-------+------------+----------------------+----------+------------------+