PySpark sampleBy - get at least one sample per group
I am doing stratified sampling in PySpark. The goal is to obtain a validation sample that represents all groups equally, so I can manually validate the results. The problem is that the number of records per group varies widely (some groups have hundreds of records, while others have only a few, or even just one). With a low sampling fraction, groups with few records may have no cases selected at all. Even though it skews the representation percentage for those particular groups (which would end up much higher than for groups with more cases), I care more about having every group represented than about keeping a consistent representation percentage. Is there a way to force the sampler to randomly pick at least one sample from each group? Below is some code, and results that illustrate my problem:
# Calculate the percentage of samples to be taken for the validation
representative_sample = 383 # Sample size for a confidence interval of 95%
population = df.count()
fraction = representative_sample / population
# generate the dictionary with the groups and the fraction of samples to take - same for all groups
fractions = df.select('Group').distinct().rdd.map(lambda x: (x[0], fraction)).collectAsMap()
# Get the samples
sample = df.sampleBy('Group', fractions)
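As far as I understand it, `sampleBy` performs independent Bernoulli sampling: each row is kept with probability equal to its group's fraction, so a group with only a handful of rows is often missed entirely. A small pure-Python simulation of that behavior (the group sizes and fraction are made-up toy values mirroring the question):

```python
import random

random.seed(0)

# Toy population with very uneven group sizes, like in the question.
groups = {'big': 600, 'mid': 50, 'tiny': 4}
fraction = 0.04  # roughly 383 / 9883

# Bernoulli sampling, as sampleBy does: each row is kept independently
# with probability `fraction`. A 4-row group has only about a 15% chance
# of contributing even one row (1 - 0.96**4).
sample_counts = {g: sum(random.random() < fraction for _ in range(n))
                 for g, n in groups.items()}
print(sample_counts)
```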
###################################################
# Check results
print(f'Representative sample: {representative_sample}\tPopulation: {population}\tFraction: {fraction}')
print(f'Number of samples: {sample.count()}')
print(f'Number of distinct groups in the dataframe: {df.select("Group").distinct().count()}')
print(f'Number of distinct groups in the sample: {sample.select("Group").distinct().count()}')
sample_byEvent = sample.groupBy('Group').count().withColumnRenamed('count', 'SampledCases')
sample_byEvent = sample_byEvent.withColumn('sampledCasesOverSample', sample_byEvent['SampledCases']/representative_sample)
sample_byEvent = sample_byEvent.join(
    df.groupBy('Group').count().withColumnRenamed('count', 'TotalCases'),
    on='Group', how='outer')
sample_byEvent = sample_byEvent.withColumn('TotalCasesOverPop', sample_byEvent['TotalCases']/population)
sample_byEvent = sample_byEvent.withColumn('SampledCasesOverPop', sample_byEvent['SampledCases'] / sample_byEvent['TotalCases'])
##################################################
Here is the output of the preceding code:
Representative sample: 383 Population: 9883 Fraction: 0.038753414954973184
Number of samples: 378
Number of distinct groups in the dataframe: 23
Number of distinct groups in the sample: 20
+-------+------------+----------------------+----------+------------------+
|Group |SampledCases|sampledCasesOverSample|TotalCases|TotalCasesOverPop |
+-------+------------+----------------------+----------+------------------+
|Group1 |25 |0.0652 |611 |0.0618 |
|Group2 |2 |0.0052 |52 |0.0052 |
|Group3 |1 |0.0026 |4 |4.0473E-4 | <= What I would like to always happen
|Group4 |85 |0.2219 |2080 |0.2104 |
|Group5 |26 |0.0678 |632 |0.0639 |
|Group6 |5 |0.0130 |246 |0.0248 |
|Group7 |10 |0.0261 |184 |0.0186 |
|Group8 |null |null |1 |1.0118E-4 | <= problematic
|Group9 |null |null |22 |0.0022 | <= problematic
|Group10|null |null |25 |0.0025 | <= problematic
|Group11|1 |0.0026 |26 |0.0026 |
+-------+------------+----------------------+----------+------------------+
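To my knowledge `sampleBy` has no built-in option for a per-group minimum, but one workaround is to union its output with one randomly chosen row from each group that ended up empty. In PySpark this could be done with a Window partitioned by `Group` and ordered by `rand()`, keeping `row_number() == 1` for the missing groups, then `dropDuplicates`. A minimal pure-Python sketch of the idea, with hypothetical toy data:

```python
import random

random.seed(42)

# Toy data: (group, row_id) pairs with very uneven group sizes.
rows = [('A', i) for i in range(500)] + [('B', i) for i in range(20)] + [('C', 0)]
fraction = 0.04

# Step 1: plain per-row Bernoulli sampling, like sampleBy.
sample = [r for r in rows if random.random() < fraction]

# Step 2: for every group absent from the sample, add one random row
# from that group, so each group is represented at least once.
sampled_groups = {g for g, _ in sample}
for g in {g for g, _ in rows} - sampled_groups:
    sample.append(random.choice([r for r in rows if r[0] == g]))

print(sorted({g for g, _ in sample}))
```

The trade-off is exactly the one accepted in the question: a one-row group that gets its single row forced in is "over-represented" relative to the global fraction, but no group is left out.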