
Python pandas: stratified sampling by interval

python pandas dataframe


I have a dataframe of movie ratings by user. The .describe() method gives the following information:

        userId          ratings_per_user
count   137658.000000   137658.000000
mean    69247.463068    65.745514
std     39977.471244    67.071719
min     1.000000        1.000000
25%     34628.250000    22.000000
50%     69249.500000    41.000000
75%     103868.750000   84.000000
max     138493.000000   462.000000

Now I want to sample X users from each quartile:

X users with number of votes between min and 25%
X users with number of votes between 25% and 50%
X users with number of votes between 50% and 75%
X users with number of votes between 75% and max

The end result would be a list of userIds of size 4X.

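What I have so far is clumsy code that splits the dataframe into four separate dataframes based on the quartiles, samples X users from each, and then concatenates the results. If possible, I would like a simpler (and faster) solution.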
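One possible way to avoid the manual four-way split is to let pd.qcut label each user with a quartile and then sample per group. The sketch below is not the poster's code: it uses synthetic data as a stand-in for the real per-user counts, the names `quartile`, `sampled` and `user_ids` are made up for illustration, and it assumes pandas 1.1 or later for `DataFrameGroupBy.sample`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-user rating counts (hypothetical data).
rng = np.random.default_rng(0)
ratings_user = pd.DataFrame({
    "userId": np.arange(1, 1001),
    "ratings_per_user": rng.integers(1, 463, size=1000),
})

X = 5  # number of users to draw from each quartile

# Label every user with the quartile of ratings_per_user (0 = lowest).
# duplicates="drop" merges bins if heavy ties produce repeated edges.
ratings_user["quartile"] = pd.qcut(
    ratings_user["ratings_per_user"], q=4, labels=False, duplicates="drop"
)

# Draw X users from each quartile in a single groupby call.
sampled = ratings_user.groupby("quartile").sample(n=X, random_state=0)

# Flat list of sampled user ids, 4 * X entries when all four bins exist.
user_ids = sampled["userId"].tolist()
```

Because the bin edges come from the data itself, nothing has to be copied by hand from the .describe() output. If a quartile holds fewer than X users, `sample(n=X)` raises an error, so small groups would need `replace=True` or a smaller X.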

Edit:

A better solution:

import numpy as np
import pandas as pd

# Total sample size desired
N = 100

# Number of ratings per user
ratings_user = ratings.groupby('userId').size().reset_index(name='ratings_per_user')

# Strata from the quartiles reported by describe(); note that exactly 22 falls in 'D'
ratings_user['categ'] = np.where(ratings_user['ratings_per_user'] >= 84.0, 'A',
                        np.where(ratings_user['ratings_per_user'] >= 41.0, 'B',
                        np.where(ratings_user['ratings_per_user'] > 22.0, 'C', 'D')))

# Sample each stratum in proportion to its size, then shuffle the rows
ratings = (ratings_user.groupby('categ', group_keys=False)
           .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(ratings_user)))))
           .sample(frac=1).reset_index(drop=True))
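Note that the code above allocates the N draws in proportion to each stratum's size rather than drawing a fixed X per quartile, and that rounding each stratum independently does not always add up to exactly N. The following sketch illustrates this with made-up stratum sizes and shows one common workaround (largest-remainder allocation); the helper `proportional_allocation` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def proportional_allocation(stratum_sizes: pd.Series, n_total: int) -> pd.Series:
    """Split n_total across strata proportionally, summing exactly to n_total."""
    exact = stratum_sizes * n_total / stratum_sizes.sum()
    base = np.floor(exact).astype(int)
    leftover = n_total - int(base.sum())
    # Give the remaining draws to the strata with the largest fractional parts.
    top_up = (exact - base).sort_values(ascending=False).index[:leftover]
    base.loc[top_up] += 1
    return base

# Hypothetical stratum sizes chosen so that naive rounding overshoots.
sizes = pd.Series({"A": 3, "B": 3, "C": 3, "D": 1})
N = 6

# Independent rounding: 1.8, 1.8, 1.8, 0.6 round to 2, 2, 2, 1 (total 7).
rounded = np.rint(sizes * N / sizes.sum()).astype(int)

# Largest-remainder allocation: 2, 2, 2, 0 (total 6).
alloc = proportional_allocation(sizes, N)

print(rounded.sum(), alloc.sum())
```

With large, similarly sized strata such as quartiles the discrepancy is usually at most a draw or two, so whether the extra step is worth it depends on how strictly the total of N must be met.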