Python pandas: stratified sampling by interval
I have a DataFrame of users' movie ratings. The .describe() method gives the following information:
              userId  ratings_per_user
count  137658.000000     137658.000000
mean    69247.463068         65.745514
std     39977.471244         67.071719
min         1.000000          1.000000
25%     34628.250000         22.000000
50%     69249.500000         41.000000
75%    103868.750000         84.000000
max    138493.000000        462.000000
Now I want to sample X users from each quartile:
X users with number of votes between min and 25%
X users with number of votes between 25% and 50%
X users with number of votes between 50% and 75%
X users with number of votes between 75% and max
ending up with a list of userIds of size 4X.
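So far I have written clumsy code that splits the DataFrame into 4 separate DataFrames according to the quartiles, samples X users from each, and then concatenates the results. If possible, I would like a simpler (and faster) solution.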
Edit:
A better solution:
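import numpy as np
import pandas as pd

# Total sample size desired
N = 100
# Count ratings per user (ratings holds one row per rating)
ratings_user = ratings.groupby('userId').size().reset_index(name='ratings_per_user')
# Label each user with a category, using the quartile cut points from .describe()
ratings_user['categ'] = np.where(ratings_user['ratings_per_user'] >= 84.0, 'A',
                        np.where(ratings_user['ratings_per_user'] >= 41.0, 'B',
                        np.where(ratings_user['ratings_per_user'] > 22.0, 'C', 'D')))
# Sample each category in proportion to its size, then shuffle the rows
ratings = (ratings_user.groupby('categ', group_keys=False)
           .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(ratings_user)))))
           .sample(frac=1).reset_index(drop=True))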
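Note that the code above draws from each category in proportion to its size, for roughly N rows in total. If the goal is exactly X users per quartile, as in the original question, one possible sketch uses pd.qcut together with DataFrameGroupBy.sample (available in pandas 1.1 and later). The synthetic data, the "quartile" column name, and the Q1..Q4 labels are assumptions made for illustration. pd.qcut also assigns boundary values with right-closed bins, so users sitting exactly on a cut point may land in a different group than with the hard-coded thresholds.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-user counts (hypothetical data, not the author's).
rng = np.random.default_rng(0)
ratings_user = pd.DataFrame({
    "userId": np.arange(1, 1001),
    "ratings_per_user": rng.integers(1, 463, size=1000),
})

X = 25  # number of users to draw from each quartile

# pd.qcut bins on the empirical quartiles, so the cut points are not hard-coded.
ratings_user["quartile"] = pd.qcut(
    ratings_user["ratings_per_user"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Draw X rows from every quartile group; random_state makes the draw repeatable.
sampled = (
    ratings_user.groupby("quartile", observed=True)
    .sample(n=X, random_state=0)
)

# Final list of 4*X userIds.
user_ids = sampled["userId"].tolist()
```

If a quartile holds fewer than X users, sample raises a ValueError unless replace=True is passed, so very small inputs may need a guard.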