Python pandas: stratified sampling by interval

I have a DataFrame of users' movie ratings. The .describe() method gives the following information:

         userId         ratings_per_user
count   137658.000000   137658.000000
mean    69247.463068    65.745514
std     39977.471244    67.071719
min     1.000000        1.000000
25%     34628.250000    22.000000
50%     69249.500000    41.000000
75%     103868.750000   84.000000
max     138493.000000   462.000000
Now I want to sample X users from each quartile:

X users with number of votes between min and 25%
X users with number of votes between 25% and 50%
X users with number of votes between 50% and 75%
X users with number of votes between 75% and max
The end result should be a list of userIds of size 4X.

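What I have so far is clumsy code that splits the DataFrame into 4 separate DataFrames based on the quartiles, samples X users from each one, and then concatenates the results. If possible, I would like a simpler (and faster) solution.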
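One way to get X users per quartile without splitting the DataFrame by hand is sketched below. It is only an illustration under stated assumptions: the data is synthetic (not the asker's), the column name quartile and the labels Q1 to Q4 are made up for the example, and it relies on pd.qcut plus DataFrameGroupBy.sample, which requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-user rating counts (assumption: not the real data)
rng = np.random.default_rng(0)
ratings_user = pd.DataFrame({
    "userId": np.arange(1, 1001),
    "ratings_per_user": rng.integers(1, 463, size=1000),
})

X = 5  # number of users to draw from each quartile

# Bin users into four quantile-based intervals of ratings_per_user.
# "quartile" and the Q1..Q4 labels are illustrative names.
ratings_user["quartile"] = pd.qcut(
    ratings_user["ratings_per_user"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Draw X users from every quartile; random_state only makes the draw repeatable
sampled = ratings_user.groupby("quartile", observed=True).sample(n=X, random_state=0)

# Final list of 4X userIds
user_ids = sampled["userId"].tolist()
```

Note that pd.qcut raises an error when heavy ties make two quantile edges identical; in that case passing duplicates="drop" (and dropping the fixed labels) is one possible workaround.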

EDIT:

A better solution:

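import numpy as np

# Total sample size desired
N = 100

# Count ratings per user (`ratings` is the raw DataFrame with one row per rating)
ratings_user = ratings.groupby(['userId']).size().reset_index(name='ratings_per_user')

# Label each user with a quartile bucket, using the cut points reported by .describe()
ratings_user['categ'] = np.where(ratings_user['ratings_per_user']>=84.0, 'A', 
                        np.where(ratings_user['ratings_per_user']>=41.0, 'B', 
                        np.where(ratings_user['ratings_per_user']>22.0, 'C', 'D'                              
                                )))
# Sample from each bucket in proportion to its size, then shuffle the rows.
# Note: this overwrites `ratings` with the sampled per-user DataFrame.
ratings = ratings_user.groupby('categ', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(ratings_user))))).sample(frac=1).reset_index(drop=True)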
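The snippet above allocates the sample proportionally to each bucket's size (rather than a fixed X per bucket) and rounds each allocation independently, so the total can differ slightly from N. A small sketch of that arithmetic, using hypothetical bucket sizes chosen only for illustration:

```python
import numpy as np
import pandas as pd

N = 100  # total sample size desired

# Hypothetical bucket sizes (assumption: not taken from the real data)
group_sizes = pd.Series({"A": 254, "B": 254, "C": 254, "D": 238})

# Same per-bucket formula as above: round(N * bucket size / total size)
alloc = np.rint(N * group_sizes / group_sizes.sum()).astype(int)

print(alloc.to_dict())  # {'A': 25, 'B': 25, 'C': 25, 'D': 24}
print(alloc.sum())      # 99, one short of N because of rounding
```

If exactly N rows are required, one option is to adjust the largest bucket by the difference after rounding.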