Python: sampling a DataFrame by column value


python python-3.x pandas


I have a pandas DataFrame named ratings_full that looks like this:

userID   nr_votes
123      12
124      14
234      22
346      35
763      45
238      1
127      17

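I want to sample this DataFrame because it contains tens of thousands of users. I want to extract 100 users, but in a way that gives priority to those with lower nr_votes values rather than sampling them uniformly. In other words, a kind of "stratified sampling" on nr_votes. Is that possible?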

This is what I have done so far:

SAMPLING_FRACTION = 0.0007

uid_samples = ratings_top['userId'] \
                        .drop_duplicates() \
                        .sample(frac=SAMPLING_FRACTION, 
                                replace=False, 
                                random_state=1)
ratings_sample = pd.merge(ratings_full, uid_samples, on='userId', how='inner')

This only gives a uniform random sample across userId and does nothing to ensure that the sample is stratified in any way.

Edit: I would even be happy if we could split nr_votes into N buckets and do stratified sampling over the buckets.
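The bucketing idea from this edit can be sketched with pd.qcut followed by a per-bucket draw. This is only an illustration under assumptions: the synthetic ratings_full, the number of buckets (N_BUCKETS) and the per-bucket count (PER_BUCKET) are made up for the example and are not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ratings_full: 10,000 users with skewed vote counts.
rng = np.random.default_rng(0)
ratings_full = pd.DataFrame({
    "userId": np.arange(10_000),
    "nr_votes": rng.geometric(p=0.05, size=10_000),
})

N_BUCKETS = 5    # assumed number of buckets
PER_BUCKET = 20  # assumed draw per bucket: 5 buckets * 20 users = 100 users

# Quantile-based buckets; duplicates="drop" merges bins whose edges coincide.
ratings_full["bucket"] = pd.qcut(
    ratings_full["nr_votes"], q=N_BUCKETS, labels=False, duplicates="drop"
)

# Draw the same number of users from every bucket, without replacement.
ratings_sample = ratings_full.groupby("bucket").sample(n=PER_BUCKET, random_state=1)
print(ratings_sample["bucket"].value_counts().sort_index())
```

Drawing an equal count from every bucket already over-represents sparse ranges of nr_votes compared with uniform sampling. To favour low values further, the per-bucket counts could be made unequal, for example by sampling each bucket separately with its own n.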

Edit 2: I am now trying:

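from sklearn.model_selection import train_test_split

# Features and labels must exist before they are passed in;
# train_test_split takes the arrays positionally, not as X= / y= keywords.
X = ratings_full.drop(['nr_votes'], axis=1)
y = ratings_full.nr_votes
X_train, X_test, y_train, y_test = train_test_split(X, y,
             test_size=0.33,
             random_state=42,
             stratify=y)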

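Then I have to put the DataFrame back together. It is not an ideal answer, but it might work. I may even bucket the values first and use the bucket column as my "label".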
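One way to avoid reassembling the DataFrame is to pass the whole frame as the only array and stratify on a separate bucket Series. The sketch below assumes synthetic data and five quantile buckets, neither of which comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for ratings_full.
rng = np.random.default_rng(0)
ratings_full = pd.DataFrame({
    "userId": np.arange(10_000),
    "nr_votes": rng.geometric(p=0.05, size=10_000),
})

# Assumed labels: five quantile buckets of nr_votes.
buckets = pd.qcut(ratings_full["nr_votes"], q=5, labels=False, duplicates="drop")

# The full DataFrame is the only array, so both outputs keep every column.
ratings_sample, ratings_rest = train_test_split(
    ratings_full, train_size=100, random_state=42, stratify=buckets
)
print(ratings_sample.shape, ratings_rest.shape)
```

Note that stratify keeps each bucket's share in the sample close to its share in the full data. It does not by itself push the sample towards low nr_votes.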

We can use np.random.choice and then slice by index:

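import numpy as np

n = len(ratings_top)
# 'probability' must hold per-row probabilities that sum to 1; size must be an integer
idx = np.random.choice(ratings_top.index.values, p=ratings_top['probability'], size=int(n * 0.0007), replace=True)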

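Then:

sample_df = df.loc[idx].copy()

You can use qcut to divide them into buckets.

@Roim Thanks, yes, I can use qcut if necessary, but the stratified sampling is the problem.

What is ratings_top['probability']?

@QubixQ You can define your probability from nr_votes, for example (1 - df.nr_votes / sum(df.nr_votes)).

Note that np.random.choice requires the probabilities to sum to 1.0, for example: prob = (df.nr_votes.max() / df.nr_votes) / df.nr_votes.sum(). Also, replace should probably be set to False.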
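The weighted draw discussed here could look like the following sketch. The inverse weighting 1 / nr_votes is an assumption chosen for illustration, not a formula taken from the thread. Because the weights are divided by their own sum, the probabilities add up to 1.0 regardless of the weighting used. The synthetic df is also hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; nr_votes is always >= 1 here.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "userId": np.arange(10_000),
    "nr_votes": rng.geometric(p=0.05, size=10_000),
})

# Assumed weighting: inverse of nr_votes, so users with few votes are favoured.
weights = 1.0 / df["nr_votes"]
prob = weights / weights.sum()  # normalise so the probabilities sum to 1.0

# Draw 100 distinct index labels with the given probabilities.
idx = np.random.default_rng(1).choice(
    df.index.values, size=100, replace=False, p=prob.to_numpy()
)
sample_df = df.loc[idx].copy()

# pandas can do a similar weighted draw directly; DataFrame.sample
# normalises the weights itself when they do not sum to 1.
sample_df2 = df.sample(n=100, weights=weights, random_state=1)
```

With replace=False every user appears at most once, which matches the goal of extracting 100 distinct users.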

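from sklearn.model_selection import StratifiedShuffleSplit

n_splits = 1
# Use the imported class directly; the model_selection module itself was never imported.
sss = StratifiedShuffleSplit(n_splits=n_splits,
                             test_size=0.1,
                             random_state=42)
# X holds the rows and y the stratification labels (for example nr_votes buckets)
train_idx, test_idx = next(sss.split(X, y))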
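The snippet above leaves X and y undefined. A possible end-to-end version, assuming y is a quantile bucket of nr_votes and the test side of the split is used as the 100-user sample, might look like this. The data and the bucket count are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for ratings_full.
rng = np.random.default_rng(0)
ratings_full = pd.DataFrame({
    "userId": np.arange(10_000),
    "nr_votes": rng.geometric(p=0.05, size=10_000),
})

# Assumed stratification labels: five quantile buckets of nr_votes.
y = pd.qcut(ratings_full["nr_votes"], q=5, labels=False, duplicates="drop")

# An integer test_size is read as an absolute number of rows.
sss = StratifiedShuffleSplit(n_splits=1, test_size=100, random_state=42)
rest_idx, sample_idx = next(sss.split(ratings_full, y))

# split() yields positional indices, so select rows with iloc rather than loc.
ratings_sample = ratings_full.iloc[sample_idx]
```

Since the indices are positional, this also works when the DataFrame index is not a simple 0..n-1 range.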