How to keep/extend the index when oversampling in Python

Tags: python, pandas, imbalanced-data, oversampling, smote

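I have a DataFrame like this, and I want to oversample on the column "role" (in the real case, the number of rows and columns is much larger than in this minimal example).

This is what I am doing:

X,y = smote.fit_sample(df,df[['role']])
X
   role  value
0     1      1
1     1      1
2     1      2
3     1      1
4     1      1
5     1      2
6     1      1
7     2      1
8     2      1
[.........]

This works, but the problem is that I need to keep the index (pop_13vdpn1_site_1, etc.). Is that possible?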

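First, process df and split it into features and target labels, X_train and y_train.

Now you can oversample:

X_train_over, y_train_over = smote.fit_resample(X_train, y_train)  # named fit_sample in older imbalanced-learn releases

Finally, build DataFrames from that output. For example:

X = pd.DataFrame(X_train_over, columns=X_train.columns)
y = pd.DataFrame(y_train_over, columns=y_train.columns)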
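
Building the DataFrames this way gives them a fresh default index, so the original row labels are not restored. A minimal sketch of one way to put them back, assuming the resampler returns the original rows first and in their original order, followed by the synthetic rows (worth checking against the imbalanced-learn version you have installed). The helper reattach_index and the "synthetic_" prefix are hypothetical names, and the resampled frame below is stand-in data, not real SMOTE output:

```python
import pandas as pd


def reattach_index(original_index, resampled, synthetic_prefix="synthetic_"):
    """Give the first len(original_index) rows their old labels and
    label the remaining rows with a made-up prefix plus a counter.

    Assumption: the original rows come first, in their original order.
    """
    n_original = len(original_index)
    n_synthetic = len(resampled) - n_original
    if n_synthetic < 0:
        raise ValueError("resampled output is shorter than the original data")
    new_labels = list(original_index) + [
        f"{synthetic_prefix}{i}" for i in range(n_synthetic)
    ]
    out = resampled.copy()
    out.index = pd.Index(new_labels)
    return out


# Stand-in data: three original rows, then two rows playing the part of
# synthetic samples (these were NOT produced by imblearn).
original = pd.DataFrame(
    {"role": [1, 1, 2], "value": [1, 2, 1]},
    index=["pop_13vdpn1_site_1", "pop_13vdpn1_site_1", "pop_13vdpn1_site_2"],
)
resampled = pd.DataFrame({"role": [1, 1, 2, 2, 2], "value": [1, 2, 1, 1, 1]})

restored = reattach_index(original.index, resampled)
print(restored)
```

In real use, `resampled` would be the DataFrame built from the resampler's output. Synthetic rows have no natural site label, which is why they get a placeholder here.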

In the end, I found a workaround (probably not optimal).


The following steps should do it:

import io
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
Sample data:

df = pd.read_csv(io.StringIO("""
role  value
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 2
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 2
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 2 1
    pop_13vdpn1_site_1 2 1
    pop_13vdpn1_site_1 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 2
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_3 2 1
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 2
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 1 2
    pop_13vdpn1_site_1 1 1
    pop_13vdpn1_site_1 2 1
    pop_13vdpn1_site_1 2 1
    pop_13vdpn1_site_1 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 2
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_2 2 1
    pop_13vdpn1_site_3 2 1
"""), sep=r"\s+", engine="python")  # raw string avoids an invalid-escape warning

df = df.reset_index()
The shape should now be (40, 3).

SMOTE accepts arrays, so we need to define the X and y values:

X_train = np.array(df['role']).reshape(40,1)
y_train = np.array(df['value']).reshape(40,)
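
The reshape calls above hard-code the row count of this sample. A small sketch of the same split that does not depend on the number of rows, using a tiny stand-in frame (the values here are illustrative, not the sample data):

```python
import pandas as pd

# Stand-in frame with the same column names as the example.
df = pd.DataFrame({"role": [1, 1, 2, 2], "value": [1, 2, 1, 1]})

# Double brackets select a one-column DataFrame, so the result is a
# 2-D feature matrix of shape (n_rows, 1).
X_train = df[["role"]].to_numpy()

# Single brackets select a Series, so the result is a 1-D target vector.
y_train = df["value"].to_numpy()

print(X_train.shape, y_train.shape)
```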
SMOTE in action:

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X,y = sm.fit_resample(X_train,y_train)
Put the resulting X and y into a DataFrame:

ndf = pd.DataFrame({'role':X.reshape(68,), 'value':y})
Rebuild the original names:

ndf['name'] = ndf['role'].apply(lambda x: 'pop_13vdpn1_site_'+str(x))
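
One caveat: SMOTE creates synthetic samples by interpolating between existing feature values, so "role" may come back as a non-integer for synthetic rows, and str(x) would then yield labels like "pop_13vdpn1_site_1.37". Also, in the sample data the site label and "role" do not map one to one (site_1 appears with both role 1 and role 2), so this step derives a label from "role" rather than recovering the original index. A hedged sketch that at least keeps the labels well-formed by rounding first; the values are stand-ins, and rounding is my own assumption about what is acceptable here:

```python
import pandas as pd

# Stand-in "role" values: two original-looking rows and two that look
# like interpolated synthetic rows.
ndf = pd.DataFrame({"role": [1.0, 2.0, 1.37, 1.82]})

# Round to the nearest integer before building the label.
role_id = ndf["role"].round().astype(int)
ndf["name"] = "pop_13vdpn1_site_" + role_id.astype(str)

print(ndf)
```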
Check whether the data is more balanced:

from collections import Counter
print(Counter(df['role']))   # before oversampling
print(Counter(ndf['role']))  # after oversampling
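
Since the target passed to SMOTE in these steps is "value", counting that column is what shows the rebalancing most directly. A small sketch with stand-in lists that mirror the counts in the sample (34 ones and 6 twos before, and the 68 rows implied by reshape(68,) after); imbalance_ratio is a hypothetical helper:

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Stand-in target columns, not computed by SMOTE here.
value_before = [1] * 34 + [2] * 6
value_after = [1] * 34 + [2] * 34

print(Counter(value_before), imbalance_ratio(value_before))
print(Counter(value_after), imbalance_ratio(value_after))
```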

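Hi Giorgos, but if I do it that way, I get NaN values for X and y.

@psagrera What is the output if you don't provide the index argument?

role value 0 1 1

Interesting! I ran into a very similar situation, and your suggestion is what I am going to try now. Thanks for sharing!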