Python: how to one-hot encode a variable with 3k categories without a MemoryError
I am one-hot encoding a variable that has over 3k categories and running into a MemoryError. I also one-hot encode other variables, but they have fewer categories; the largest number of categories for a variable that I can encode successfully is 935. I am using the following code:
from sklearn.preprocessing import OneHotEncoder

def onehot(featurename):
    onehot_encoder = OneHotEncoder(sparse=False)
    # .values gives the underlying NumPy array; a pandas Series has no .reshape
    onehot_encoded = onehot_encoder.fit_transform(df[featurename].values.reshape(-1, 1))
    trn_onehot_encoded = onehot_encoded[msk]
    val_onehot_encoded = onehot_encoded[~msk]
    return trn_onehot_encoded, val_onehot_encoded
trn_onehot_encoded_mt, val_onehot_encoded_mt = onehot('modality_type')
trn_onehot_encoded_mr, val_onehot_encoded_mr = onehot('roleid')
trn_onehot_encoded_sub, val_onehot_encoded_sub = onehot('subject')
trn_onehot_encoded_quartile, val_onehot_encoded_quartile = onehot('quartile')
trn_onehot_encoded_country, val_onehot_encoded_country = onehot('country_short')
trn_onehot_encoded_region, val_onehot_encoded_region = onehot('region')
trn_onehot_encoded_groupmemberornot, val_onehot_encoded_groupmemberornot = onehot('groupmemberornot')
trn_onehot_encoded_highlight, val_onehot_encoded_highlight = onehot('highlight_bin_new')
trn_onehot_encoded_note, val_onehot_encoded_note = onehot('note_bin_new')
trn_onehot_encoded_eid, val_onehot_encoded_eid = onehot('new_eid')
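A back-of-the-envelope estimate shows why the last call dies: with `sparse=False` the encoder materializes a dense float64 matrix of rows × categories. A minimal sketch, assuming ~3000 categories for `new_eid` and the row counts from the training log below:

```python
# Rough memory estimate for the dense one-hot matrix (sizes assumed
# from the training log: ~2.35M rows total, ~3000 categories).
rows = 2_116_850 + 234_276   # train + validation samples
categories = 3000            # new_eid has over 3k categories
bytes_per_value = 8          # OneHotEncoder returns float64 by default

gigabytes = rows * categories * bytes_per_value / 1024**3
print(f"{gigabytes:.1f} GB")  # prints roughly 52.6 GB
```

Tens of gigabytes for a single feature comfortably exceeds most machines' RAM, which matches the MemoryError / dead kernel.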
The last line, where I encode the variable new_eid, is where I get the MemoryError or a dead kernel. To try to work around this, I set the sparse field of the OneHotEncoder in onehot() to True. The code that fits the model with sparse=True is as follows:
<All the code above with sparse=True>
mt = Input(shape=(trn_onehot_encoded_mt.shape[1],))
mr = Input(shape=(trn_onehot_encoded_mr.shape[1],))
sub = Input(shape=(trn_onehot_encoded_sub.shape[1],))
gmon = Input(shape=(trn_onehot_encoded_groupmemberornot.shape[1],))
region = Input(shape=(trn_onehot_encoded_region.shape[1],))
country = Input(shape=(trn_onehot_encoded_country.shape[1],))
highlight = Input(shape=(trn_onehot_encoded_highlight.shape[1],))
note = Input(shape=(trn_onehot_encoded_note.shape[1],))
#Model definition
x = merge([u, a], mode='concat')
x = Flatten()(x)
x = merge([x, mt], mode='concat')
x = merge([x, mr], mode='concat')
x = merge([x, sub], mode='concat')
x = merge([x, gmon], mode='concat')
x = merge([x, region], mode='concat')
x = merge([x, country], mode='concat')
x = merge([x, highlight], mode='concat')
x = merge([x, note], mode='concat')
x = Dense(1000, activation='relu')(x)
BatchNormalization()
Dropout(0.5)
x = Dense(200, activation='relu')(x)
BatchNormalization()
Dropout(0.5)
x = Dense(50, activation='relu')(x)
BatchNormalization()
x = Dense(2, activation='softmax')(x)
nn = Model([user_in, artifact_in, mt, mr, sub, gmon, region, country, highlight, note], x)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
def fit_nn(lr, bs):
    nn.optimizer.lr = lr
    nn.fit([trn.member_id,
            trn.artifact_id,
            trn_onehot_encoded_mt,
            trn_onehot_encoded_mr,
            trn_onehot_encoded_sub,
            trn_onehot_encoded_groupmemberornot,
            trn_onehot_encoded_region,
            trn_onehot_encoded_country,
            trn_onehot_encoded_highlight,
            trn_onehot_encoded_note], trn_onehot_encoded_quartile,
           batch_size=bs,
           epochs=1,
           validation_data=([val.member_id,
                             val.artifact_id,
                             val_onehot_encoded_mt,
                             val_onehot_encoded_mr,
                             val_onehot_encoded_sub,
                             val_onehot_encoded_groupmemberornot,
                             val_onehot_encoded_region,
                             val_onehot_encoded_country,
                             val_onehot_encoded_highlight,
                             val_onehot_encoded_note], val_onehot_encoded_quartile)
           )

bs = 10000
fit_nn(0.001, bs)
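As a sanity check that sparse output does avoid the dense blow-up, here is a minimal sketch with toy numeric data (scikit-learn's OneHotEncoder returns a SciPy sparse matrix by default; in the older API shown above the flag was spelled sparse=True):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in column: the real new_eid column has 3k+ categories.
col = np.array([0, 1, 2, 0, 1]).reshape(-1, 1)

# Sparse output keeps the result as a SciPy CSR matrix: memory grows with
# the number of non-zeros (exactly one per row), not rows * categories.
enc = OneHotEncoder()          # sparse output is the default
encoded = enc.fit_transform(col)

print(sparse.issparse(encoded))  # True
print(encoded.shape)             # (5, 3)
print(encoded.nnz)               # 5 non-zeros, one per row
```

Boolean-mask row selection like `onehot_encoded[msk]` still works on a CSR matrix, so the onehot() function itself needs no other change.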
However, when I try to fit the model, I get the following error:
Train on 2116850 samples, validate on 234276 samples
Epoch 1/1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-8ce1b684763f> in <module>()
----> 1 fit_nn(0.001, bs)
<ipython-input-30-3e1be8cadb04> in fit_nn(lr, bs)
23 val_onehot_encoded_country,
24 val_onehot_encoded_highlight,
---> 25 val_onehot_encoded_note], val_onehot_encoded_quartile)
26 )
/home/prateek_dl/anaconda3/lib/python3.5/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1596 initial_epoch=initial_epoch,
1597 steps_per_epoch=steps_per_epoch,
-> 1598 validation_steps=validation_steps)
1599
1600 def evaluate(self, x, y,
/home/prateek_dl/anaconda3/lib/python3.5/site-packages/keras/engine/training.py in _fit_loop(self, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
1181 batch_logs['size'] = len(batch_ids)
1182 callbacks.on_batch_begin(batch_index, batch_logs)
-> 1183 outs = f(ins_batch)
1184 if not isinstance(outs, list):
1185 outs = [outs]
/home/prateek_dl/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2271 updated = session.run(self.outputs + [self.updates_op],
2272 feed_dict=feed_dict,
-> 2273 **self.session_kwargs)
2274 return updated[:len(self.outputs)]
2275
/home/prateek_dl/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
893 try:
894 result = self._run(None, fetches, feed_dict, options_ptr,
--> 895 run_metadata_ptr)
896 if run_metadata:
897 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/home/prateek_dl/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1091 feed_handles[subfeed_t] = subfeed_val
1092 else:
-> 1093 np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
1094
1095 if (not is_tensor_handle_feed and
/home/prateek_dl/anaconda3/lib/python3.5/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
480
481 """
--> 482 return array(a, dtype, copy=False, order=order)
483
484 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
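The traceback ends inside `np.asarray(subfeed_val, dtype=subfeed_dtype)`: this era of Keras pushes every input through `np.asarray`, which cannot convert a SciPy sparse matrix, hence "setting an array element with a sequence". One workaround (a sketch, not the asker's code; `dense_batches` is a hypothetical helper) is to keep the encodings sparse and densify only one batch at a time via a generator passed to `fit_generator`:

```python
import numpy as np
from scipy import sparse

def dense_batches(inputs, y, batch_size):
    """Yield (inputs, targets) with sparse blocks densified per batch.

    Densifying one batch at a time caps peak memory at roughly
    batch_size * n_categories values instead of n_rows * n_categories.
    """
    n = y.shape[0]
    while True:  # Keras generators loop forever; steps_per_epoch bounds an epoch
        for start in range(0, n, batch_size):
            sl = slice(start, start + batch_size)
            batch = [x[sl].toarray() if sparse.issparse(x) else np.asarray(x[sl])
                     for x in inputs]
            yield batch, y[sl]
```

With this, `nn.fit(...)` would become `nn.fit_generator(dense_batches(train_inputs, trn_onehot_encoded_quartile, bs), steps_per_epoch=int(np.ceil(n_train / bs)), ...)`, where `train_inputs` is the same list of arrays currently passed to fit().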