Python: Keras classification model for the UNSW-NB15 dataset


Using a neural network and one-hot encoded labels (Normal, Fuzzers, Analysis, Backdoors, etc.), I am trying to classify network connections. The original dataset contains 2,540,047 connections together with their classifications. I have removed the IP addresses and ports from the original dataset. Training data: 2,100,000; validation data: 220,000; test data: 220,047.

Original record:

srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,service,Sload,Dload,Spkts,Dpkts,swin,dwin,stcpb,dtcpb,smeansz,dmeansz,trans_depth,res_bdy_len,Sjit,Djit,Stime,Ltime,Sintpkt,Dintpkt,tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,Label
59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,29,0,0,dns,500473.9375,621800.9375,2,2,0,0,0,0,66,82,0,0,0,0,1421927414,1421927414,0.017,0.013,0,0,0,0,0,0,0,0,3,7,1,3,1,1,1,,0
Processed (IPs and ports removed, all strings hashed with CRC32 and converted to floats):
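As a rough illustration, the CRC32-to-float step described above could look like the following sketch. The function name `crc32_to_float` and the normalization by the 32-bit maximum are my assumptions, not the asker's exact code:

```python
import zlib

def crc32_to_float(value: str) -> float:
    """Hash a categorical string (e.g. 'dns', 'udp') to a float in [0, 1]."""
    return zlib.crc32(value.encode('utf-8')) / 0xFFFFFFFF

# Each categorical column of a record is replaced by its hashed value
row = [crc32_to_float(s) for s in ('udp', 'CON', 'dns')]
```

The hash is deterministic, so the same string always maps to the same float, but note that the resulting number carries no ordinal meaning for the network.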

The model I am trying to train looks like this:

from tensorflow.keras import models, layers, optimizers, losses

model = models.Sequential()
# input_shape must go on the first layer; on a later layer it is ignored
model.add(layers.BatchNormalization(input_shape=(len(train_data[0]),)))
model.add(layers.Dense(160, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
              loss=losses.categorical_crossentropy,
              metrics=['accuracy'])

# callbacks_list is defined elsewhere in my code
history = model.fit(train_data, train_labels, batch_size=256, epochs=100,
                    callbacks=callbacks_list,
                    validation_data=(validation_data, validation_labels))
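Suspiciously high accuracy on UNSW-NB15 is often just class imbalance: the vast majority of connections are normal traffic, so a model can score well overall while misclassifying the rare attack classes. A quick sanity check is to compare the argmax of each softmax prediction against the argmax of the one-hot label. A minimal sketch in plain Python (the helper names are mine, not from the question):

```python
def argmax(row):
    """Index of the largest value in a list of class scores."""
    return max(range(len(row)), key=row.__getitem__)

def overall_accuracy(predictions, labels):
    """Fraction of rows where the predicted class matches the one-hot label."""
    hits = sum(argmax(p) == argmax(y) for p, y in zip(predictions, labels))
    return hits / len(labels)

preds  = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [[0, 1],     [1, 0],     [1, 0]]
print(overall_accuracy(preds, labels))  # 2 of 3 correct -> 0.666...
```

Computing the same quantity per class (or a full confusion matrix) would reveal whether the minority attack categories are actually being learned.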
However, the output looks strange to me. Some predictions come out with 100% confidence. Here are the model's predictions after the third epoch: 96.3% accuracy on the test data (data the network has never seen).

This is the best accuracy I have achieved so far. I must be doing something wrong.


EDIT1: Link to

Could you add the headers and the data-processing code? Do you have a sample of the dataset available? It might be worth a look. @DesmondCheong The data preprocessing could well be the problem; I am only a beginner in ML, and 90% of tutorials focus on images or visual prediction, where an image is just a 2D array of values 0-255. I use CRC32 to convert strings such as "dns" and "http" into floats (I also check for possible collisions). I think the preprocessing code is too long to post here; here is a link.
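The collision check mentioned above can be sketched as follows: hash every distinct string and verify the hashes are also distinct (this assumes the set of categorical values fits in memory; the function name is illustrative):

```python
import zlib

def has_crc32_collisions(values):
    """True if two distinct strings map to the same CRC32 hash."""
    distinct = set(values)
    hashes = {zlib.crc32(v.encode('utf-8')) for v in distinct}
    return len(hashes) != len(distinct)

# Categorical values such as the 'proto' or 'service' columns
protocols = ['udp', 'tcp', 'dns', 'http', 'ftp', 'smtp']
print(has_crc32_collisions(protocols))
```

With only a handful of protocol and service names, CRC32 collisions are very unlikely, but checking is cheap and rules the issue out.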
train_data[0]:
[9.8803210e-01 4.1834772e-01 1.0550000e-03 1.3200000e+02 1.6400000e+02
 3.1000000e+01 2.9000000e+01 0.0000000e+00 0.0000000e+00 5.1144195e-01
 5.0047394e+05 6.2180094e+05 2.0000000e+00 2.0000000e+00 0.0000000e+00
 0.0000000e+00 0.0000000e+00 0.0000000e+00 6.6000000e+01 8.2000000e+01
 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 1.4219274e+09
 1.4219274e+09 1.7000001e-02 1.3000000e-02 0.0000000e+00 0.0000000e+00
 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
 0.0000000e+00 3.0000000e+00 7.0000000e+00 1.0000000e+00 3.0000000e+00
 1.0000000e+00 1.0000000e+00 1.0000000e+00]
train_labels[0]:
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
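One-hot vectors like `train_labels[0]` above map a class index to a 10-element vector with a single 1. A minimal library-free sketch (which class index corresponds to which attack category is not shown in the question, so the mapping here is hypothetical):

```python
def one_hot(index: int, num_classes: int = 10) -> list:
    """Return a one-hot vector with 1.0 at the given class index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

# e.g. index 0 might be the 'Normal' class in this hypothetical ordering
label = one_hot(0)
```

`keras.utils.to_categorical` does the same conversion for an array of integer class ids.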
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[4.2333316e-02 5.8378032e-03 5.7929559e-03 7.0942775e-03 1.6567650e-01
 6.4145941e-01 1.5754247e-02 1.1445084e-01 1.0187953e-03 5.8192987e-04]
[0.29578227 0.4866582  0.0014564  0.00338989 0.02311182 0.09311022
 0.0637349  0.02357885 0.00857235 0.0006052 ]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]