How should I arrange a dataset for multi-label classification in Python?
My question is how I should prepare my training and label data for multi-label logistic regression. I tried to find this online, but most tutorials use a library that does the work for them. So, if my dataset looks like this:
input_data                                  labels
['aa', 'bb', 'cc', 'dd', 'ee']              ['n1', 'n5']
['rr', 'ff', 'gg', 'hh', 'ii', 'jj']        ['g1', 'g5']
['kk', 'll', 'mm', 'nn', 'oo', 'pp']        ['y1', 'y2', 'y3']
['qq', 'rr', 'ss', 'tt', 'uu', 'vv', 'ww']  ['y1', 'y2', 'z1', 'z2']
I built the vocabulary:
#building vocabulary
vocabulary = {'bb': 1, 'ff': 6, 'll': 12, 'hh': 8, 'rr': 18, 'tt': 20, 'gg': 7, 'vv': 22, 'jj': 10, 'nn': 14, 'qq': 17, 'kk': 11, 'cc': 2, 'mm': 13, 'ee': 4, 'ww': 23, 'ii': 9, 'oo': 15, 'ss': 19, 'uu': 21, 'pp': 16, 'aa': 0, 'dd': 3}
# building the list of all labels
labels =['y3', 'n1', 'g1', 'g5', 'y1', 'y2', 'n5', 'z1', 'z2']
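The vocabulary and label list above can be derived directly from the raw data. A minimal sketch (note: the integer ids here will differ from the question's mapping, because this version reserves 0 for the pad token; in the question, 'aa' is assigned 0 and 0 is also used for padding, which makes the two indistinguishable):

```python
input_data = [
    ['aa', 'bb', 'cc', 'dd', 'ee'],
    ['rr', 'ff', 'gg', 'hh', 'ii', 'jj'],
    ['kk', 'll', 'mm', 'nn', 'oo', 'pp'],
    ['qq', 'rr', 'ss', 'tt', 'uu', 'vv', 'ww'],
]
labels_raw = [
    ['n1', 'n5'],
    ['g1', 'g5'],
    ['y1', 'y2', 'y3'],
    ['y1', 'y2', 'z1', 'z2'],
]

# Map each distinct token to an integer id, starting at 1
# so that 0 stays free for padding (sorted for reproducibility).
vocabulary = {tok: i + 1
              for i, tok in enumerate(sorted({t for row in input_data for t in row}))}

# Collect every distinct label across all rows.
all_labels = sorted({l for row in labels_raw for l in row})

print(len(vocabulary))  # 23 distinct tokens
print(all_labels)       # ['g1', 'g5', 'n1', 'n5', 'y1', 'y2', 'y3', 'z1', 'z2']
```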
Next, I pad the data:
# padding to the longest sequence
[0, 1, 2, 3, 4, 0, 0]
[18, 6, 7, 8, 9, 10, 0]
[11, 12, 13, 14, 15, 16, 0]
[17, 18, 19, 20, 21, 22, 23]
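The padding step can be sketched as follows, using the question's encoded ids (note that with this vocabulary the pad value 0 collides with the id for 'aa'; reserving a dedicated pad id avoids that ambiguity):

```python
# Encoded sequences from the vocabulary above (question's ids).
encoded = [
    [0, 1, 2, 3, 4],
    [18, 6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15, 16],
    [17, 18, 19, 20, 21, 22, 23],
]

# Right-pad every sequence with 0 up to the longest length.
max_len = max(len(seq) for seq in encoded)
padded = [seq + [0] * (max_len - len(seq)) for seq in encoded]

print(padded[0])  # [0, 1, 2, 3, 4, 0, 0]
```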
Everything is fine up to here. The confusion now is how to feed my labels to the neural network, since each input has multiple classes.
Should I use a one-hot (multi-hot) encoding approach:
padded input_data               one_hot labels
[0, 1, 2, 3, 4, 0, 0]           [0, 1, 0, 0, 0, 0, 1, 0, 0]  # ['n1', 'n5']
[18, 6, 7, 8, 9, 10, 0]         [0, 0, 1, 1, 0, 0, 0, 0, 0]  # ['g1', 'g5']
[11, 12, 13, 14, 15, 16, 0]     [1, 0, 0, 0, 1, 1, 0, 0, 0]  # ['y1', 'y2', 'y3']
[17, 18, 19, 20, 21, 22, 23]    [0, 0, 0, 0, 1, 1, 0, 1, 1]  # ['y1', 'y2', 'z1', 'z2']
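This multi-hot table can be produced with a small sketch (hypothetical helper name `multi_hot`, reusing the label list built above):

```python
all_labels = ['y3', 'n1', 'g1', 'g5', 'y1', 'y2', 'n5', 'z1', 'z2']
label_to_idx = {l: i for i, l in enumerate(all_labels)}

def multi_hot(row_labels):
    """Set a 1 at the position of every label the row carries."""
    vec = [0] * len(all_labels)
    for l in row_labels:
        vec[label_to_idx[l]] = 1
    return vec

print(multi_hot(['n1', 'n5']))        # [0, 1, 0, 0, 0, 0, 1, 0, 0]
print(multi_hot(['y1', 'y2', 'y3']))  # [1, 0, 0, 0, 1, 1, 0, 0, 0]
```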
Or should I use a second approach:
[[0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0]]  # [['n1'], ['n5']]
[[0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0]]  # [['g1'], ['g5']]
[[1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0]]  # [['y1'], ['y2'], ['y3']]
[[0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1]]  # [['y1'], ['y2'], ['z1'], ['z2']]
Or an index-based approach:
[0, 1, 2, 3, 4, 0, 0]           [1, 6]
[18, 6, 7, 8, 9, 10, 0]         [2, 3]
[11, 12, 13, 14, 15, 16, 0]     [4, 5, 0]
[17, 18, 19, 20, 21, 22, 23]    [4, 5, 7, 8]
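The index lists and the one-hot representations carry the same information and can be converted into one another. A sketch (hypothetical helper names) showing that summing the stacked per-label one-hot rows from the second approach reproduces the multi-hot vector from the first:

```python
num_labels = 9  # ['y3', 'n1', 'g1', 'g5', 'y1', 'y2', 'n5', 'z1', 'z2']

def indices_to_multi_hot(indices):
    """First approach: set 1 at every listed index."""
    vec = [0] * num_labels
    for i in indices:
        vec[i] = 1
    return vec

def one_hot(i):
    """One row of the second approach: a single 1 at position i."""
    return [1 if j == i else 0 for j in range(num_labels)]

indices = [4, 5, 7, 8]                   # ['y1', 'y2', 'z1', 'z2']
stacked = [one_hot(i) for i in indices]  # second approach: one row per label
summed = [sum(col) for col in zip(*stacked)]

print(indices_to_multi_hot(indices) == summed)  # True
```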
For single-label classification, I normally take the argmax of the probability distribution, like this:
probs = tf.nn.softmax(logits)
preds = tf.argmax(probs, axis=-1)  # take the class with the highest probability
But in multi-label classification, how do we get the result from the distribution?

To predict the output, instead of taking only the maximum value as in a single-label problem, you need to define a threshold: every class whose value exceeds it is labeled 1. Below is an example illustrating a multi-label problem:
import tensorflow as tf  # TensorFlow 1.x

# Sample inputs
y_a = tf.constant([[0, 1, 0, 0, 1, 0, 0, 1, 0]], tf.float32)  # multi-label target
y_b = tf.constant([[0, 1, 0, 0, 0, 0, 0, 0, 0]], tf.float32)  # single-label target, for comparison
logits = tf.constant([[0.2, 0.8, 0.4, 0.5, 0.8, 0.4, 0.2, 0.8, 0.4]], tf.float32)

# Threshold for classification
thres = 0.7

# Sigmoid cross-entropy treats each class independently,
# which is what a multi-label problem needs.
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=y_a))
loss_1 = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=y_b))

# Per-class probabilities
probs = tf.nn.sigmoid(logits)

# For a threshold of 0.5, tf.round() is enough
acc_thresh_0_5 = tf.reduce_mean(tf.cast(tf.equal(tf.round(probs), y_a), tf.float32))

# Variable threshold: everything below `thres` becomes 0, the rest 1
binary_preds = tf.where(tf.less(probs, thres), tf.zeros(tf.shape(probs)), tf.ones(tf.shape(probs)))
acc_var_thresh = tf.reduce_mean(tf.cast(tf.equal(binary_preds, y_a), tf.float32))

with tf.Session() as sess:
    print(loss.eval())            # 0.7136336
    print(loss_1.eval())          # 0.89141136
    print(acc_thresh_0_5.eval())  # 0.33333334 (every sigmoid exceeds 0.5, so all classes are predicted 1)
    print(acc_var_thresh.eval())  # 0.6666667 (every sigmoid is below 0.7, so all classes are predicted 0)
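Once you have the thresholded per-class values, mapping them back to label names is just collecting every index above the threshold. A sketch in plain Python, assuming the label ordering from the question and treating the sample values as per-class probabilities:

```python
all_labels = ['y3', 'n1', 'g1', 'g5', 'y1', 'y2', 'n5', 'z1', 'z2']
probs = [0.2, 0.8, 0.4, 0.5, 0.8, 0.4, 0.2, 0.8, 0.4]
thres = 0.7

# Keep the name of every class whose probability reaches the threshold.
predicted = [all_labels[i] for i, p in enumerate(probs) if p >= thres]
print(predicted)  # ['n1', 'y1', 'z1']
```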
Thanks for the reply, and sorry for the confusion: the problem is multi-label, not multi-class. I tried to describe my problem here; please take a look.

This is multi-label classification, similar to what you did in the link above, except that I have a variable threshold. Is y_a the input and y_b the label? Also, why use softmax there when so many people suggest sigmoid?

y_a and y_b are both labels: the first is multi-label, the other single-label, just to compare the losses computed with reduce_mean.

OK, but I still shouldn't use softmax there, since many people say to use sigmoid (tf.nn.sigmoid) there, right?