Python: multi-label prediction with a DNN

python tensorflow deep-learning

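I'm trying to predict several labels for a given text. It works for a single label, but I don't know how to get confidence scores for multi-label prediction.

My data is in the following denormalized format: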

┌────┬──────────┬────────┐
│ id │  Topic   │  Text  │
├────┼──────────┼────────┤
│  1 │ Apples   │ FooBar │
│  1 │ Oranges  │ FooBar │
│  1 │ Kiwis    │ FooBar │
│  2 │ Potatoes │ BazBak │
│  3 │ Carrot   │ BalBan │
└────┴──────────┴────────┘

Each text can be assigned one or more topics. This is what I've come up with so far. First, I prepare the data (tokenizing, stemming, and so on):

import random

import numpy as np
import tensorflow as tf
import tflearn

df = ...  # read data from csv
categories = [ "Apples", "Oranges", "Kiwis", "Potatoes", "Carrot"]
words = []
docs = []

for index, row in df.iterrows():
    stems = tokenize_and_stem(row, stemmer)
    words.extend(stems)
    docs.append((stems, row[1]))

# remove duplicates
words = sorted(list(set(words)))

# create training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(categories)


for doc in docs:
    # list of tokenized words for the pattern
    token_words = doc[0]

    # bag of words (bow): 1 if the vocabulary word occurs in this document, else 0
    bow = [1 if w in token_words else 0 for w in words]

    output_row = list(output_empty)
    output_row[categories.index(doc[1])] = 1

    # each training example pairs the bag-of-words vector with the output row that marks which category that bow belongs to
    training.append([bow, output_row])

# shuffle the examples, then split them into features and labels
random.shuffle(training)

# train_x contains the bag of words and train_y contains the label / category
# (np.array(training) is avoided: the [bow, output_row] pairs are ragged, which recent NumPy rejects)
train_x = [bow for bow, _ in training]
train_y = [output_row for _, output_row in training]
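Note that the loop above emits one single-label example per (text, topic) row, so the same text appears several times with different one-hot targets. For multi-label training, each text would instead carry a single target vector with a 1 for every topic it belongs to. A minimal sketch of that grouping, assuming rows shaped like the table above; the `rows` list and the `build_multi_hot` helper are hypothetical names used only for illustration:

```python
from collections import defaultdict

categories = ["Apples", "Oranges", "Kiwis", "Potatoes", "Carrot"]

# (id, topic, text) rows in the same denormalized shape as the table above
rows = [
    (1, "Apples", "FooBar"),
    (1, "Oranges", "FooBar"),
    (1, "Kiwis", "FooBar"),
    (2, "Potatoes", "BazBak"),
    (3, "Carrot", "BalBan"),
]


def build_multi_hot(rows, categories):
    """Collapse denormalized rows into one (text, multi-hot target) pair per id."""
    index = {name: i for i, name in enumerate(categories)}
    texts = {}
    targets = defaultdict(lambda: [0] * len(categories))
    for doc_id, topic, text in rows:
        texts[doc_id] = text
        targets[doc_id][index[topic]] = 1  # set one bit per topic of this text
    return [(texts[doc_id], targets[doc_id]) for doc_id in sorted(texts)]


examples = build_multi_hot(rows, categories)
for text, target in examples:
    print(text, target)
# FooBar [1, 1, 1, 0, 0]
# BazBak [0, 0, 0, 1, 0]
# BalBan [0, 0, 0, 0, 1]
```

The text of each pair would then go through the same tokenize, stem, and bag-of-words steps as before.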

Next, I build and train my model:

# reset underlying graph data
tf.reset_default_graph()
# Build neural network
net = tflearn.input_data(shape=[None, len(train_x[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(train_y[0]), activation='softmax')
net = tflearn.regression(net)

# Define model and setup tensorboard
model = tflearn.DNN(net, tensorboard_dir='tflearn_logs')
# Start training (apply gradient descent algorithm)
model.fit(train_x, train_y, n_epoch=1000, batch_size=8, show_metric=True)
model.save('model.tflearn')

After that, I try to predict my topics:

df = ...  # read data from excel

for index, row in df.iterrows():
    prediction = model.predict([get_bag_of_words(row[2])])
    # report the single highest-scoring category for this row
    print(categories[np.argmax(prediction)])

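As you can see, I pick the maximum of prediction, which works well for a single topic. To select multiple topics, I need some kind of confidence score or something else that tells me when to stop, because I can't just blindly set an arbitrary threshold.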

Any suggestions?

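Don't use a softmax activation on the output layer; use a sigmoid activation instead. Your loss function should still be cross-entropy, applied to each output independently (binary cross-entropy). That is the key change for multi-label classification.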
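To make the loss concrete, here is a framework-free sketch of cross-entropy computed per output and averaged over the labels. It is not TFLearn code; the function name `binary_cross_entropy`, the `eps` clipping value, and the example vectors are assumptions chosen for illustration:

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


target = [1, 1, 0]  # the text belongs to the first two classes

# confident, independent scores for both true classes
good = binary_cross_entropy(target, [0.99, 0.99, 0.01])
# scores forced to share one unit of probability mass
poor = binary_cross_entropy(target, [0.49, 0.49, 0.02])

print(good, poor)  # the first loss is much smaller than the second
```

Because each label contributes its own term, the loss rewards scoring several classes close to 1 at the same time.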

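The problem with softmax is that it produces a probability distribution over the outputs. So if classes A and B are both strongly represented, a softmax over three classes might give something like [0.49, 0.49, 0.02], whereas you would prefer something like [0.99, 0.99, 0.01].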

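A sigmoid activation does exactly that: it squashes the real-valued logits (the values of the last layer before the transformation is applied) into the range [0, 1], which is required for the cross-entropy loss. It does this independently for each output.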
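The difference can be checked numerically on a single set of logits. This is a small pure-Python sketch; the logit values are made up so that two classes are strongly active:

```python
import math


def softmax(logits):
    """Normalize logits into a distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Squash one logit into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))


logits = [4.6, 4.6, -4.6]  # classes A and B strongly active, C inactive

soft = softmax(logits)
sig = [sigmoid(z) for z in logits]

print([round(p, 2) for p in soft])  # [0.5, 0.5, 0.0]
print([round(p, 2) for p in sig])   # [0.99, 0.99, 0.01]
```

Softmax caps each of the two active classes at about one half, while the sigmoid scores both near 1.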

I've added activation='sigmoid' to all layers and defined my loss function as loss='categorical_crossentropy' in tflearn.regression. I still don't get normalized values, for example:

[[8.9157884e-06 9.783313E-01 8.3094416e-03 3.3070598e-02 4.0033931e-01]

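The sigmoid is only needed on the last layer. I'm not sure what you were using before, but the earlier layers don't necessarily need to change (not that using sigmoid in a fully connected network is necessarily wrong). The values are all in the range [0, 1], though, so they look correct. You got: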

[0.00,0.90,0.00,0.03,0.40]

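It seems to strongly predict the second class and to be somewhat unsure about the fifth. These values will not be normalized to sum to 1; each value lies independently in the range [0, 1] and can be roughly treated as a confidence for each class.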
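With independent per-class scores, picking topics becomes a per-class decision rather than a single argmax. A minimal sketch, using the rounded scores quoted above; the `select_topics` helper and the 0.5 default cutoff are assumptions, and in practice the cutoff would probably be tuned on held-out data rather than fixed by hand:

```python
categories = ["Apples", "Oranges", "Kiwis", "Potatoes", "Carrot"]


def select_topics(scores, categories, threshold=0.5):
    """Return (topic, score) pairs scoring at least threshold, best first."""
    picked = [(name, s) for name, s in zip(categories, scores) if s >= threshold]
    if not picked:
        # nothing cleared the cutoff: fall back to the single best class
        best = max(range(len(scores)), key=scores.__getitem__)
        picked = [(categories[best], scores[best])]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


scores = [0.00, 0.90, 0.00, 0.03, 0.40]

print(select_topics(scores, categories))       # [('Oranges', 0.9)]
print(select_topics(scores, categories, 0.3))  # [('Oranges', 0.9), ('Carrot', 0.4)]
```

The fallback keeps the old single-label behaviour for texts where no class is confident.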