Python: label Tokenizer not working, loss and accuracy cannot be computed

python tensorflow keras nlp

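I am using Keras with TensorFlow for NLP and am currently working with the IMDB reviews dataset. I want to use hub.KerasLayer and pass the actual x and y values directly to model.fit, with the sentences as x and the labels as y. My code: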

import csv
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

imdb_train=imdb['train']
imdb_test=imdb['test']

training_sentences=[]
training_labels=[]

test_sentences=[]
test_labels=[]

for a,b in imdb_train:
  training_sentences.append(a.numpy().decode("utf8"))
  training_labels.append(b.numpy())

for a,b in imdb_test:
  test_sentences.append(a.numpy().decode("utf8"))
  test_labels.append(b.numpy())

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[], 
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),optimizer='adam', metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])

Trying this:

history = model.fit(x=training_sentences,
                      y=training_labels,
                      validation_data=(test_sentences, test_labels),
                      epochs=2)

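does not work, because training_labels does not have the right shape/format. My current approach is to use the Tokenizer again, because texts_to_sequences gives me the result in the right format/shape. To do that, I first have to convert the labels to yes/no (or a/b, etc.) strings.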
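To make the effect of tokenizing labels concrete, here is a minimal pure-Python sketch that mimics the relevant behavior of the Keras Tokenizer: word IDs are ranked by frequency and start at 1 (0 is never assigned), and every input text becomes a list of IDs. The helpers fit_word_index and texts_to_sequences below are hypothetical stand-ins written for illustration, not the Keras implementation.

```python
from collections import Counter


def fit_word_index(texts):
    """Hypothetical stand-in for a frequency-ranked word index.

    The most frequent word gets ID 1; ID 0 is never assigned.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Turn each text into a list of word IDs (one list per text)."""
    return [[word_index[w] for w in text.lower().split()] for text in texts]


labels_as_text = ["no", "yes", "no", "yes", "no"]
word_index = fit_word_index(labels_as_text)
sequences = texts_to_sequences(labels_as_text, word_index)

print(word_index)  # {'no': 1, 'yes': 2}
print(sequences)   # [[1], [2], [1], [2], [1]]
```

Two things stand out: the labels are 1 and 2 rather than 0 and 1, and each label is wrapped in its own list. Which word receives ID 1 depends on label frequencies, so the mapping is not fixed in advance.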

Because I now have 1 and 2 as labels, I need to update my model:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(2))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

Then I tried to fit it:

history = model.fit(x=training_sentences,
                      y=test_labels_pad,
                      validation_data=(test_sentences, val_labels_pad),
                      epochs=2)

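The problem is that the loss comes out as nan and the accuracy is not computed correctly.

Where is the mistake?

Please note that my question is about this specific approach and why the Tokenizer does not work here. I know there are other approaches that work.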

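The problem seems to be twofold.

First, binary targets should always be [0,1], not [1,2]: with 2 output units, the only valid sparse labels are 0 and 1. So I subtracted 1 from your targets. Tokenizer() is not meant for encoding labels; you should use tfds.features.ClassLabel() for that. For now, I simply subtracted 1 in the fit() call: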

history = model.fit(x=training_sentences,
                      y=list(map(lambda x: x[0] - 1, test_labels_pad)),
                      validation_data=(test_sentences, 
                                       list(map(lambda x: x[0] - 1, val_labels_pad))),
                      epochs=1)
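The label conversion used in the fit() call above can be sketched in isolation. The snippet below is a minimal illustration in plain Python; shift_to_zero_based mirrors the lambda from the answer, while LABEL_TO_ID and encode_labels are hypothetical names for an explicit mapping that does not depend on which word the tokenizer happened to rank first.

```python
def shift_to_zero_based(sequences):
    """Flatten [[1], [2], ...] to [0, 1, ...] by taking the single ID and subtracting 1."""
    return [seq[0] - 1 for seq in sequences]


# Hypothetical explicit mapping: the class IDs are chosen by hand,
# so they cannot change with label frequencies.
LABEL_TO_ID = {"no": 0, "yes": 1}


def encode_labels(names):
    """Map label strings directly to zero-based integer class IDs."""
    return [LABEL_TO_ID[name] for name in names]


tokenizer_style = [[1], [2], [2], [1]]
print(shift_to_zero_based(tokenizer_style))          # [0, 1, 1, 0]
print(encode_labels(["no", "yes", "yes", "no"]))     # [0, 1, 1, 0]
```

Both routes produce flat, zero-based labels. The explicit mapping is arguably easier to reason about, since the subtraction approach only yields "no" = 0 when "no" received ID 1 from the tokenizer.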

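Second, for some reason your input layer was returning only nan. The documentation of the pretrained models says:

google/tf2-preview/gnews-swivel-20dim-with-oov/1 - same as google/tf2-preview/gnews-swivel-20dim/1, but with 2.5% of the vocabulary converted to OOV buckets. This can help if the vocabulary of the task and the vocabulary of the model do not fully overlap.

So you should use the with-oov variant, because the vocabulary of your dataset does not fully overlap with the data the model was trained on. With that change, your model will start learning.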

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True)
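For readers unfamiliar with the term, the idea behind OOV (out-of-vocabulary) buckets can be sketched as follows. This is only a toy illustration under stated assumptions: VOCAB, NUM_OOV_BUCKETS and token_to_id are hypothetical, and the hashing scheme is not taken from the TF Hub module, whose internals are not described in this thread.

```python
import zlib

# Toy vocabulary and bucket count, both hypothetical.
VOCAB = {"good": 0, "bad": 1, "movie": 2}
NUM_OOV_BUCKETS = 2


def token_to_id(token):
    """Return the vocabulary ID, or a hashed OOV bucket ID for unknown tokens."""
    if token in VOCAB:
        return VOCAB[token]
    # Unknown tokens are hashed into one of the extra buckets after the vocabulary,
    # so every token still maps to some embedding row.
    bucket = zlib.crc32(token.encode("utf-8")) % NUM_OOV_BUCKETS
    return len(VOCAB) + bucket


ids = [token_to_id(t) for t in "good movie zzzunknown".split()]
print(ids)  # the first two IDs are 0 and 2; the last one falls in an OOV bucket (3 or 4)
```

The point is that a token missing from the vocabulary still receives a valid ID instead of having no representation at all.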

Full working code:

import csv
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

imdb_train=imdb['train']
imdb_test=imdb['test']

training_sentences=[]
training_labels=[]

test_sentences=[]
test_labels=[]

for a,b in imdb_train:
  training_sentences.append(a.numpy().decode("utf8"))
  training_labels.append(b.numpy())

for a,b in imdb_test:
  test_sentences.append(a.numpy().decode("utf8"))
  test_labels.append(b.numpy())

training_labels_test = []
for i in training_labels:
    if i == 0: training_labels_test.append("no")
    if i == 1: training_labels_test.append("yes")

testtokenizer = Tokenizer()
testtokenizer.fit_on_texts(training_labels_test)
test_labels_pad = testtokenizer.texts_to_sequences(training_labels_test)

val_labels_test = []
for i in test_labels:
    if i == 0: val_labels_test.append("no")
    if i == 1: val_labels_test.append("yes")

testtokenizer.fit_on_texts(val_labels_test)
val_labels_pad = testtokenizer.texts_to_sequences(val_labels_test)

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(2))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

history = model.fit(x=training_sentences,
                      y=list(map(lambda x: x[0] - 1, test_labels_pad)),
                      validation_data=(test_sentences, 
                      list(map(lambda x: x[0] - 1, val_labels_pad))),
                      epochs=1)

model.predict(training_sentences)

See what happens if you have 3 classes and use [1,2,3] instead of [0,1,2]:

y_true = tf.constant([1, 2, 3])
y_pred = tf.constant([[0.05, 0.95, 0], [0.1, 0.8, 0.1], [.2, .4, .4]])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()

But it works with [0,1,2]:

y_true = tf.constant([0, 1, 2])
y_pred = tf.constant([[0.05, 0.95, 0], [0.1, 0.8, 0.1], [.2, .4, .4]])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()
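The reason for the difference is that a sparse label is used as a column index into the prediction row. The following is a minimal pure-Python sketch of that idea, not the Keras implementation: the function name matches the concept only, and raising ValueError for an out-of-range label is a choice made for this illustration (TensorFlow itself may raise an error or return nan depending on the device).

```python
import math


def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean of -log(p[label]) over all rows; each label must be a valid column index."""
    losses = []
    for label, probs in zip(y_true, y_pred):
        if not 0 <= label < len(probs):
            raise ValueError(f"label {label} is outside [0, {len(probs) - 1}]")
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


y_pred = [[0.05, 0.95, 0.0], [0.1, 0.8, 0.1], [0.2, 0.4, 0.4]]

# Zero-based labels index valid columns.
loss_ok = sparse_categorical_crossentropy([0, 1, 2], y_pred)
print(round(loss_ok, 4))  # 1.3784

# Label 3 does not exist in a 3-column prediction.
try:
    sparse_categorical_crossentropy([1, 2, 3], y_pred)
except ValueError as err:
    print("invalid labels:", err)

# One-based labels only become valid if the output has an extra (unused) column 0.
y_pred_4 = [[0.0] + row for row in y_pred]
loss_shifted = sparse_categorical_crossentropy([1, 2, 3], y_pred_4)
print(round(loss_shifted, 4))  # 1.3784
```

In other words, labels 1..K are only usable when the output layer has at least K+1 units. With K units, the labels have to run from 0 to K-1.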


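Thanks for the answer, +1. But why do you use model.add(tf.keras.layers.Dense(2))? You did subtract 1 from the labels, so we have 0 and 1; shouldn't the last layer then have 1 unit, with binary cross-entropy as the loss and binary accuracy as the metric? Also, why can't I use SparseCategorical with [1] and [2]? In the official Coursera TensorFlow course they do use a Tokenizer on the labels, so they have 5 classes (1, 2, 3, 4, 5), not just 2 (and 0 is missing there as well). So why is this "forbidden" with two classes? I am further confused because you do use SparseCategorical with 2 units in the last layer.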

The sparse_ prefix on a loss function or metric means that the targets are not one-hot encoded. This is explained here: [link missing]. As for your second question, I have to say I am puzzled. I am inclined to say they made a mistake. See what happens when you pass something other than [0,1,2] to a 3-class loss function (see the bottom of my answer).
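On the Dense(2) versus Dense(1) question raised in the comment: for 0/1 labels the two setups describe the same model family, because a two-unit softmax gives class 1 a probability equal to the sigmoid of the difference between the two logits. The sketch below checks this numerically in plain Python; the logit values are arbitrary and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z0, z1):
    """Numerically stable softmax over two logits."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


z0, z1 = -0.3, 1.2  # arbitrary example logits
label = 1           # zero-based class label

# Two output units + sparse categorical cross-entropy.
p0, p1 = softmax2(z0, z1)
sparse_ce = -math.log((p0, p1)[label])

# One output unit + binary cross-entropy, using the logit difference.
p_sigmoid = sigmoid(z1 - z0)
binary_ce = -(label * math.log(p_sigmoid) + (1 - label) * math.log(1 - p_sigmoid))

print(math.isclose(p1, p_sigmoid))        # True
print(math.isclose(sparse_ce, binary_ce))  # True
```

So either combination is a reasonable choice for binary labels, as long as the labels are 0 and 1 and the loss matches the number of output units.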
