Python: How to encode categorical data in TensorFlow for multi-class classification and avoid a shape mismatch
I have a dataframe with one column for the article body text and one column for topic labels. `topic` contains a list of labels:
>>> df.topic.head(5)
0 [ECONOMIC PERFORMANCE, ECONOMICS, EQUITY MARKE...
1 [CAPACITY/FACILITIES, CORPORATE/INDUSTRIAL]
2 [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE/IND...
3 [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE/IND...
4 [STRATEGY/PLANS, NEW PRODUCTS/SERVICES, CORPOR...
Name: topic, dtype: object
I first explode `topic` with `df = df.explode('topic').reset_index(drop=True)`, so that there is one label per row:
>>> df.head(5)
text topic
0 Emerging evidence that Mexico economy was back... ECONOMIC PERFORMANCE
1 Emerging evidence that Mexico economy was back... ECONOMICS
2 Emerging evidence that Mexico economy was back... EQUITY MARKETS
3 Emerging evidence that Mexico economy was back... BOND MARKETS
4 Emerging evidence that Mexico economy was back... MARKETS
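For reference, a minimal standalone sketch of the explode step above (toy data standing in for the real dataframe):

```python
import pandas as pd

# Toy frame mirroring the structure: one text, a list of topics
df = pd.DataFrame({
    'text': ['Emerging evidence that Mexico economy was back...'],
    'topic': [['ECONOMIC PERFORMANCE', 'ECONOMICS']],
})

# explode() repeats the row once per list element in 'topic'
df = df.explode('topic').reset_index(drop=True)
print(df)
```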
Currently, I am encoding the labels with:
from sklearn.preprocessing import LabelEncoder
# Encode labels
le = LabelEncoder()
df['topic_encoded'] = le.fit_transform(df['topic'])
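As a sanity check, `LabelEncoder` maps each distinct string to a single integer (classes sorted alphabetically) — a toy example:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the class list and returns one integer per input
codes = le.fit_transform(['MARKETS', 'ECONOMICS', 'MARKETS'])
print(codes)        # one integer code per row
print(le.classes_)  # the sorted class vocabulary
```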
Then I split the data with scikit-learn's train_test_split, tokenize it with DistilBERT, and convert it to TensorFlow datasets:
from sklearn.model_selection import train_test_split
import tensorflow as tf

# Split dataset into train, test, val (roughly 70, 15, 15)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15)
# Create new index
train_idx = [i for i in range(len(train.index))]
test_idx = [i for i in range(len(test.index))]
val_idx = [i for i in range(len(val.index))]
# Convert to numpy
x_train = train['text'].values[train_idx]
x_test = test['text'].values[test_idx]
x_val = val['text'].values[val_idx]
y_train = train['topic_encoded'].values[train_idx]
y_test = test['topic_encoded'].values[test_idx]
y_val = val['topic_encoded'].values[val_idx]
# Tokenize datasets
tr_tok = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tok = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tok = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)
# Convert dfs to tds
train_ds = tf.data.Dataset.from_tensor_slices((dict(tr_tok), y_train))
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_tok), y_val))
test_ds = tf.data.Dataset.from_tensor_slices((dict(test_tok), y_test))
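For what it's worth, a stripped-down sketch of what `from_tensor_slices` produces here (dummy token ids standing in for the real tokenizer output): each element is a single, unbatched example.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for the tokenizer output and encoded labels
tok = {'input_ids': np.zeros((4, 128), dtype=np.int32),
       'attention_mask': np.ones((4, 128), dtype=np.int32)}
y = np.array([0, 1, 2, 1])

ds = tf.data.Dataset.from_tensor_slices((dict(tok), y))
# Each element is one example: features of shape (128,), a scalar label
print(ds.element_spec)
```

Note that `from_tensor_slices` leaves the dataset unbatched; a `.batch(...)` call (e.g. `train_ds.batch(16)`) is typically needed before passing it to `fit`.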
My test model is very simple:
from transformers import TFDistilBertForSequenceClassification

# Set up model
# Recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# 1 epoch for illustration, though multiple epochs might be better as long as we don't overfit
number_of_epochs = 1
# Model initialization
model = TFDistilBertForSequenceClassification.from_pretrained(
'distilbert-base-uncased',
num_labels=df['topic_encoded'].unique().shape[0]
)
print(df['topic_encoded'].unique().shape[0])
# Optimizer Adam recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# We do not have one-hot vectors, so we can use sparse categorical cross entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
bert_history = model.fit(
train_ds,
epochs=number_of_epochs,
validation_data=test_ds)
Except I get a `ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (128, 91)).`
on the line
bert_history = model.fit(
train_ds,
epochs=number_of_epochs,
validation_data=test_ds)
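For context on the error message, `SparseCategoricalCrossentropy(from_logits=True)` expects logits shaped `(batch, num_classes)` and integer labels shaped `(batch,)` — a minimal standalone check with toy shapes (91 classes, as in the traceback):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Batched case: 2 examples, 91 classes -> a finite scalar loss
logits = tf.zeros((2, 91))
labels = tf.constant([3, 7])
print(float(loss_fn(labels, logits)))
```

If the datasets passed to `fit` are unbatched (as `from_tensor_slices` leaves them), Keras feeds single examples rather than batches, which is one possible source of shape errors like this; calling e.g. `train_ds.batch(16)` before `fit` is worth checking.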
I assume this has to do with how the data is encoded, but I'm not sure how to resolve it. Should the labels be one-hot encoded instead of label encoded? Any help would be great.