Python: How to get preproc for BERT


python keras


I am working with the Stack Overflow tag classification CSV dataset, which I have loaded into a DataFrame:

from sklearn.model_selection import train_test_split

X = df.post   # post text
y = df.tags   # tag labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Besides a few other classification models, I also want to run BERT, but it requires a variable called preproc. I am not sure which function produces it:

import ktrain
from ktrain import text
# preproc is not defined yet; obtaining it is what this question asks about
model = text.text_classifier('bert', (X_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(X_train, y_train), val_data=(X_test, y_test), batch_size=6)

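In some documentation I have seen people use text.texts_from_folder(), but I already have everything in a DataFrame. Is there another function in text that can do the preprocessing for me?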

I could not find one either, so I wrote a function that splits the CSV into txt files:

import os
from joblib import Parallel, delayed
from tqdm.notebook import tqdm

threads = 12
path = os.getcwd()
train_path = os.path.join(path, 'train_df')
test_path = os.path.join(path, 'test_df')

# train_df is assumed to be a DataFrame with 'text', 'id' and 'class' columns
train_len = range(len(train_df['text']))
texts = train_df['text'].tolist()
ids = train_df['id'].tolist()
classes = train_df['class'].tolist()

def create_directory(directory):
    # Create the directory (and any missing parents); do nothing if it already exists
    os.makedirs(directory, exist_ok=True)

def write_txt(text_, id_, class_, path, i):
    # Write one text to <path>/<id>/<class>_<i>.txt
    cur_path = os.path.join(path, str(id_))
    create_directory(cur_path)
    with open(os.path.join(cur_path, f'{class_}_{i}.txt'), 'w', encoding='utf-8') as f:
        f.write(text_)

# Write the training texts in parallel under train_path
Parallel(n_jobs=threads)(delayed(write_txt)(texts[i], ids[i], classes[i], train_path, i) for i in tqdm(train_len))

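For a complete list of the available preprocessing functions, see the ktrain documentation. In your case, for example, you can use texts_from_df or texts_from_array. These functions preprocess the text documents in the way the model expects. For an example of using texts_from_df, see the ktrain tutorials and examples. Alternatively, in ktrain you can use …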
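As a rough sketch of the texts_from_array route applied to the split from the question, something like the following might work. The helper names to_ktrain_arrays and build_bert_learner are hypothetical, and maxlen=128 is an assumed value. The sketch also assumes a ktrain version in which texts_from_array accepts preprocess_mode='bert'. Depending on the version, string labels may need to be integer-encoded first.

```python
def to_ktrain_arrays(texts, labels):
    """Convert pandas Series (or any iterables) into plain Python lists.

    Hypothetical helper: plain lists of strings are the safest input
    for ktrain's texts_from_array.
    """
    return [str(t) for t in texts], list(labels)


def build_bert_learner(X_train, y_train, X_test, y_test, maxlen=128, batch_size=6):
    """Preprocess the texts for BERT and return (learner, preproc)."""
    # Imported lazily so the helper above works without ktrain installed.
    import ktrain
    from ktrain import text

    x_tr, y_tr = to_ktrain_arrays(X_train, y_train)
    x_te, y_te = to_ktrain_arrays(X_test, y_test)

    # preprocess_mode='bert' prepares the texts the way the BERT model expects.
    # The third return value is the preproc object the question asks about.
    trn, val, preproc = text.texts_from_array(
        x_train=x_tr, y_train=y_tr,
        x_test=x_te, y_test=y_te,
        maxlen=maxlen,  # assumed value; tune it to the length of your posts
        preprocess_mode='bert',
    )

    # Train on the preprocessed data, not on the raw Series.
    model = text.text_classifier('bert', train_data=trn, preproc=preproc)
    learner = ktrain.get_learner(
        model, train_data=trn, val_data=val, batch_size=batch_size
    )
    return learner, preproc
```

With the variables from the question, the call would be learner, preproc = build_bert_learner(X_train, y_train, X_test, y_test). The model and learner receive the preprocessed trn and val data, not the raw X_train and y_train.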