Python: How to define different text-processing logic for different inputs with CatBoostClassifier
I want to build a CatBoost classifier that uses multiple text inputs. Each input needs a different kind of vectorization logic (some word-level, some character-level). I have mainly been following the CatBoost documentation. I managed to define one common custom "text pipeline" for all fields:
import catboost
import pandas

df = pandas.DataFrame({
    'a': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'b': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'y': [0, 0, 1, 1]
})

text_processing_options = {
    # Split on spaces and lowercase every token.
    "tokenizers": [{
        "tokenizer_id": "tok",
        "delimiter": " ",
        "lowercasing": "true"
    }],
    # Two dictionaries: word unigrams and character trigrams.
    "dictionaries": [{
        "dictionary_id": "word_uni",
        "gram_order": "1",
        "occurrence_lower_bound": "1"
    }, {
        "dictionary_id": "char_ngram",
        "gram_order": "3",
        "token_level_type": 'Letter',
        "occurrence_lower_bound": "1"
    }],
    # "default" applies the same pipeline to every text feature.
    "feature_processing": {
        "default": [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": ["word_uni"],
            "feature_calcers": ["BoW"]
        }],
    }
}

model = catboost.CatBoostClassifier(text_processing=text_processing_options, iterations=10)
model.fit(df[['a', 'b']], df['y'], text_features=['a', 'b'])
But trying different logic for each column (based on my understanding of the API) fails:
import catboost
import pandas

df = pandas.DataFrame({
    'a': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'b': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'y': [0, 0, 1, 1]
})

text_processing_options = {
    "tokenizers": [{
        "tokenizer_id": "tok",
        "delimiter": " ",
        "lowercasing": "true"
    }],
    "dictionaries": [{
        "dictionary_id": "word_uni",
        "gram_order": "1",
        "occurrence_lower_bound": "1"
    }, {
        "dictionary_id": "char_ngram",
        "gram_order": "3",
        "token_level_type": 'Letter',
        "occurrence_lower_bound": "1"
    }],
    # Per-column pipelines keyed by column name: this is the variant that fails.
    "feature_processing": {
        "a": [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": ["word_uni"],
            "feature_calcers": ["BoW"]
        }],
        "b": [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": ["char_ngram"],
            "feature_calcers": ["BoW"]
        }],
    }
}

model = catboost.CatBoostClassifier(text_processing=text_processing_options, iterations=10)
model.fit(df[['a', 'b']], df['y'], text_features=['a', 'b'])
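One thing that may be worth trying, as a sketch rather than a confirmed fix: as far as I understand the CatBoost documentation, the keys of "feature_processing" are either "default" or a text feature's index written as a string (for example "0", "1"), not the column name. The helper below, per_feature_processing, is a hypothetical name of my own and not part of CatBoost; it only rewrites a name-keyed mapping into an index-keyed one. It assumes the index is the feature's position in the list passed as text_features, which here coincides with its position in the DataFrame.

```python
import json


def per_feature_processing(text_columns, pipelines, default=None):
    """Build a feature_processing dict keyed by text-feature index.

    text_columns: column names in the order passed as text_features.
    pipelines:    dict mapping column name -> list of processing units.
    default:      optional list of units used for columns not in pipelines.
    """
    feature_processing = {}
    if default is not None:
        feature_processing["default"] = default
    for idx, col in enumerate(text_columns):
        if col in pipelines:
            # Assumption: CatBoost expects the index as a string key.
            feature_processing[str(idx)] = pipelines[col]
    return feature_processing


text_columns = ["a", "b"]
pipelines = {
    "a": [{
        "tokenizers_names": ["tok"],
        "dictionaries_names": ["word_uni"],
        "feature_calcers": ["BoW"],
    }],
    "b": [{
        "tokenizers_names": ["tok"],
        "dictionaries_names": ["char_ngram"],
        "feature_calcers": ["BoW"],
    }],
}

feature_processing = per_feature_processing(text_columns, pipelines)
print(json.dumps(feature_processing, indent=2))
```

The result would replace the "feature_processing" entry of text_processing_options before the dict is passed to catboost.CatBoostClassifier(text_processing=...). If the model has other, non-text columns, it is worth checking which index CatBoost actually expects.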
Another question is whether it is possible to get or log the processed inputs for troubleshooting.
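For eyeballing what the two pipelines are meant to produce, a plain-Python stand-in can help. To be clear, this is not CatBoost's internal output and makes no claim to match it exactly; it only mimics the settings above (space delimiter, lowercasing, word unigrams, letter trigrams) under the assumption that letter n-grams are taken within each token. The function names tokenize and char_ngrams are my own.

```python
def tokenize(text, delimiter=" ", lowercasing=True):
    """Split text on the delimiter, optionally lowercasing first."""
    if lowercasing:
        text = text.lower()
    # Drop empty strings produced by repeated delimiters.
    return [tok for tok in text.split(delimiter) if tok]


def char_ngrams(token, n=3):
    """Return all contiguous character n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


samples = ['Hello World', 'This is a test', 'foo', 'fooo']
for text in samples:
    tokens = tokenize(text)                       # word unigrams
    trigrams = [g for tok in tokens for g in char_ngrams(tok)]
    print(repr(text), tokens, trigrams)
```

Tokens shorter than three characters (such as "is" or "a") yield no trigrams here, which is one thing to keep in mind when a character-level dictionary looks unexpectedly sparse.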