Python: How to define different text-processing logic for different inputs with CatBoostClassifier

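I want to build a CatBoost classifier that takes several text inputs. Each input needs a different kind of vectorization logic (some word-level, some character-level).

I have been working from the main documentation for this feature.

I managed to define one uniform custom "text pipeline" that applies to all fields: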

import catboost
import pandas

# Toy data: two text columns and a binary target.
df = pandas.DataFrame({
    'a': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'b': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'y': [0, 0, 1, 1]
})

text_processing_options = {
    # One tokenizer: lowercase, then split on spaces.
    "tokenizers": [{
        "tokenizer_id": "tok",
        "delimiter": " ",
        "lowercasing": "true"
    }],

    # A single word-level unigram dictionary.
    "dictionaries": [{
        "dictionary_id": "word_uni",
        "gram_order": "1",
        "occurrence_lower_bound": "1"
    }],

    # "default" applies the same pipeline to every text feature.
    "feature_processing": {
        "default": [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": ["word_uni"],
            "feature_calcers": ["BoW"]
        }]
    }
}
model = catboost.CatBoostClassifier(text_processing=text_processing_options, iterations=10)
model.fit(df[['a', 'b']], df['y'], text_features=['a', 'b'])

But trying a different logic for each column (based on my understanding of the API) failed:

import catboost
import pandas

# Same toy data as above.
df = pandas.DataFrame({
    'a': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'b': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'y': [0, 0, 1, 1]
})

text_processing_options = {
    "tokenizers": [{
        "tokenizer_id": "tok",
        "delimiter": " ",
        "lowercasing": "true"
    }],

    # Two dictionaries: word unigrams and character trigrams.
    "dictionaries": [{
        "dictionary_id": "word_uni",
        "gram_order": "1",
        "occurrence_lower_bound": "1"
    }, {
        "dictionary_id": "char_ngram",
        "gram_order": "3",
        "token_level_type": "Letter",
        "occurrence_lower_bound": "1"
    }],

    # Attempt: key each pipeline by column name.
    "feature_processing": {
        "a": [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": ["word_uni"],
            "feature_calcers": ["BoW"]
        }],
        "b": [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": ["char_ngram"],
            "feature_calcers": ["BoW"]
        }],
    }
}
model = catboost.CatBoostClassifier(text_processing=text_processing_options, iterations=10)
model.fit(df[['a', 'b']], df['y'], text_features=['a', 'b'])  # this call fails
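
One variant that may be worth trying for the per-column case: my reading of the documentation is that the keys of "feature_processing" are either "default" or the position of the text feature written as a string ("0", "1", ...), not the DataFrame column name. That is an assumption I have not verified, and I am also not sure whether the position counts only text features or all columns (here both columns are text, so the two coincide). The sketch below only builds the options dict; `keyed_by_index` and `fit` are hypothetical helper names, not part of CatBoost.

```python
# Assumption (unverified): "feature_processing" keys are text-feature
# positions as strings ("0", "1", ...) or "default", not column names.

TOKENIZERS = [{"tokenizer_id": "tok", "delimiter": " ", "lowercasing": "true"}]
DICTIONARIES = [
    {"dictionary_id": "word_uni", "gram_order": "1",
     "occurrence_lower_bound": "1"},
    {"dictionary_id": "char_ngram", "gram_order": "3",
     "token_level_type": "Letter", "occurrence_lower_bound": "1"},
]


def keyed_by_index(column_to_dictionary, text_features):
    """Hypothetical helper: turn column names into positional string keys."""
    feature_processing = {}
    for column, dictionary_id in column_to_dictionary.items():
        index = text_features.index(column)  # position within text_features
        feature_processing[str(index)] = [{
            "tokenizers_names": ["tok"],
            "dictionaries_names": [dictionary_id],
            "feature_calcers": ["BoW"],
        }]
    return {
        "tokenizers": TOKENIZERS,
        "dictionaries": DICTIONARIES,
        "feature_processing": feature_processing,
    }


text_features = ["a", "b"]
options = keyed_by_index({"a": "word_uni", "b": "char_ngram"}, text_features)


def fit(df):
    """Train on a DataFrame shaped like the one above (columns a, b, y)."""
    import catboost  # imported lazily so the dict can be inspected without it
    model = catboost.CatBoostClassifier(text_processing=options, iterations=10)
    model.fit(df[text_features], df["y"], text_features=text_features)
    return model
```

Printing `options["feature_processing"]` before calling `fit(df)` shows exactly which pipeline is attached to which position, which makes a mismatch between columns and keys easy to spot.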

A second question: is it possible to get or log the processed input for troubleshooting?
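
Independently of whether CatBoost exposes such a hook, a rough preview of what each pipeline should see can be produced in plain Python. The sketch below is only an approximation for sanity-checking the configuration: `tokenize` and `char_ngrams` are hypothetical helpers, they do not call CatBoost, and I have not confirmed that CatBoost builds letter n-grams within each token the way this does.

```python
def tokenize(text, delimiter=" ", lowercasing=True):
    """Approximate the 'tok' tokenizer: optional lowercasing, then split."""
    if lowercasing:
        text = text.lower()
    # Drop empty strings produced by repeated delimiters.
    return [token for token in text.split(delimiter) if token]


def char_ngrams(tokens, n=3):
    """Character n-grams taken inside each token (an assumption)."""
    grams = []
    for token in tokens:
        # Tokens shorter than n contribute nothing.
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


for text in ["Hello World", "This is a test", "foo", "fooo"]:
    tokens = tokenize(text)
    print(repr(text), "words:", tokens, "char 3-grams:", char_ngrams(tokens))
```

Comparing this printout for columns `a` and `b` gives a quick idea of whether the word-level and character-level dictionaries would end up with the vocabulary I expect.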