Python: How to define different text-processing logic for different inputs with CatBoostClassifier


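I want to build a CatBoost classifier that takes several text inputs. Each input needs a different kind of vectorization logic (some word-level, some character-level).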

The main documentation I have been trying to follow is:

I managed to define one common custom "text pipeline" for all fields:

import catboost
import pandas

df = pandas.DataFrame({
    'a': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'b': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'y': [0,0,1,1]
})

text_processing_options ={
    "tokenizers" : [{
        "tokenizer_id" : "tok",
        "delimiter" : " ",
        "lowercasing" : "true"
    }],

    "dictionaries" : [{
        "dictionary_id" : "word_uni",
        "gram_order" : "1",
        "occurrence_lower_bound":"1"
    }, {
        "dictionary_id" : "char_ngram",
        "gram_order" : "3",
        "token_level_type": 'Letter',
        "occurrence_lower_bound": "1"
    }],

    "feature_processing" : {
        "a" : [{
            "tokenizers_names" : ["tok"],
            "dictionaries_names" : ["word_uni"],
            "feature_calcers" : ["BoW"]
        }],
        "b" : [{
            "tokenizers_names" : ["tok"],
            "dictionaries_names" : ["char_ngram"],
            "feature_calcers" : ["BoW"]
        }],
    }
}
model = catboost.CatBoostClassifier(text_processing=text_processing_options, iterations=10)
model.fit(df[['a','b']], df['y'], text_features=['a', 'b'])
But trying different logic for each column (based on my understanding of the API) failed:

import catboost
import pandas

df = pandas.DataFrame({
    'a': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'b': ['Hello World', 'This is a test', 'foo', 'fooo'],
    'y': [0,0,1,1]
})

text_processing_options ={
    "tokenizers" : [{
        "tokenizer_id" : "tok",
        "delimiter" : " ",
        "lowercasing" : "true"
    }],

    "dictionaries" : [{
        "dictionary_id" : "word_uni",
        "gram_order" : "1",
        "occurrence_lower_bound":"1"
    }, {
        "dictionary_id" : "char_ngram",
        "gram_order" : "3",
        "token_level_type": 'Letter',
        "occurrence_lower_bound": "1"
    }],

    "feature_processing" : {
        "a" : [{
            "tokenizers_names" : ["tok"],
            "dictionaries_names" : ["word_uni"],
            "feature_calcers" : ["BoW"]
        }],
        "b" : [{
            "tokenizers_names" : ["tok"],
            "dictionaries_names" : ["char_ngram"],
            "feature_calcers" : ["BoW"]
        }],
    }
}
model = catboost.CatBoostClassifier(text_processing=text_processing_options, iterations=10)
model.fit(df[['a','b']], df['y'], text_features=['a', 'b'])
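
One thing that may be worth checking, offered as an assumption rather than a confirmed fix: the examples in the CatBoost documentation key `feature_processing` by `"default"` or by a feature index written as a string (such as `"0"` or `"1"`), not by column name. The sketch below re-keys a name-based mapping into index strings. The helper `to_index_keys` is hypothetical and not part of CatBoost, and it assumes that indices follow the order of `text_features`, which should be verified for the CatBoost version in use.

```python
def to_index_keys(feature_processing, text_features):
    """Re-key a feature_processing mapping from column names to index strings.

    Assumption (not confirmed by the question): indices are positions
    within the text_features list, written as strings.
    """
    position = {name: str(i) for i, name in enumerate(text_features)}
    rekeyed = {}
    for key, steps in feature_processing.items():
        # "default" and keys that are not known column names pass through unchanged.
        rekeyed[position.get(key, key)] = steps
    return rekeyed


by_name = {
    "a": [{
        "tokenizers_names": ["tok"],
        "dictionaries_names": ["word_uni"],
        "feature_calcers": ["BoW"],
    }],
    "b": [{
        "tokenizers_names": ["tok"],
        "dictionaries_names": ["char_ngram"],
        "feature_calcers": ["BoW"],
    }],
}

by_index = to_index_keys(by_name, text_features=["a", "b"])
print(list(by_index))
```

The result could then replace the `"feature_processing"` entry of `text_processing_options` before the options are passed to `CatBoostClassifier`.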
A related question: is it possible to get or log the processed input for troubleshooting?
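
For a rough sanity check outside CatBoost, the intended word-level and character-level processing can be approximated in plain Python. This is only a sketch: `word_tokens`, `char_ngrams`, and `preview` are hypothetical helpers, they do not call CatBoost, and CatBoost's own tokenizer and dictionaries may differ in detail (for example in how spaces are handled in character n-grams).

```python
def word_tokens(text, delimiter=" "):
    """Lowercase the text and split on the delimiter (word-level tokens)."""
    return [tok for tok in text.lower().split(delimiter) if tok]


def char_ngrams(text, n=3):
    """Lowercase the text and slide an n-character window over the whole string."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


# Same sample data as in the question, as plain lists.
columns = {
    "a": ["Hello World", "This is a test", "foo", "fooo"],
    "b": ["Hello World", "This is a test", "foo", "fooo"],
}

# Intended logic per column: word unigrams for 'a', character trigrams for 'b'.
plan = {"a": word_tokens, "b": char_ngrams}


def preview(columns, plan):
    """Apply each column's tokenization function to every value in that column."""
    return {name: [plan[name](value) for value in values]
            for name, values in columns.items()}


for name, rows in preview(columns, plan).items():
    for row in rows:
        print(name, row)
```

Comparing this preview against the model's behavior can at least show whether the two columns are meant to produce different token sets.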