Plotting the decision boundary of a scikit-learn tf-idf binary logistic regression classifier in Python

Suppose we have a very simple logistic regression model:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.feature_extraction.text import Splitter
import pandas as pd
import numpy as np
data_col1 = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism'],
    remove=('headers', 'footers', 'quotes')
)
data_col2 = fetch_20newsgroups(
    subset='train',
    categories=['sci.space'],
    remove=('headers', 'footers', 'quotes')
)

data = pd.DataFrame({
    "col1": data_col1.data[:100],
    "col2": data_col2.data[:100]
})
labels = np.random.randint(2, size=100)

train_data, test_data, train_labels, test_labels = train_test_split(
    data,
    labels,
    test_size=0.1,
    random_state=0,
    shuffle=False
)

def title_features_pipeline():
    return Pipeline([
        ('features', TfidfVectorizer(
            analyzer='word',
            stop_words='english',
            use_idf=True,
            # max_df=0.1,
            min_df=0.01,
            norm=None,
            tokenizer=Splitter()
        )),
    ], verbose=True)

pipeline = Pipeline([
    ('features', ColumnTransformer(
        transformers=[
            ('col1-features', title_features_pipeline(), "col1"),
            ('col2-features', "drop", "col2")
        ],
        remainder="drop",
    )),
    ('regression', LogisticRegression(
        multi_class='ovr',
        max_iter=1000
    ))
], verbose=True)
pipeline.fit(train_data, train_labels)
pred = pipeline.predict(test_data)
print('ROC AUC = {:.3f}'.format(roc_auc_score(test_labels, pred)))
I've spent a great deal of time going through Stack Overflow examples and GitHub snippets, but I can't find anything that applies to my particular case, which is driving me crazy. I'm sure I'm just doing something wrong.

My goal is to plot the decision boundary for this LogisticRegression classifier: to see which class each document belongs to, and the boundary separating the two classes on the plot.
Along the way, I'd like to understand what exactly LogisticRegression does with the vectors coming out of TfidfVectorizer. Every example I've seen so far plots the decision boundary under the assumption that only simple scalars go into the classifier, but here we have long tf-idf vectors... I don't understand how a vector gets turned into the single value represented on the plot (is it the sum of all the scores in the vector? Or something else?).

Answer: Logistic regression learns one scalar weight per term in the TfidfVectorizer vocabulary. A vector is converted to a score by multiplying each weight by the corresponding tf-idf value and summing them all up.
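That arithmetic can be checked directly against sklearn. The sketch below uses a tiny illustrative corpus (the document names and texts are made up, not from the pipeline above) and verifies that the weight-times-tf-idf sum matches `decision_function`:

```python
# Sketch: how LogisticRegression turns one tf-idf vector into a single
# score. Toy corpus for illustration only; the same arithmetic applies
# to the fitted pipeline's 'regression' step.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["space rocket launch", "god faith belief",
        "rocket engine test flight", "faith belief church"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)   # (n_docs, n_terms), sparse

clf = LogisticRegression().fit(X, labels)

w = clf.coef_.ravel()        # one learned weight per vocabulary term
b = clf.intercept_[0]        # plus a single intercept

# Score of document 0: multiply each tf-idf value by its weight,
# sum them all, and add the intercept.
manual_score = X[0].toarray().ravel() @ w + b
assert np.isclose(manual_score, clf.decision_function(X[0])[0])
```

The sign of that score decides the predicted class: positive means class 1, negative means class 0, and score = 0 is the decision boundary.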
Comment: Plotting a decision boundary is usually done in two or three dimensions. With a text classifier that may have hundreds of dimensions, it's not clear what plotting a decision boundary would even mean for you. — Thank you, that makes sense. So what I'm after is extracting exactly this from the logistic regression:
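One common workaround for the dimensionality problem is to project the sparse tf-idf matrix down to two dimensions and fit a second logistic regression in that space purely for visualisation. A sketch, using an illustrative toy corpus (all names here are assumptions, and the 2-D boundary only approximates the true high-dimensional one):

```python
# Sketch: reduce tf-idf vectors to 2-D with TruncatedSVD (which, unlike
# PCA, accepts sparse input) and fit a visualisation-only classifier.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["space rocket launch", "god faith belief",
        "rocket engine test flight", "faith belief church",
        "nasa rocket mission", "atheism belief debate"]
labels = np.array([1, 0, 1, 0, 1, 0])

X = TfidfVectorizer().fit_transform(docs)           # sparse, many terms
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

clf2 = LogisticRegression().fit(X2, labels)

# In 2-D the boundary is the straight line w0*x + w1*y + b = 0,
# i.e. y = -(w0*x + b) / w1, which can be drawn over a scatter of X2.
(w0, w1), b = clf2.coef_.ravel(), clf2.intercept_[0]
xs = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50)
boundary_ys = -(w0 * xs + b) / w1
```

The line `(xs, boundary_ys)` can then be drawn over a scatter plot of `X2` coloured by label.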
The vector is converted to a score by multiplying the weights by the tf-idf values and summing them all up. And as you said, plotting a decision boundary is usually done in two or three dimensions. If I can reproduce what logistic regression does to the vectors, I can easily plot them.
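Since the model collapses every document to a single score w·x + b, the simplest faithful plot is one-dimensional: put each document on a number line at its score, with the boundary at score = 0. A sketch with an illustrative toy corpus (with the original pipeline you would call `pipeline.decision_function(test_data)` instead, since sklearn Pipelines expose the final estimator's `decision_function`):

```python
# Sketch: plot documents by their scalar decision score; the decision
# boundary in score space is simply the vertical line score = 0.
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs in scripts
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["space rocket launch", "god faith belief",
        "rocket engine test flight", "faith belief church"]
labels = np.array([1, 0, 1, 0])

X = TfidfVectorizer().fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

scores = clf.decision_function(X)               # one scalar per document
plt.scatter(scores, np.zeros_like(scores), c=labels, cmap="bwr")
plt.axvline(0.0, linestyle="--", color="grey")  # boundary: w·x + b = 0
plt.xlabel("decision score (w·x + b)")
plt.yticks([])
plt.savefig("decision_scores.png")
```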