Python Sklearn logistic regression shape error, but x and y shapes are consistent

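I get ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when running the code below, even though x and y have the correct number of rows. I load the RCV1 dataset, get the indices of the categories with the most documents, build for each category a list of tuples with equal numbers of randomly selected positives and negatives, and finally try to run logistic regression on one of the categories.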

import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse

rcv1 = sklearn.datasets.fetch_rcv1()

def get_top_cat_indices(target_matrix, num_cats):
    # Column sums of a sparse matrix come back as a 2-D np.matrix;
    # convert to a flat 1-D array so argsort and [::-1] act on the counts.
    cat_counts = np.asarray(target_matrix.sum(axis=0)).ravel()

    # Indices of the categories, most frequent first.
    ind_sorted = np.argsort(cat_counts)[::-1]

    return ind_sorted[:num_cats].tolist()

def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []

    for i in top_cat_indices:

        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]

        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1)>0)[0],:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1)==0)[0],:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids

        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))

        sampled_y_pos = temp.tocsr()[idx_cat,:]
        sampled_y_neg = temp.tocsr()[idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))

        res_lst.append((sampled_x, sampled_y))

    return res_lst

train_x, train_y = rcv1.data, rcv1.target

ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)

x, y = test_res[0]
print(x.shape)
print(y.shape)
LogisticRegression().fit(x, y)

Could this be a problem with sparse matrices, or with dimensionality (there are 20K samples and 47K features)?

When I ran your code, I got the following error:

AttributeError: 'bool' object has no attribute 'any'

This is because LogisticRegression expects y to be a numpy array. So I changed the last line to:

LogisticRegression().fit(x, y.A.flatten())
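Then I got the following error:

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

This is because the sampling code has a bug. idx_cat and idx_nocat are positions within cat_present and cat_notpresent, so before applying them to y you need to subset y to the rows that have (or lack) the category. See the code below: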

def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []

    for i in top_cat_indices:

        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]

        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1)>0)[0]
        c2 = np.where(temp.sum(axis=1)==0)[0]
        cat_present = x.tocsr()[c1,:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2,:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids

        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))

        sampled_y_pos = temp.tocsr()[c1][idx_cat,:]
        print(sampled_y_pos.nnz)
        sampled_y_neg = temp.tocsr()[c2][idx_nocat,:]
        print(sampled_y_neg.nnz)

        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))

        res_lst.append((sampled_x, sampled_y))

    return res_lst

Now everything works like a charm.
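
The same idea can be shown on a toy example with NumPy alone. This is only a sketch under assumptions: the label vector is made up, and the names sample_balanced_rows, pos_rows, neg_rows and rows are illustrative and not from the code above. It maps the sampled positions back to row numbers of the full data once, so the same row numbers can index both x and y.

```python
import numpy as np

def sample_balanced_rows(labels, sample_size, rng):
    """Return row numbers (into the full data) for a balanced sample.

    labels: 1-D array of 0/1 values, one per document.
    Sampling is with replacement, as with np.random.randint above.
    """
    labels = np.asarray(labels)
    pos_rows = np.flatnonzero(labels > 0)   # rows that have the category
    neg_rows = np.flatnonzero(labels == 0)  # rows that do not
    half = sample_size // 2
    # Positions within each subset ...
    idx_pos = rng.integers(pos_rows.shape[0], size=half)
    idx_neg = rng.integers(neg_rows.shape[0], size=half)
    # ... mapped back to row numbers of the full data.
    return np.concatenate((pos_rows[idx_pos], neg_rows[idx_neg]))

rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # made-up toy labels
rows = sample_balanced_rows(labels, 6, rng)

# Correct: index the labels with row numbers of the full data.
sampled_y = labels[rows]

# Buggy pattern from the question: subset positions used as row numbers.
idx_pos = rng.integers(np.count_nonzero(labels), size=3)
buggy_y_pos = labels[idx_pos]

print(sampled_y)    # first half all 1, second half all 0
print(buggy_y_pos)  # all 0 here, because rows 0..2 hold no positives
```

With the sparse matrices in this thread, an array like rows could presumably be used for both x.tocsr()[rows, :] and temp.tocsr()[rows, :], which keeps features and labels aligned by construction.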

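Hi, what are train_x and train_y in this example?
They are train_x, train_y = rcv1.data, rcv1.target
Looks great. Do we really need flatten when fitting the logistic regression? It doesn't seem intuitive, and I don't follow why.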
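
On the flatten question: sampled_y is a sparse matrix of shape (n, 1), while scikit-learn's fit expects y as a 1-D array with one label per sample. A small sketch of the conversion, using a made-up four-row column (y_sparse, y_dense and y_1d are illustrative names). Here .toarray() is the explicit form of .A, and .ravel() does the same job as .flatten().

```python
import numpy as np
from scipy import sparse

# Made-up label column shaped like sampled_y: sparse, shape (4, 1).
y_sparse = sparse.csr_matrix(np.array([[1], [0], [1], [0]]))
print(y_sparse.shape)   # (4, 1)

# Densify: still 2-D, one column.
y_dense = y_sparse.toarray()
print(y_dense.shape)    # (4, 1)

# Flatten to 1-D: one label per sample, the layout fit() expects for y.
y_1d = y_dense.ravel()
print(y_1d.shape)       # (4,)
print(y_1d.tolist())    # [1, 0, 1, 0]
```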