Python: scipy sparse matrix to numpy array in a classification task

python numpy scikit-learn

I have X_train data (class 'pandas.core.series.Series') with the following content:

print(X_train)

0       WASHINGTON  —   Congressional Republicans have...
1       After the bullet shells get counted, the blood...
2       When Walt Disney’s “Bambi” opened in 1942, cri...
3       Death may be the great equalizer, but it isn’t...
4       SEOUL, South Korea  —   North Korea’s leader, ...

Then I want to prepare the data for classification:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Now X_train_tfidf and X_train_counts are both of class 'scipy.sparse.csr.csr_matrix'.

However, my logistic regression implementation only works with numpy arrays. What should I do to fix this?

import numpy as np

class LogisticRegression2:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, theta=0, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.theta = theta
        self.verbose = verbose

    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
        #return .5 * (1 + np.tanh(.5 * z))

    def __loss(self, h, y):
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)

        # weights initialization
        self.theta = np.zeros(X.shape[1])

        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient

            if self.verbose and i % 10000 == 0:
                z = np.dot(X, self.theta)
                h = self.__sigmoid(z)
                print('loss: ', self.__loss(h, y))

    def predict_prob(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)

        return self.__sigmoid(np.dot(X, self.theta))

    def predict(self, X, threshold=0.5):
        return self.predict_prob(X) >= threshold

If I use

X_train_dense = X_train_tfidf.toarray()

model = LogisticRegression2(lr=0.1, num_iter=100)
model.fit(X_train_dense, y_train)
preds = model.predict(X_train_dense)

I get TypeError: unsupported operand type(s) for -: 'float' and 'str'
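A note on this error: the message suggests that y_train holds string labels, so the subtraction h - y inside fit fails whether or not X is dense. That is an assumption, since the question never shows y_train. A minimal sketch of mapping string labels to 0/1 before fitting, with made-up label values:

```python
import numpy as np

# Hypothetical string labels standing in for y_train (not shown in the question).
y_raw = np.array(["fake", "real", "real", "fake"])

# np.unique returns the sorted distinct labels; with return_inverse=True it also
# returns, for each element, the index of its label in that sorted list.
classes, y_encoded = np.unique(y_raw, return_inverse=True)
y_encoded = y_encoded.astype(float)

# With numeric labels the residual used in the gradient can be computed.
h = np.array([0.2, 0.9, 0.6, 0.4])
residual = h - y_encoded

print(classes)
print(y_encoded)
print(residual)
```

Here classes keeps the original label names, so predictions can be mapped back to strings afterwards.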

If I try

def __add_intercept(self, X):
    intercept = np.ones((X.shape[0], 1))
    return hstack((intercept, X))

I get a MemoryError.

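X_train_dense = X_train_tfidf.toarray()

toarray is the sparse-matrix method for creating a dense array. concatenate does not perform that conversion, and np.array(X_train_tfidf) is wrong as well. The scipy.sparse package documents the csr matrix and its methods.

I tried

X_train_dense = X_train_tfidf.toarray()
model = LogisticRegression2(lr=0.1, num_iter=100)
model.fit(X_train_dense, y_train)
preds = model.predict(X_train_dense)

and I get a MemoryError.

Memory errors are common when you try to build a dense array from a sparse one. That is why the code produces a sparse matrix in the first place.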
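To make these two points concrete, here is a small sketch. It uses a tiny hand-built sparse matrix in place of X_train_tfidf, and the helper dense_nbytes and the corpus shape at the end are hypothetical, chosen only for illustration. It shows what toarray and np.array each return, and how much memory a dense copy would need compared with the CSR representation.

```python
import numpy as np
from scipy import sparse

# Small sparse matrix standing in for X_train_tfidf (shape and values are made up).
X = sparse.eye(200, 1000, format="csr")

# toarray() is the supported conversion and returns an ordinary 2-D ndarray.
dense = X.toarray()
print(type(dense).__name__, dense.shape)

# np.array() does not convert; it wraps the sparse object in a 0-d object array.
wrapped = np.array(X)
print(wrapped.ndim, wrapped.dtype)

# Bytes held by the CSR representation: stored values plus the two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


def dense_nbytes(shape, itemsize=8):
    """Bytes needed by a dense array of this shape (8 bytes per float64 cell)."""
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize


print(sparse_bytes, dense_nbytes(X.shape))

# Hypothetical corpus size: 50,000 documents and a 200,000-term vocabulary.
print(dense_nbytes((50_000, 200_000)) / 1e9, "GB")
```

For the hypothetical corpus the dense float64 copy would need about 80 GB, which is why toarray on a real TF-IDF matrix can easily exhaust memory.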

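Does LogisticRegression accept sparse matrices, or does it only work with dense arrays?

I use dense arrays and that works fine. But I also need to classify text, and for that I have sparse matrices.

You can use sparse.hstack to add the intercept column to the sparse matrix. The result will be a sparse matrix.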
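A minimal sketch of that suggestion, not a drop-in replacement for the class in the question: the function names add_intercept, sigmoid and fit_sparse and the toy data are assumptions made for illustration. The intercept column is converted to sparse before stacking, and the products use the @ operator, because the scipy documentation warns that np.dot is not aware of sparse matrices.

```python
import numpy as np
from scipy import sparse


def add_intercept(X):
    """Prepend a column of ones to a sparse matrix; the result stays sparse (CSR)."""
    intercept = sparse.csr_matrix(np.ones((X.shape[0], 1)))
    return sparse.hstack((intercept, X), format="csr")


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_sparse(X, y, lr=0.1, num_iter=100):
    """Batch gradient descent that never densifies X."""
    X = add_intercept(X)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iter):
        h = sigmoid(X @ theta)             # sparse @ dense vector gives a dense 1-D array
        gradient = X.T @ (h - y) / y.size  # same update rule as in the question
        theta -= lr * gradient
    return theta


# Toy data (made up): class 1 when the second feature is non-zero.
X_demo = sparse.csr_matrix(np.array([[0.0, 1.0],
                                     [1.0, 0.0],
                                     [0.0, 2.0],
                                     [2.0, 0.0]]))
y_demo = np.array([1.0, 0.0, 1.0, 0.0])

theta = fit_sparse(X_demo, y_demo, lr=0.5, num_iter=500)
preds = sigmoid(add_intercept(X_demo) @ theta) >= 0.5
print(preds)
```

Only the weight vector and the per-row predictions are dense here, so memory use grows with the number of stored non-zero values rather than with rows times columns.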
