scipy matrix to numpy array in a Python classification task
Tags: python, numpy, scikit-learn, scipy, logistic-regression
I have X_train data (class 'pandas.core.series.Series') with the following content:
print(X_train)
0 WASHINGTON — Congressional Republicans have...
1 After the bullet shells get counted, the blood...
2 When Walt Disney’s “Bambi” opened in 1942, cri...
3 Death may be the great equalizer, but it isn’t...
4 SEOUL, South Korea — North Korea’s leader, ...
Then I want to prepare the data for classification:
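from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)  # sparse term counts
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)  # sparse tf-idf weights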
Now X_train_tfidf and X_train_counts are both of class 'scipy.sparse.csr.csr_matrix'.
But my logistic regression function can only work with numpy arrays. What should I do to fix this?
import numpy as np

class LogisticRegression2:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, theta=0, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.theta = theta
        self.verbose = verbose

    def __add_intercept(self, X):
        # Prepend a column of ones (works for dense arrays only)
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
        # return .5 * (1 + np.tanh(.5 * z))

    def __loss(self, h, y):
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        # weights initialization
        self.theta = np.zeros(X.shape[1])
        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient
            if self.verbose and i % 10000 == 0:
                z = np.dot(X, self.theta)
                h = self.__sigmoid(z)
                print('loss: ', self.__loss(h, y))

    def predict_prob(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))

    def predict(self, X, threshold=0.5):
        return self.predict_prob(X) >= threshold
If I use
X_train_dense = X_train_tfidf.toarray()
model = LogisticRegression2(lr=0.1, num_iter=100)
model.fit(X_train_dense, y_train)
preds = model.predict(X_train_dense)
I get TypeError: unsupported operand type(s) for -: 'float' and 'str'
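One plausible cause of this TypeError, assuming y_train holds string labels (the question does not show it), is the expression h - y in fit, which would then subtract strings from floats. A minimal sketch of encoding the labels as 0/1 first; the label values and the helper name encode_binary_labels are hypothetical:

```python
import numpy as np

def encode_binary_labels(labels, positive_label):
    """Map string labels to floats: 1.0 for positive_label, 0.0 otherwise."""
    labels = np.asarray(labels)
    return (labels == positive_label).astype(float)

# Hypothetical labels; the real y_train is not shown in the question.
y_raw = ["politics", "arts", "politics", "arts", "politics"]
y_encoded = encode_binary_labels(y_raw, positive_label="politics")
print(y_encoded)  # [1. 0. 1. 0. 1.]
```

Passing the encoded array as y keeps h - y and the loss purely numeric.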
If I try
    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return hstack((intercept, X))
I get a MemoryError.
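A rough way to see where a MemoryError like this can come from is to compare the memory a CSR matrix actually holds with what a dense array of the same shape would need. The sketch below uses a randomly generated matrix as a stand-in for X_train_tfidf; the shape and density are made-up values, and the helper names are hypothetical.

```python
import numpy as np
from scipy import sparse

def sparse_nbytes(m):
    """Bytes held by the three arrays backing a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

def dense_nbytes(m):
    """Bytes a dense array of the same shape and dtype would need."""
    return m.shape[0] * m.shape[1] * m.dtype.itemsize

# Stand-in for X_train_tfidf: 1,000 documents, 50,000 terms, 0.1% non-zero.
X = sparse.random(1000, 50000, density=0.001, format="csr", random_state=0)

print("sparse:", sparse_nbytes(X), "bytes")
print("dense: ", dense_nbytes(X), "bytes")
```

Running the same two helpers on the real X_train_tfidf before calling toarray() shows whether the dense copy has any chance of fitting in RAM.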
X_train_dense = X_train_tfidf.toarray(): toarray is the sparse method for creating a dense array. np.concatenate does not do that conversion, and np.array(X_train_tfidf) is wrong. The scipy.sparse package documents the csr matrix and its methods.
I tried X_train_dense = X_train_tfidf.toarray(); model = LogisticRegression2(lr=0.1, num_iter=100); model.fit(X_train_dense, y_train); preds = model.predict(X_train_dense), and I get a MemoryError.
Memory errors are common when trying to make a dense array from a sparse one. That is why the code produces a sparse matrix in the first place.
Does LogisticRegression accept sparse matrices, or does it only work with dense arrays?
I use dense arrays and it works fine. But I also need to classify text, and for that I have sparse matrices.
You can use sparse.hstack to add the intercept array to the sparse matrix. The result will be a sparse matrix:
from scipy import sparse

    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        # sparse.hstack keeps the result sparse instead of densifying X
        return sparse.hstack((intercept, X))
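Adding the intercept with sparse.hstack is only part of the change: np.dot does not treat a sparse matrix like an array, so the products in fit and predict_prob likely need the @ operator (or X.dot(...)) as well. A minimal sketch of that idea, written as standalone functions rather than as the class above, and assuming y already holds 0/1 values; the function names and the toy data are illustrative only:

```python
import numpy as np
from scipy import sparse

def add_intercept(X):
    """Prepend a column of ones; the result stays sparse (CSR)."""
    ones = sparse.csr_matrix(np.ones((X.shape[0], 1)))
    return sparse.hstack((ones, X), format="csr")

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_sparse(X, y, lr=0.1, num_iter=1000):
    """Batch gradient descent for logistic regression on a sparse X."""
    X = add_intercept(X)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iter):
        h = sigmoid(X @ theta)             # sparse @ dense vector -> 1-D ndarray
        gradient = X.T @ (h - y) / y.size  # also a 1-D ndarray
        theta -= lr * gradient
    return theta

def predict_prob_sparse(X, theta):
    return sigmoid(add_intercept(X) @ theta)

# Toy data: feature 0 marks class 1, feature 1 marks class 0.
X_toy = sparse.csr_matrix(np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]))
y_toy = np.array([1.0, 1.0, 0.0, 0.0])
theta = fit_sparse(X_toy, y_toy)
print(predict_prob_sparse(X_toy, theta) >= 0.5)
```

Because X never leaves sparse form, each iteration costs time and memory proportional to the number of non-zero entries rather than to the full rows-by-columns size.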