
Python: how do I handle large matrices?


I am performing topic detection with supervised learning. However, my matrices are very large (202180 x 15000) and I cannot fit them into the models I want to use. Most of each matrix consists of zeros, and only logistic regression works. Is there a way to keep using the same data but make it work with the models I want? Can I build my matrices in a different way?

Here is my code:

import numpy as np
import subprocess
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

from sklearn import metrics

def run(command):
    # Run a shell command and return its output, decoded to text
    output = subprocess.check_output(command, shell=True)
    return output.decode()
The code below loads the vocabulary, creates the train matrix, makes the test matrix, and loads the supervised model.

# Load vocabulary
f = open('/Users/win/Documents/wholedata/RightVo.txt','r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("Training column size:")
print(col)
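
# Create the train matrix (dense binary bag-of-words)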
row = run('cat '+'/Users/win/Documents/wholedata/X_tr.txt'+" | wc -l").split()[0]
print("Training row size:")
print(row)
matrix_tmp = np.zeros((int(row),col), dtype=np.int64)
print("Train Matrix size:")
print(matrix_tmp.size)

label_tmp = np.zeros((int(row)), dtype=np.int64)
f = open('/Users/win/Documents/wholedata/X_tr.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()
print("Train matrix is:\n ")
print(matrix_tmp)
print(label_tmp)
print("Train Label size:")
print(len(label_tmp))

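# Make the test matrix with the same vocabulary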
f = open('/Users/win/Documents/wholedata/RightVo.txt','r')
vocab_tmp = f.read().split()
f.close()
col = len(vocab_tmp)
print("Test column size:")
print(col)
row = run('cat '+'/Users/win/Documents/wholedata/X_te.txt'+" | wc -l").split()[0]
print("Test row size:")
print(row)
matrix_tmp_test = np.zeros((int(row),col), dtype=np.int64)
print("Test matrix size:")
print(matrix_tmp_test.size)

label_tmp_test = np.zeros((int(row)), dtype=np.int64)

f = open('/Users/win/Documents/wholedata/X_te.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_tmp:
            continue
        matrix_tmp_test[count][vocab_tmp.index(word)] = 1
    count = count + 1
f.close()
print("Test Matrix is: \n")
print(matrix_tmp_test)
print(label_tmp_test)

print("Test Label Size:")
print(len(label_tmp_test))

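# Load the test and train labels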
xtrain=[]
with open("/Users/win/Documents/wholedata/Y_te.txt") as filer:
    for line in filer:
        xtrain.append(line.strip().split())
xtrain= np.ravel(xtrain)
label_tmp_test=xtrain

ytrain=[]
with open("/Users/win/Documents/wholedata/Y_tr.txt") as filer:
    for line in filer:
        ytrain.append(line.strip().split())
ytrain = np.ravel(ytrain)
label_tmp=ytrain
# Fit logistic regression on the train matrix and evaluate on the test matrix
model = LogisticRegression()
model = model.fit(matrix_tmp, label_tmp)
#print(model)
print("Entered 1")
y_train_pred = model.predict(matrix_tmp_test)
print("Entered 2")
print(metrics.accuracy_score(label_tmp_test, y_train_pred))

You can use a specific data structure provided by the scipy package, called a sparse matrix:

From the documentation:

A sparse matrix is simply a matrix with a large number of zero values. By contrast, a matrix in which many or most of the entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix, so we say a matrix is sparse if there is some benefit to exploiting its sparsity. Additionally, there are a variety of sparse matrix formats, designed to exploit different sparsity patterns (the structure of the non-zero values in a sparse matrix) and different methods of accessing and manipulating the matrix entries.
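A minimal sketch of that suggestion, assuming the same whitespace-tokenised input files and the binary bag-of-words encoding used in the question (the paths are copied from above; vocab_index and X_train are names introduced here for illustration): build the train matrix in scipy's LIL format and convert it to CSR before fitting.

import numpy as np
from scipy.sparse import lil_matrix

# Vocabulary path copied from the question
with open('/Users/win/Documents/wholedata/RightVo.txt') as f:
    vocab = f.read().split()
# Dict lookup is O(1); list.index() rescans the whole vocabulary for every word
vocab_index = {word: i for i, word in enumerate(vocab)}

with open('/Users/win/Documents/wholedata/X_tr.txt') as f:
    lines = f.readlines()

# LIL format is efficient for incremental, row-by-row construction
X_train = lil_matrix((len(lines), len(vocab)), dtype=np.int64)
for row, line in enumerate(lines):
    for word in line.split():
        col = vocab_index.get(word)
        if col is not None:
            X_train[row, col] = 1

# Convert to CSR before fitting; CSR is the sparse format most
# scikit-learn estimators handle efficiently
X_train = X_train.tocsr()

LogisticRegression and SGDClassifier (both already imported above) accept this CSR matrix directly in fit and predict, so memory scales with the number of non-zero entries rather than with the full 202180 x 15000 grid; the test matrix can be built the same way against the same vocab_index.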