
Fast information gain computation in Python


For text classification I need to compute information gain scores for >100k features across 10k documents. The code below works correctly, but it is very slow on the full dataset: it takes more than an hour on my laptop. The dataset is 20newsgroup and I am using scikit-learn; the scoring functions that ship with scikit-learn run much faster.
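For reference, one of those built-in scorers is `chi2`; a minimal sketch of that fast baseline (the vectorizer setup here is my assumption of how x and y are built, and chi2 measures a different statistic than information gain, so this is only a speed comparison):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Build the same kind of sparse document-term matrix the code below expects.
newsgroups = fetch_20newsgroups(subset='train')
x = CountVectorizer(binary=True).fit_transform(newsgroups.data)
y = newsgroups.target

# chi2 is one of scikit-learn's built-in, fast feature scorers.
scores, pvalues = chi2(x, y)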

Do you know how to compute the information gain faster for a dataset like this?

import numpy as np


def information_gain(x, y):

    def _entropy(values):
        # Shannon entropy (natural log) from the label counts
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return -np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        # documents in which the feature occurs / does not occur
        feature_set_indices = np.nonzero(feature)[1]
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])

        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores = []

    for feature in x.T:
        information_gain_scores.append(_information_gain(feature, y))
    return information_gain_scores, []
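The `(scores, [])` return value mimics scikit-learn's `(scores, pvalues)` scorer convention, so the function should plug into `SelectKBest`; a minimal sketch (my assumption, not shown in the original post):

from sklearn.feature_selection import SelectKBest

# information_gain returns (scores, pvalues-placeholder), matching the
# score_func convention, so SelectKBest can rank features by it.
selector = SelectKBest(score_func=information_gain, k=1000)
x_reduced = selector.fit_transform(x, y)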
Edit:

I merged the inner functions and ran `cProfile` as follows (on a dataset limited to ~15k features and ~1k documents).

Top 20 results, sorted by `tottime`:
ncalls  tottime percall cumtime percall filename:lineno(function)
1       60.27   60.27   65.48   65.48   <string>:1(<module>)
16171   1.362   0   2.801   0   csr.py:313(_get_row_slice)
16171   0.523   0   0.892   0   coo.py:201(_check)
16173   0.394   0   0.89    0   compressed.py:101(check_format)
210235  0.297   0   0.297   0   {numpy.core.multiarray.array}
16173   0.287   0   0.331   0   compressed.py:631(prune)
16171   0.197   0   1.529   0   compressed.py:534(tocoo)
16173   0.165   0   1.263   0   compressed.py:20(__init__)
16171   0.139   0   1.669   0   base.py:415(nonzero)
16171   0.124   0   1.201   0   coo.py:111(__init__)
32342   0.123   0   0.123   0   {method 'max' of 'numpy.ndarray' objects}
48513   0.117   0   0.218   0   sputils.py:93(isintlike)
32342   0.114   0   0.114   0   {method 'sum' of 'numpy.ndarray' objects}
16171   0.106   0   3.081   0   csr.py:186(__getitem__)
32342   0.105   0   0.105   0   {numpy.lib._compiled_base.bincount}
32344   0.09    0   0.094   0   base.py:59(set_shape)
210227  0.088   0   0.088   0   {isinstance}
48513   0.081   0   1.777   0   fromnumeric.py:1129(nonzero)
32342   0.078   0   0.078   0   {method 'min' of 'numpy.ndarray' objects}
97032   0.066   0   0.153   0   numeric.py:167(asarray)
It looks like most of the time is spent in `_get_row_slice`. I'm not entirely sure about the first row: it seems to cover the whole block I passed to `cProfile.runctx`, but I don't see why there is such a big gap between the first row's `tottime=60.27` and the second row's `tottime=1.362`. Where did the difference go? Is it possible to dig into it with `cProfile`?
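One way to break that gap down is to write the profile to a file and inspect it with `pstats`; a minimal sketch (the `'ig.prof'` filename is just an example):

import cProfile
import pstats

# Write the profile to disk instead of printing it.
cProfile.runctx('information_gain(x, y)', globals(), locals(), filename='ig.prof')

stats = pstats.Stats('ig.prof')
stats.sort_stats('tottime').print_stats(20)   # the kind of table shown above
stats.print_callees('information_gain')       # attribute the remaining time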


Basically, the problem is the sparse matrix operations (slicing, fetching individual elements). The solution would probably be to compute the information gain with matrix algebra, but I don't know how to express this computation in terms of matrix operations... Does anyone have an idea?

A year has passed, so I don't know if this is still useful to you, but I recently faced the same text classification task. I rewrote your code using the `nonzero()` function provided for sparse matrices: I scan only the nonzero entries `nz`, count the corresponding `y` values, and compute the entropies from those counts.

The following code takes only a few seconds to run on the news20 dataset (loaded in libsvm sparse matrix format); see the `information_gain` implementation after the comments below.


Here is a version that uses matrix operations. The IG of a feature is the average of its class-specific scores:

import numpy as np
from scipy.sparse import issparse
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import check_array
from sklearn.utils.extmath import safe_sparse_dot


def ig(X, y):

    # Each helper computes one term of the IG sum, zeroing out the
    # non-finite values produced by log2(0) or 0/0.
    def get_t1(fc, c, f):
        t = np.log2(fc/(c * f))
        t[~np.isfinite(t)] = 0
        return np.multiply(fc, t)

    def get_t2(fc, c, f):
        t = np.log2((1-f-c+fc)/((1-c)*(1-f)))
        t[~np.isfinite(t)] = 0
        return np.multiply((1-f-c+fc), t)

    def get_t3(c, f, class_count, observed, total):
        nfc = (class_count - observed)/total
        t = np.log2(nfc/(c*(1-f)))
        t[~np.isfinite(t)] = 0
        return np.multiply(nfc, t)

    def get_t4(c, f, feature_count, observed, total):
        fnc = (feature_count - observed)/total
        t = np.log2(fnc/((1-c)*f))
        t[~np.isfinite(t)] = 0
        return np.multiply(fnc, t)

    X = check_array(X, accept_sparse='csr')
    if np.any((X.data if issparse(X) else X) < 0):
        raise ValueError("Input X must be non-negative.")

    # one-hot encode the labels; binary problems need the second column added
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    # counts

    observed = safe_sparse_dot(Y.T, X)                   # n_classes * n_features
    total = observed.sum(axis=0).reshape(1, -1).sum()
    feature_count = X.sum(axis=0).reshape(1, -1)         # 1 * n_features
    class_count = (X.sum(axis=1).reshape(1, -1) * Y).T   # n_classes * 1

    # probs

    f = feature_count / feature_count.sum()
    c = class_count / float(class_count.sum())
    fc = observed / total

    # the feature score is averaged over classes
    scores = (get_t1(fc, c, f) +
              get_t2(fc, c, f) +
              get_t3(c, f, class_count, observed, total) +
              get_t4(c, f, feature_count, observed, total)).mean(axis=0)

    scores = np.asarray(scores).reshape(-1)

    return scores, []
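A usage sketch for `ig` on the 20newsgroups data (the binary `CountVectorizer` setup is my assumption, not part of the original answer):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups = fetch_20newsgroups(subset='train')
X = CountVectorizer(binary=True).fit_transform(newsgroups.data)
y = newsgroups.target

scores, _ = ig(X, y)
top20 = scores.argsort()[::-1][:20]   # indices of the highest-scoring features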

On a dataset with 1000 instances and 1000 unique features, this implementation is more than 100 times faster than the one without matrix operations.

It is this line of code,
feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
that takes 90% of the time; try changing it to a set operation, as in the sketch below.
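A minimal sketch of that change, using `numpy.setdiff1d` as the set operation (a plain Python `set` difference works as well):

import numpy as np

# Instead of an O(n_docs * nnz) membership test per feature,
# take the set difference of the index ranges directly.
feature_set_indices = np.nonzero(feature)[1]
feature_not_set_indices = np.setdiff1d(np.arange(feature_size), feature_set_indices)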

Have you tried using a profiler to see where the bottleneck is? Have you tried parallelizing the processing of the data?

Thanks, I edited the post.

My advice, unrelated to the question: before computing the information gain, reduce the feature set with simpler, cheaper-to-compute criteria. For example, many n-grams (which I assume are your features) occur only once or twice in the corpus and should be excluded beforehand, which shrinks the feature set considerably; a sketch of such pruning follows below.

Sir, please help me understand X and y here. Thanks.

X is a matrix containing the features of every instance, where each row represents one instance.
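The pruning advice above corresponds to `CountVectorizer`'s `min_df` parameter; a minimal sketch (the threshold of 3 and the bigram range are arbitrary examples, not from the comment):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups = fetch_20newsgroups(subset='train')

# Drop n-grams that occur in fewer than 3 documents before any IG scoring;
# rare n-grams carry little signal and inflate the feature set.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, binary=True)
X = vectorizer.fit_transform(newsgroups.data)

The sparse-scan `information_gain` implementation referenced in the first answer: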
import numpy as np


def information_gain(X, y):

    def _calIg():
        # entropy of y among documents where the feature is set / not set,
        # from the per-class counts gathered while scanning this feature
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            if probs > 0:   # guard against log(0) when every doc of class c has the feature
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                             +  ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]

    # class frequencies and the entropy of y before any split
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        classTotCnt[i] = classTotCnt.get(i, 0) + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    # scan the nonzero entries feature by feature:
    # nz[0] holds feature indices, nz[1] the matching document indices
    nz = X.T.nonzero()
    pre = nz[0][0]
    classCnt = {}
    featureTot = 0
    information_gain = [0] * pre            # leading all-zero features score 0
    for i in range(0, len(nz[0])):
        if nz[0][i] != pre:
            information_gain.append(_calIg())
            # all-zero features between two nonzero ones also score 0
            for notappear in range(pre + 1, nz[0][i]):
                information_gain.append(0)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        classCnt[yclass] = classCnt.get(yclass, 0) + 1
    information_gain.append(_calIg())
    # trailing all-zero features score 0 as well
    information_gain.extend([0] * (X.shape[1] - len(information_gain)))

    return np.asarray(information_gain)
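A loading sketch matching the answer's description (news20 in libsvm sparse format; `'news20.scale'` is a placeholder filename, not from the answer):

from sklearn.datasets import load_svmlight_file

# load_svmlight_file returns a CSR matrix and a label vector,
# which is exactly what information_gain expects.
X, y = load_svmlight_file('news20.scale')
scores = information_gain(X, y)
print(scores.argsort()[::-1][:20])   # top-20 features by information gain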