Fast information gain computation in Python

python performance machine-learning scikit-learn


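I need to compute information gain scores for more than 100k features over 10k documents for text classification. The code below works correctly, but on the full dataset it is very slow: it takes more than an hour on a laptop. The dataset is 20newsgroup and I am using scikit-learn; the functions that scikit-learn provides run very fast.
Do you know how to compute information gain faster for a dataset like this?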

    import numpy as np


    def information_gain(x, y):

        def _entropy(values):
            counts = np.bincount(values)
            probs = counts[np.nonzero(counts)] / float(len(values))
            return - np.sum(probs * np.log(probs))

        def _information_gain(feature, y):
            # rows (documents) in which this feature is nonzero
            feature_set_indices = np.nonzero(feature)[1]
            # rows in which it is zero (this membership test is the slow part)
            feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
            entropy_x_set = _entropy(y[feature_set_indices])
            entropy_x_not_set = _entropy(y[feature_not_set_indices])

            return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                     + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

        feature_size = x.shape[0]
        feature_range = range(0, feature_size)
        entropy_before = _entropy(y)
        information_gain_scores = []

        for feature in x.T:
            information_gain_scores.append(_information_gain(feature, y))
        return information_gain_scores, []

Edit:
I merged the inner functions and ran cProfile on a dataset limited to ~15k features and ~1k documents. Top 20 results by tottime:

    ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
    1        60.27    60.27    65.48    65.48    <string>:1(<module>)
    16171    1.362    0        2.801    0        csr.py:313(_get_row_slice)
    16171    0.523    0        0.892    0        coo.py:201(_check)
    16173    0.394    0        0.89     0        compressed.py:101(check_format)
    210235   0.297    0        0.297    0        {numpy.core.multiarray.array}
    16173    0.287    0        0.331    0        compressed.py:631(prune)
    16171    0.197    0        1.529    0        compressed.py:534(tocoo)
    16173    0.165    0        1.263    0        compressed.py:20(__init__)
    16171    0.139    0        1.669    0        base.py:415(nonzero)
    16171    0.124    0        1.201    0        coo.py:111(__init__)
    32342    0.123    0        0.123    0        {method 'max' of 'numpy.ndarray' objects}
    48513    0.117    0        0.218    0        sputils.py:93(isintlike)
    32342    0.114    0        0.114    0        {method 'sum' of 'numpy.ndarray' objects}
    16171    0.106    0        3.081    0        csr.py:186(__getitem__)
    32342    0.105    0        0.105    0        {numpy.lib._compiled_base.bincount}
    32344    0.09     0        0.094    0        base.py:59(set_shape)
    210227   0.088    0        0.088    0        {isinstance}
    48513    0.081    0        1.777    0        fromnumeric.py:1129(nonzero)
    32342    0.078    0        0.078    0        {method 'min' of 'numpy.ndarray' objects}
    97032    0.066    0        0.153    0        numeric.py:167(asarray)

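It looks like most of the time is spent in _get_row_slice. I am not entirely sure about the first row: it seems to cover the whole block I passed to cProfile.runctx, but I don't know why there is such a large gap between the first row (tottime=60.27) and the second (tottime=1.362). Where does the difference go? Is it possible to check that in cProfile?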
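On the gap between the rows: in cProfile output, tottime is the time spent inside a function excluding calls to other profiled functions, while cumtime includes them. Work that runs inline in the profiled statement (for example a loop at the top level of the block passed to cProfile.runctx) would therefore be charged to the <string>:1(<module>) row, which is one plausible reading of the 60.27 figure. The sketch below shows how to collect a report sorted by tottime; slow_membership is a hypothetical stand-in for the hot loop, not the poster's code.

```python
import cProfile
import io
import pstats


def slow_membership(n):
    """Hypothetical stand-in: membership tests against a list are linear."""
    present = list(range(0, n, 2))
    return [i for i in range(n) if i not in present]


def run_profile(n=2000, top=5):
    """Profile slow_membership and return its result plus a tottime-sorted report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(slow_membership, n)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


result, report = run_profile()
print(report)
```

Moving the hot loop into its own function, as here, gives it its own row in the report. Depending on the Python version, the list comprehension may also show up as a separate <listcomp> row.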

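Basically, the problem is the sparse matrix operations (slicing, getting elements). A solution would probably be to compute information gain with matrix algebra, but I don't know how to express this computation in terms of matrix operations. Does anyone have an idea?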
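One way the question's definition might be written with matrix products, as a minimal sketch and not any poster's method: a single sparse product between a 0/1 presence matrix and a one-hot label matrix gives, for every feature, the per-class counts of documents that contain it. The counts for documents that lack the feature follow by subtraction, and the entropies are then computed row-wise. The names information_gain_vectorized and _entropy_rows are made up for this example, and it assumes the same natural-log, present/absent definition as the question's code.

```python
import numpy as np
from scipy import sparse


def _entropy_rows(counts):
    """Entropy (natural log) of each row of a count matrix; 0 * log(0) counts as 0."""
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        probs = np.where(totals > 0, counts / totals, 0.0)
        terms = np.where(probs > 0, probs * np.log(probs), 0.0)
    return -terms.sum(axis=1)


def information_gain_vectorized(X, y):
    """IG of the binary event 'feature is nonzero' for every column of X."""
    y = np.asarray(y)
    n_docs = len(y)

    # 0/1 presence matrix, shape (n_docs, n_features).
    presence = sparse.csr_matrix(X, dtype=np.float64, copy=True)
    presence.eliminate_zeros()
    presence.data[:] = 1.0

    # One-hot label matrix, shape (n_docs, n_classes).
    classes, y_idx = np.unique(y, return_inverse=True)
    Y = sparse.csr_matrix(
        (np.ones(n_docs), (np.arange(n_docs), y_idx)),
        shape=(n_docs, len(classes)),
    )

    # set_counts[f, c]: documents of class c that contain feature f.
    set_counts = (presence.T @ Y).toarray()
    class_counts = np.asarray(Y.sum(axis=0)).ravel()
    not_set_counts = class_counts[np.newaxis, :] - set_counts

    n_set = set_counts.sum(axis=1)
    n_not_set = n_docs - n_set

    entropy_before = _entropy_rows(class_counts[np.newaxis, :])[0]
    conditional = ((n_set / n_docs) * _entropy_rows(set_counts)
                   + (n_not_set / n_docs) * _entropy_rows(not_set_counts))
    return entropy_before - conditional


X_demo = np.array([[1, 0, 1, 1],
                   [1, 0, 1, 0],
                   [0, 1, 1, 1],
                   [0, 1, 1, 0]])
y_demo = np.array([0, 0, 1, 1])
print(information_gain_vectorized(X_demo, y_demo))
```

In the toy data the first two features separate the classes perfectly, so they score log(2); the last two carry no information about the class and score 0.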
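A year has passed and I don't know whether this is still useful, but I now happen to face the same text classification task. I rewrote your code using the nonzero() function provided for sparse matrices. I then simply scan nz, count the corresponding y values, and compute the entropy.
The information_gain(X, y) code for this answer, listed near the end of this page, takes only a few seconds to run on the news20 dataset (loaded in the libsvm sparse matrix format).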

Here is a version that uses matrix operations. The IG of a feature is the mean of its class-specific scores.

    import numpy as np
    from scipy.sparse import issparse
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.utils import check_array
    from sklearn.utils.extmath import safe_sparse_dot


    def ig(X, y):

        def get_t1(fc, c, f):
            t = np.log2(fc/(c * f))
            t[~np.isfinite(t)] = 0
            return np.multiply(fc, t)

        def get_t2(fc, c, f):
            t = np.log2((1-f-c+fc)/((1-c)*(1-f)))
            t[~np.isfinite(t)] = 0
            return np.multiply((1-f-c+fc), t)

        def get_t3(c, f, class_count, observed, total):
            nfc = (class_count - observed)/total
            t = np.log2(nfc/(c*(1-f)))
            t[~np.isfinite(t)] = 0
            return np.multiply(nfc, t)

        def get_t4(c, f, feature_count, observed, total):
            fnc = (feature_count - observed)/total
            t = np.log2(fnc/((1-c)*f))
            t[~np.isfinite(t)] = 0
            return np.multiply(fnc, t)

        X = check_array(X, accept_sparse='csr')
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative.")

        Y = LabelBinarizer().fit_transform(y)
        if Y.shape[1] == 1:
            Y = np.append(1 - Y, Y, axis=1)

        # counts
        observed = safe_sparse_dot(Y.T, X)          # n_classes * n_features
        total = observed.sum(axis=0).reshape(1, -1).sum()
        feature_count = X.sum(axis=0).reshape(1, -1)
        class_count = (X.sum(axis=1).reshape(1, -1) * Y).T

        # probs
        f = feature_count / feature_count.sum()
        c = class_count / float(class_count.sum())
        fc = observed / total

        # the feature score is averaged over classes
        scores = (get_t1(fc, c, f) +
                  get_t2(fc, c, f) +
                  get_t3(c, f, class_count, observed, total) +
                  get_t4(c, f, feature_count, observed, total)).mean(axis=0)

        scores = np.asarray(scores).reshape(-1)

        return scores, []


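On a dataset with 1000 instances and 1000 unique features, this implementation is more than 100 times faster than the one without matrix operations.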
This line of code

    feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]

takes 90% of the time. Try changing it to a set operation.
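A minimal sketch of what that change could look like. The helper names are hypothetical, and set_indices stands for the feature_set_indices array from the question. Both alternatives return the same indices as the original list comprehension, but they avoid scanning the whole array for every row index.

```python
import numpy as np


def not_set_slow(set_indices, n_rows):
    """Original pattern: a linear membership test for every row index."""
    return [i for i in range(n_rows) if i not in set_indices]


def not_set_with_set(set_indices, n_rows):
    """Python set: each membership test is O(1) on average."""
    present = set(np.asarray(set_indices).tolist())
    return [i for i in range(n_rows) if i not in present]


def not_set_with_mask(set_indices, n_rows):
    """Boolean mask: stays inside NumPy and returns an index array."""
    mask = np.ones(n_rows, dtype=bool)
    mask[np.asarray(set_indices, dtype=int)] = False
    return np.flatnonzero(mask)


set_indices = np.array([1, 3, 4])
print(not_set_slow(set_indices, 6))
print(not_set_with_set(set_indices, 6))
print(not_set_with_mask(set_indices, 6))
```

np.setdiff1d(np.arange(n_rows), set_indices) is another way to get the same result in one call.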
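Have you tried using a profiler to see where the bottleneck is? Have you tried processing the data in parallel?

Thanks, I edited the post.

My advice, unrelated to the question: before computing information gain, reduce the feature set first with a simpler, cheaper method. For example, many n-grams (which I assume are your features) appear only once or twice in the corpus and should be excluded beforehand, which greatly reduces your feature set.

Sir, please help me understand X and y here. Thanks.

X is a matrix containing the features of each instance, where each row represents one instance.
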
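The pruning advice in the comments above could be sketched like this, assuming X is a document-by-feature matrix (one row per instance, as just described). The function name drop_rare_features and the min_df threshold are illustrative choices, not something from the thread.

```python
import numpy as np
from scipy import sparse


def drop_rare_features(X, min_df=3):
    """Keep only the columns that are nonzero in at least min_df rows (documents)."""
    X = sparse.csc_matrix(X, copy=True)
    X.eliminate_zeros()                 # so stored entries are exactly the nonzeros
    doc_freq = X.getnnz(axis=0)         # number of documents containing each feature
    keep = np.flatnonzero(doc_freq >= min_df)
    return X[:, keep].tocsr(), keep


X_demo = np.array([[1, 0, 2],
                   [1, 0, 0],
                   [1, 3, 0],
                   [0, 0, 5]])
X_small, kept = drop_rare_features(X_demo, min_df=2)
print(kept, X_small.shape)
```

The kept array maps columns of the reduced matrix back to the original feature indices, so scores computed on X_small can still be attributed to the right features. If the features come from scikit-learn's CountVectorizer, its min_df parameter applies a similar filter at vectorization time.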
    import numpy as np


    def information_gain(X, y):

        def _calIg():
            # uses the counts accumulated for the current feature
            entropy_x_set = 0
            entropy_x_not_set = 0
            for c in classCnt:
                probs = classCnt[c] / float(featureTot)
                entropy_x_set = entropy_x_set - probs * np.log(probs)
                probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
            for c in classTotCnt:
                if c not in classCnt:
                    probs = classTotCnt[c] / float(tot - featureTot)
                    entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
            return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                     + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

        # class totals and the entropy of y before any split
        tot = X.shape[0]
        classTotCnt = {}
        entropy_before = 0
        for i in y:
            if i not in classTotCnt:
                classTotCnt[i] = 1
            else:
                classTotCnt[i] = classTotCnt[i] + 1
        for c in classTotCnt:
            probs = classTotCnt[c] / float(tot)
            entropy_before = entropy_before - probs * np.log(probs)

        # nz[0]: feature index, nz[1]: document index of every nonzero entry
        nz = X.T.nonzero()
        pre = 0
        classCnt = {}
        featureTot = 0
        information_gain = []
        for i in range(0, len(nz[0])):
            if (i != 0 and nz[0][i] != pre):
                # features that never appear get a score of 0
                for notappear in range(pre+1, nz[0][i]):
                    information_gain.append(0)
                ig = _calIg()
                information_gain.append(ig)
                pre = nz[0][i]
                classCnt = {}
                featureTot = 0
            featureTot = featureTot + 1
            yclass = y[nz[1][i]]
            if yclass not in classCnt:
                classCnt[yclass] = 1
            else:
                classCnt[yclass] = classCnt[yclass] + 1
        ig = _calIg()
        information_gain.append(ig)

        return np.asarray(information_gain)
