Python: how to resolve the error "value too large for dtype('float32')"?
I have read many similar questions, but I still cannot work out what is wrong here:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
X_to_predict = array([[ 1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
9.81898928e+001, 1.22703799e+002, -2.45139066e+001,
9.24341823e+001, 1.11457878e+002, -1.90236954e+001]])
clf.predict_proba(X_to_predict)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
My problem is caused by neither nan nor inf values, because:
np.isnan(X_to_predict).sum()
Out[147]: 0
np.isinf(X_to_predict).sum()
Out[148]: 0
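Question: how do I convert X_to_predict to values that are not too large for float32, while keeping as many digits after the decimal point as possible?

If you check the dtype of your array X_to_predict, it should show float64: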
# slightly modified array from the question
X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))
print(X_to_predict.dtype)
>>> float64
sklearn's RandomForestClassifier silently converts the array to float32; see the discussion on where the error message comes from.
You can do the conversion yourself:
print(X_to_predict.astype(np.float32))
>>> array([[137.09703 , 0. , -inf, 122.7038 ],
[137.09703 , -25.639154, 111.45788 , 137.09703 ],
[-25.639154, 98.189896, 122.7038 , -24.513906]],
dtype=float32)
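The third value (-1.82710826e+296) becomes -inf in float32, because its magnitude is far beyond the largest finite float32 value (about 3.4e+38). The only workaround is to replace the inf values with the largest finite float32 value. You will lose some precision, but as far as I know there is currently no parameter or workaround for this, other than changing the implementation in sklearn and recompiling.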
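To see why np.isnan and np.isinf both report 0 while the cast still fails, the float64 values can be compared against the float32 range directly. This is only a small sketch; the shortened array and the variable names f32_max and too_large are illustrative, not part of the original post.

```python
import numpy as np

# Shortened version of the array from the question (float64 by default).
X_to_predict = np.array([1.37097033e+002, 0.0, -1.82710826e+296, 1.22703799e+002])

# Largest finite value a float32 can hold, as a plain Python float.
f32_max = float(np.finfo(np.float32).max)

# Finite in float64, but too large in magnitude to survive a cast to float32.
too_large = np.abs(X_to_predict) > f32_max

print(np.isinf(X_to_predict).sum())    # no inf in the float64 data
print(np.argwhere(too_large).ravel())  # positions that would overflow
```

Only the entries flagged in too_large turn into inf after astype(np.float32); everything else is merely rounded to float32 precision.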
If you use np.nan_to_num, your array should look like this:
new_X = np.nan_to_num(X_to_predict.astype(np.float32))
print(new_X)
>>> array([[ 1.3709703e+02, 0.0000000e+00, -3.4028235e+38, 1.2270380e+02],
[ 1.3709703e+02, -2.5639154e+01, 1.1145788e+02, 1.3709703e+02],
[-2.5639154e+01, 9.8189896e+01, 1.2270380e+02, -2.4513906e+01]],
dtype=float32)
This should be accepted by your classifier.
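An alternative to casting first and repairing the inf values afterwards is to clip the float64 data into the float32 range before the cast. The sketch below assumes that replacing out-of-range values with the float32 limits is acceptable for your data; to_float32_clipped is a hypothetical helper name, not something sklearn or NumPy provides.

```python
import numpy as np

def to_float32_clipped(X):
    """Clip values into the finite float32 range, then cast to float32.

    Hypothetical helper for illustration only. NaN values pass through
    unchanged and still need separate handling.
    """
    info = np.finfo(np.float32)
    X = np.asarray(X, dtype=np.float64)
    clipped = np.clip(X, float(info.min), float(info.max))
    return clipped.astype(np.float32)

X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
                         1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
                         1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
                         9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))

new_X = to_float32_clipped(X_to_predict)
print(new_X.dtype)
print(np.isfinite(new_X).all())
```

Because clipping happens before the cast, no inf is ever produced. For this array the result should match what np.nan_to_num gives after the cast, but unlike nan_to_num this version leaves any NaN values untouched.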
Full code:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
clf = RandomForestClassifier(n_estimators=10,
random_state=42)
clf.fit(iris.data, iris.target)
X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))
print(X_to_predict.dtype)
print(X_to_predict.astype(np.float32))
new_X = np.nan_to_num(X_to_predict.astype(np.float32))
print(new_X)
#should return array([2, 2, 0])
print(clf.predict(new_X))
# should crash
clf.predict(X_to_predict)
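This error message can be misleading. It can also appear when the dataset contains empty values, meaning that some features have blank entries. How can this be resolved?

Export the DataFrame to CSV so it can be inspected. In the code below, df is the DataFrame to export:
compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('out.zip', index=False, compression=compression_opts)

You can also try the following to find the rows where a column holds an empty string:
df[df['column_name'] == ''].index

Identify the features with blank values by analysing the exported CSV, then drop the complete records that have null values in those features:
df = df.dropna(subset=['column_name'])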
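The blank-value check described above can also be done without exporting anything. The following is a minimal sketch on a made-up DataFrame; the column names and values are purely illustrative, and it assumes blanks appear either as empty strings or as NaN.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only).
df = pd.DataFrame({
    "column_name": ["1.5", "", "3.0"],
    "other": [1.0, np.nan, 2.0],
})

# Treat empty strings as missing so they are counted together with NaN.
df = df.replace("", np.nan)

# Number of missing entries per feature.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop the records that are missing a value in the chosen feature.
cleaned = df.dropna(subset=["column_name"])
print(cleaned)
```

Counting missing values per column first shows which features are affected before any rows are dropped.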