Python 3.x: Why is the variance of sklearn's standardized data not equal to 1?

python-3.x scikit-learn

I use preprocessing from the sklearn package to standardize my data, as follows:

import pandas as pd
from sklearn import preprocessing

# Load the tab-separated decathlon data set
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()

# Standardize the first ten columns on a copy of the data
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()

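In the result, the mean is -1.516402e-16, which is almost 0. The variance, by contrast, is 1.012423e+00, i.e. 1.012423. To me, 1.012423 does not count as close to 1.

Could you please elaborate on this phenomenon?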

In this case, sklearn and pandas compute std in different ways.

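The documentation of sklearn.preprocessing.scale says: "We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance."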
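
The effect of the two conventions can be checked directly with NumPy. The sketch below is only an illustration on synthetic data: the sample size n = 41 and the random values are assumptions, not the decathlon data from the question. It standardizes with the biased estimator (ddof=0), as the quoted note describes, and then measures the result with both divisors. The two measurements differ by a factor of sqrt(n / (n - 1)), which for n = 41 is about 1.012423.

```python
import numpy as np

# Synthetic data; n = 41 is an assumed sample size chosen for illustration.
rng = np.random.default_rng(0)
n = 41
x = rng.normal(loc=10.0, scale=3.0, size=n)

# Standardize with the biased standard deviation (divisor N, i.e. ddof=0).
z = (x - x.mean()) / np.std(x, ddof=0)

std_biased = np.std(z, ddof=0)         # divisor N: the convention used for scaling
std_unbiased = np.std(z, ddof=1)       # divisor N - 1: the convention pandas defaults to
expected_ratio = np.sqrt(n / (n - 1))  # factor by which the two conventions differ

print(std_biased, std_unbiased, expected_ratio)
```

The gap shrinks as n grows, so on a large data set the two values would be nearly indistinguishable.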

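pandas.DataFrame.describe uses pandas.core.series.Series.std, whose documentation says: "Normalized by N-1 by default. This can be changed using the ddof argument."

"ddof : int, default 1. Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements."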

Note that, as of 2020-10-28, pandas.DataFrame.describe has no ddof parameter, so the default ddof=1 is always used for each Series.
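
Since describe() cannot be told which ddof to use, one way to confirm that the scaled data really has unit spread is to call std(ddof=0) on the DataFrame directly. The following is a minimal sketch on a made-up 41 x 3 DataFrame; the shape and the column names a, b, c are hypothetical. It also shows a pandas-only standardization that divides by the ddof=1 standard deviation, in case you want describe() itself to report 1.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Made-up data; the shape and column names are assumptions for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(41, 3)), columns=["a", "b", "c"])

# sklearn scales with the biased standard deviation (ddof=0).
scaled = pd.DataFrame(preprocessing.scale(df.to_numpy()), columns=df.columns)

describe_std = scaled.describe().loc["std"]  # pandas default, ddof=1
population_std = scaled.std(ddof=0)          # same convention as sklearn

# Alternative: standardize with the ddof=1 standard deviation,
# so that the std row of describe() equals 1.
scaled_ddof1 = (df - df.mean()) / df.std()   # DataFrame.std defaults to ddof=1

print(describe_std)
print(population_std)
print(scaled_ddof1.describe().loc["std"])
```

Which convention to standardize with is mostly a matter of consistency; as the quoted sklearn note says, the choice of ddof is unlikely to affect model performance.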
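Thank you very much. It's clear now. A small suggestion: DataFrame.describe uses numpy.std(x, ddof=1), not numpy.std(x).

@LAD I couldn't find a direct reference to ddof=1. That is because numpy.std(x) uses ddof=0 by default.

@LAD I have updated the answer to quote the source code/documentation directly. Thank you very much for your help.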