How to implement kernel density estimation for multivariate/3D data in Python

python numpy machine-learning scikit-learn

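My dataset has the following format, and I am trying to find the kernel density estimate with the optimal bandwidth.

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

But I don't know how to go about it. Also, how do I find the matrix Σ?

Update

I tried the KDE function from the scikit-learn toolkit to compute a univariate (1D) KDE.

Can anyone help me extend this to the multivariate case, which here means three-dimensional data?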

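Interesting problem. You have a few options:

1. Continue with scikit-learn.

2. Use a different library. For instance, if the kernel you are interested in is Gaussian, you could use scipy.stats.gaussian_kde, which is arguably easier to understand and apply. There is a very good example of this technique.

3. Roll your own from first principles. This is very difficult and I don't recommend it.

The relative merits of the various library implementations of kernel density estimation (KDE) have been covered in detail elsewhere.

I'm going to show you what is, in my opinion, the easiest way (yes, this is a bit opinion based). I think that in your case it is option 2.

Note that this method uses a rule of thumb, as described in the linked documentation, to determine the bandwidth. The exact rule used is Scott's rule. Your mention of the Σ matrix makes me think that rule-of-thumb bandwidth selection is acceptable to you, but you also talk about an optimal bandwidth, and the example you give uses cross-validation to determine the best bandwidth. So if this method doesn't suit your purposes, let me know in the comments.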

import numpy as np
from scipy import stats

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

data = data.T  # gaussian_kde takes N vectors of length K for K data points,
               # rather than K vectors of length N

kde = stats.gaussian_kde(data)

# You now have your kde! Interpreting / visualising it can be difficult with 3D data.
# You might like to try 2D data first: then you can plot the resulting estimated pdf
# as the height in the third dimension, making visualisation easier.

# Here is the basic way to evaluate the estimated pdf on a regular N-dimensional mesh.
# Create a regular N-dimensional grid with (arbitrary) 20 points in each dimension.
minima = data.min(axis=1)
maxima = data.max(axis=1)
space = [np.linspace(mini, maxi, 20) for mini, maxi in zip(minima, maxima)]
grid = np.meshgrid(*space)

# Turn the grid into N-dimensional coordinates for each point.
# Note: coords will get very large as N increases.
# (A list is used here because np.vstack does not accept a bare map object in Python 3.)
coords = np.vstack([g.ravel() for g in grid])

# Evaluate the KDE-estimated pdf at each coordinate
density = kde(coords)

# Do what you like with the density values here:
# plot them, output them, use them elsewhere...
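To make the "try 2D first" suggestion in the code comments concrete, here is a minimal sketch. It assumes, purely for illustration, that only the first two columns of the sample data are used; the names `data_2d`, `X`, `Y` and `Z` and the 50-point grid resolution are my own choices, not anything from the question.

```python
import numpy as np
from scipy import stats

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

# Assumption for illustration: keep only the first two coordinates.
data_2d = data[:, :2].T          # shape (2, 9): dimensions x points

kde_2d = stats.gaussian_kde(data_2d)

# Regular 2D grid spanning the range of the data (50 points per axis is arbitrary).
xs = np.linspace(data_2d[0].min(), data_2d[0].max(), 50)
ys = np.linspace(data_2d[1].min(), data_2d[1].max(), 50)
X, Y = np.meshgrid(xs, ys)

# Evaluate the estimated pdf at every grid point, then restore the grid shape
# so that Z[i, j] is the density at (X[i, j], Y[i, j]).
Z = kde_2d(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)

print(Z.shape, Z.max())
```

`X`, `Y` and `Z` are then in the layout that surface or contour plotting routines (for example matplotlib's `plot_surface` or `contourf`) typically expect, with the density as the height.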

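Warning

This may give terrible results, depending on your particular problem. The obvious things to bear in mind are:

As the number of dimensions increases, the number of observed data points you need grows exponentially. Your sample of 9 points in 3 dimensions is very sparse, although I assume the dots indicate that you in fact have many more.

As mentioned in the main body, the bandwidth is selected in a particular way. This may result in over-smoothing (or, conceivably but less likely, under-smoothing) of the estimated pdf.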
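To see what "selected in a particular way" means in numbers, the following sketch compares the factor that `gaussian_kde` reports with Scott's and Silverman's rules computed by hand. The formulas are the ones I understand SciPy to use for unweighted data; the variable names and the halved factor at the end are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]]).T

n_dims, n_points = data.shape    # 3 dimensions, 9 points

# Default bandwidth selection is Scott's rule: n ** (-1 / (d + 4))
kde_scott = stats.gaussian_kde(data)
scott_factor = n_points ** (-1.0 / (n_dims + 4))

# Silverman's rule: (n * (d + 2) / 4) ** (-1 / (d + 4))
kde_silverman = stats.gaussian_kde(data, bw_method="silverman")
silverman_factor = (n_points * (n_dims + 2) / 4.0) ** (-1.0 / (n_dims + 4))

# A scalar bw_method is used directly as the factor, so a smaller value
# gives a less smoothed estimate (halving is an arbitrary choice here).
kde_narrow = stats.gaussian_kde(data, bw_method=0.5 * scott_factor)

print(kde_scott.factor, scott_factor)
print(kde_silverman.factor, silverman_factor)
print(kde_narrow.factor)
```

If the default looks over-smoothed on your real data, passing a smaller scalar as `bw_method` is the simplest knob to turn.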

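I wonder if I can help, but I need to understand a little more. I can see that each data point has three values, but the way you have written it, these triplets are further grouped into threes. Is there a reason the input data is grouped twice? Also, could you double-check what you mean by the Σ matrix? I assume you mean the estimated data covariance, so that you can use a rule-of-thumb bandwidth of Σ^(-1/2)? If so, do you intend that as a starting point for bandwidth optimisation, or instead of optimisation?

Did my answer help? If not, feel free to add some comments, as I may be able to adapt it to your needs.

@JRichardSnape You are right, I grouped the data the wrong way. In my code it is actually just as you implemented it, but I messed it up when copying the code. And yes, Σ means the covariance matrix.

But I'm still not sure whether my answer below helps. Does it meet your needs, or is there more to your question? If you want the covariance matrix output, I can add that.

Thank you very much for your help, it was very helpful. I have two more questions that I hope you can help with. (1) How can I use my own bandwidth, as we did with the sklearn KDE and a grid search? (2) As you said, plot first in 2D and then show the height in 3D: can you tell me how to do that here? This is what I tried.

I will look at these questions when I have time.

Sorry, I haven't had a chance to look at this. 1) You can set your own bandwidth: I haven't used this, and the examples seem to be for the 1D case. On visualisation, I suggest starting with 2D input data rather than 3D. The code you put at that link doesn't have the correct imports etc., so it won't run at all.

Hi, can you help me display the bandwidth matrix from your code? I tried kde.factor, but it gives me a float. For the multivariate (3D) case, shouldn't it show a 3x3 bandwidth matrix? Thanks.

You also need to look at kde.covariance.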
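On the "use my own bandwidth" question from the comments, one possible approach that stays within scipy is to score a handful of candidate factors by leave-one-out log-likelihood and keep the best. This is only a sketch under my own assumptions: the helper `loo_log_likelihood` and the candidate range are hypothetical, and it is not the scikit-learn grid search the question refers to.

```python
import numpy as np
from scipy import stats

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]]).T   # shape (3, 9)


def loo_log_likelihood(dataset, factor):
    """Sum of log densities of each point under a KDE fitted to all the others."""
    total = 0.0
    for i in range(dataset.shape[1]):
        train = np.delete(dataset, i, axis=1)     # all points except point i
        held_out = dataset[:, i:i + 1]            # point i, kept as shape (d, 1)
        kde_i = stats.gaussian_kde(train, bw_method=factor)
        total += kde_i.logpdf(held_out)[0]
    return total


# Hypothetical candidate factors; a scalar bw_method is used as kde.factor directly.
candidates = np.linspace(0.2, 1.5, 14)
scores = [loo_log_likelihood(data, f) for f in candidates]
best_factor = candidates[int(np.argmax(scores))]

# Refit on all of the data with the selected factor.
kde = stats.gaussian_kde(data, bw_method=best_factor)
print(best_factor, kde.factor)
```

Note that the factor scales the covariance of whatever data the KDE is fitted to, so each fold uses a slightly different kernel. With only 9 points, two of them duplicated, the selected value should not be taken too seriously.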

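kde.factor is a scalar. Its square multiplies the covariance matrix of the data to give the kernel covariance matrix, or bandwidth matrix, which I think is what you want to see (and what you called Σ above). In SciPy that scaled matrix is already stored as kde.covariance, so there is no need to multiply the two attributes yourself. This is detailed at the bottom of the linked documentation.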
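A quick way to look at both attributes for the 3D sample data is sketched below. It also checks how they relate to the plain data covariance; as far as I understand the SciPy implementation, `kde.covariance` is the data covariance scaled by `kde.factor ** 2`. The names `data_cov` and `matches` are my own.

```python
import numpy as np
from scipy import stats

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]]).T

kde = stats.gaussian_kde(data)

# Scalar bandwidth factor (Scott's rule by default)
print(kde.factor)

# 3x3 kernel covariance ("bandwidth matrix") actually used by the estimator
print(kde.covariance)

# Plain sample covariance of the data (rows are variables, unbiased estimate)
data_cov = np.cov(data)

# Check the relationship: kernel covariance = data covariance * factor ** 2
matches = np.allclose(kde.covariance, data_cov * kde.factor ** 2)
print(matches)
```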
