Algorithm: Kernel density estimation

Tags: algorithm, machine-learning, julia

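I am trying to implement a kernel density estimation. However, my code does not give the answer it should. It is written in Julia, but the code should be self-explanatory.

Here is the algorithm:

[formula image not preserved]

where

[formula image not preserved]

So the algorithm tests whether the distance between x and an observation x_i, weighted by some constant factor (the binwidth), is less than 1. If so, it assigns 0.5 / (n * h) to that value, where n is the number of observations.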
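The description above matches the usual uniform-kernel density estimator. The following is a reconstruction from that description, not the formulas from the original post; the symbols \hat{f}_h and K are assumed names:

$$
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) =
\begin{cases}
\tfrac{1}{2} & \text{if } |u| \le 1, \\
0 & \text{otherwise.}
\end{cases}
$$

Each observation within distance h of x therefore contributes 0.5 / (n * h) to the estimate at x.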

Here is my implementation:

#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
#@param data: observations for which the pdf should be estimated
#@return: returns an array with the estimated densities

function kernelDensity(data)

    #Uniform kernel function.
    #@param x: current x value
    #@param observation: x value of observation i
    #@param width: binwidth
    #@return: returns 1 if the absolute distance from
    #x (current) to x (observation), weighted by the binwidth,
    #is at most 1. Else it returns 0.

    function uniformKernel(x, observation, width)
        u = ( x - observation ) / width
        abs( u ) <= 1 ? 1 : 0
    end

    #number of observations in the data set
    n = length(data)

    #binwidth (set arbitrarily to 0.1)
    h = 0.1

    #vector that stores the pdf
    res = zeros( Real, n )

    #counter variable for the loop
    counter = 0

    #lower and upper limit of the x axis
    start = floor(minimum(data))
    stop = ceil(maximum(data))

    #main loop
    #@linspace: returns n equally spaced points
    #from start to stop
    for x in linspace(start, stop, n)
        counter += 1
        for observation in data

            #count all observations for which the kernel
            #returns 1 and multiply by 0.5 because the
            #kernel computed the absolute difference, which can be
            #either positive or negative
            res[counter] += 0.5 * uniformKernel(x, observation, h)
        end
        #divide by n times h
        res[counter] /= n * h
    end
    #return results
    res
end
#run function
#@rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))
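The values sum to 8.5 (the cumulative distribution function should be 1).

So there are two bugs:

  • The values are not scaled properly. Each number should be around one tenth of its current value. In fact, if the number of observations increases by a factor of 10^n, n = 1, 2, ..., then the cdf also increases by a factor of 10^n. For example:

    > kernelDensity(rand(1000))
    > 953.53

  • They do not sum to 10 (or to 1, if it were not for the scaling error). The error becomes more evident as the sample size increases: approximately 5% of the observations are not included.

I believe I implemented the formula 1:1, so I really do not understand where the error is.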

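I am not an expert on KDE, so take this with a grain of salt, but a very similar (and much faster!) implementation of your code would be:

    function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
        n = length(data)
        # zeros, not similar: the accumulator must start at 0
        res = zeros(T, n)
        lb = minimum(data); ub = maximum(data)
        for (i, x) in enumerate(linspace(lb, ub, n))
            for obs in data
                res[i] += abs((obs - x) / h) <= 1. ? 0.5 : 0.
            end
            res[i] /= (n * h)
        end
        sum(res)
    end
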

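To point out the mistake: you have n bins B_i of size 2h covering [0,1], and a random point X lands in an expected number of bins [formula image not preserved]. You divide by 2nh.

For n points, the expected value of your function is [formula image not preserved].

Actually, you have some bins of size less than 2h (for example, if start = 0, half of the first bin lies outside [0,1]); factoring this in gives the bias.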
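The arithmetic behind this answer can be sketched as follows. This is a reconstruction from the surrounding text, not the answerer's original formulas; it assumes the data are uniform on [0,1], writes x_j for the n grid points, and ignores the boundary bins:

$$
\mathbb{E}\big[\#\{\, j : X \in B_j \,\}\big] \approx 2h \cdot n,
\qquad
\mathbb{E}\Big[\sum_{j=1}^{n} \hat{f}(x_j)\Big] \approx n \cdot 2nh \cdot \frac{1}{2nh} = n .
$$

Under these assumptions the summed output grows linearly with n, which is consistent with the sums of roughly 10 and roughly 1000 reported in the question.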


    编辑:顺便说一句,如果你假设箱子在[0,1]中有随机位置,那么偏差很容易计算。然后箱子平均缺少h/2=其大小的5%。
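Both effects discussed above can be checked numerically. The following is a minimal sketch, assuming Julia 1.1 or later (where `range(a, b, length=n)` replaces `linspace`); the function name `uniform_kde` and the grid sizes are illustrative choices, not part of the original thread. It evaluates the same uniform-kernel estimate on a grid and then compares the plain sum of the values with a Riemann sum that multiplies by the grid spacing.

```julia
using Random

# Uniform-kernel density estimate evaluated at every point of `grid`.
# `uniform_kde` is a hypothetical name used only for this sketch.
function uniform_kde(data::AbstractVector{<:Real}, grid::AbstractVector{<:Real}, h::Real)
    n = length(data)
    dens = zeros(Float64, length(grid))
    for (i, x) in enumerate(grid)
        for obs in data
            # each observation within distance h of x contributes 0.5
            dens[i] += abs((x - obs) / h) <= 1 ? 0.5 : 0.0
        end
        dens[i] /= n * h
    end
    return dens
end

Random.seed!(1)
n = 1000
h = 0.1
data = rand(n)

# Grid as in the question: n points from floor(min) to ceil(max).
grid0 = range(floor(minimum(data)), ceil(maximum(data)), length=n)
dens0 = uniform_kde(data, grid0, h)
ratio = sum(dens0) / n          # plain sum grows like n; ratio is roughly 0.95

# Grid extended by h on both sides, so no kernel mass falls outside it.
grid1 = range(minimum(data) - h, maximum(data) + h, length=2001)
dens1 = uniform_kde(data, grid1, h)
integral = sum(dens1) * step(grid1)   # Riemann sum, close to 1

println("sum / n on [0,1] grid: ", ratio)
println("Riemann sum on extended grid: ", integral)
```

Multiplying by the grid spacing removes the factor of n, and extending the grid by h on each side recovers the kernel mass that the [0,1] grid cuts off at the boundaries.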

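Thank you for the code and for the library; I had not found it. And yes, you are right! I did not take the first half into account.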