R: Why is the behaviour of MASS::lda non-deterministic?

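I've recently been looking at the behaviour of lda in detail and found that predict.lda returns non-deterministic classes for observations close to the decision boundary. At first I thought this might be a numerical precision issue, but the predicted data appear to be on the order of 1e-6 from the decision boundary, which is far coarser than double precision. I've written a minimal(ish) example, see below: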

# Fit an LDA model to a subset of Fisher's Iris data
x = as.matrix(iris[iris$Species != 'setosa', 1:4])
y = as.factor(as.character(iris[iris$Species != 'setosa', 'Species']))
m = MASS::lda(x, y)

# Generate data near the decision boundary
d = m$scaling
ord = order(x %*% d)
y.pred = MASS:::predict.lda(m, newdata = x)$class
ind = min(which(y.pred[ord] == 'virginica'))
# Interpolate between the two data points on either side of the decision boundary
s = seq(0, 1, length.out = 100001)
s = s[47479:47484] # Zoom on the decision boundary
x.test = (as.matrix(s) %*% t(x[ord[ind - 1], ])) + (as.matrix(1 - s) %*% t(x[ord[ind], ]))

# running predict.lda() on x.test seems to generate non-deterministic results.
# set.seed(123) # set.seed here seems to remove the non-determinism.
for (i in 1:10) {
  y.pred = MASS:::predict.lda(m, newdata = x.test)$class
  print(as.character(y.pred))
}

In case it makes a difference, here is the sessionInfo output:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8   
 [6] LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] MASS_7.3-51.4  compiler_3.6.1 tools_3.6.1

The predict method looks for ties and breaks them by random selection. That is why setting the random seed makes the result deterministic.
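As a rough sketch of what this means in practice, the snippet below shows two ways to get repeatable classes: fixing the seed before each call, or taking the exact arg-max of the posterior matrix that predict returns. The helper name predict_class_exact is hypothetical (it is not part of MASS), and the sketch assumes the MASS package is installed.

```r
library(MASS)

# Same model as in the question
keep <- iris$Species != "setosa"
x <- as.matrix(iris[keep, 1:4])
y <- factor(as.character(iris$Species[keep]))
m <- lda(x, y)

# Option 1: fix the seed immediately before each prediction
set.seed(123)
seeded_a <- predict(m, newdata = x)$class
set.seed(123)
seeded_b <- predict(m, newdata = x)$class
identical(seeded_a, seeded_b)

# Option 2 (hypothetical helper, not part of MASS): exact arg-max of the
# posterior, so no random tie-breaking is involved in choosing the class
predict_class_exact <- function(model, newdata) {
  post <- predict(model, newdata = newdata)$posterior
  factor(colnames(post)[apply(post, 1, which.max)], levels = model$lev)
}

run1 <- predict_class_exact(m, x)
run2 <- predict_class_exact(m, x)
identical(run1, run2)
```

Option 2 does not depend on the state of the random number generator at all, which may be preferable when the surrounding code also draws random numbers.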

To find the source of the randomness, I edited a copy of MASS:::predict.lda to add lines like this:

cat("after cl:"); print(rnorm(1)); set.seed(123); print(rnorm(1)); set.seed(123)

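If the random number generator has been used since the previous such line, the first rnorm(1) value printed will differ from the value obtained immediately after a call to set.seed(123).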
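The same probing idea can be sketched as a small stand-alone check that compares the generator state before and after an expression is evaluated. The function uses_rng is a hypothetical helper written for this illustration, and note that it resets the seed as a side effect.

```r
# Hypothetical helper: TRUE if evaluating `expr` changes the RNG state
uses_rng <- function(expr) {
  set.seed(123)                                    # put the RNG in a known state
  before <- get(".Random.seed", envir = globalenv())
  force(expr)                                      # the lazy argument is evaluated here
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

no_draw   <- uses_rng(sum(1:10))                          # pure arithmetic
with_draw <- uses_rng(runif(1))                           # explicit random draw
tie_draw  <- uses_rng(max.col(matrix(c(0.5, 0.5), nrow = 1)))  # exact tie in one row

print(c(no_draw = no_draw, with_draw = with_draw, tie_draw = tie_draw))
```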

Here is part of the edited predict.lda source code that illustrates the issue:

posterior <- dist/drop(dist %*% rep(1, ng))
cat("after posterior:"); print(rnorm(1)); set.seed(123); print(rnorm(1)); set.seed(123)

nm <- names(object$prior)
cl <- factor(nm[max.col(posterior)], levels = object$lev)
cat("after cl:"); print(rnorm(1)); set.seed(123); print(rnorm(1)); set.seed(123)

Then, if you look at the help for max.col, which is used in the computation of cl, you can see:

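When ties.method = "random", as per default, ties are broken at random. In this case, the determination of a tie assumes that the entries are probabilities: there is a relative tolerance of 1e-5, relative to the largest (in magnitude, omitting infinity) entry in the row.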
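This tolerance would also explain why differences of roughly the size mentioned in the question still count as ties: they are large compared with double-precision rounding error but small compared with 1e-5. A minimal sketch with a made-up posterior matrix (the numbers are illustrative, not taken from the iris model):

```r
# Two made-up posterior rows: one clear-cut, one inside max.col's tie tolerance
posterior <- rbind(c(0.9, 0.1),
                   c(0.5000001, 0.4999999))

# Default ties.method = "random": the second row is treated as a tie,
# so repeated calls can return either column
random_picks <- replicate(200, max.col(posterior)[2])
print(table(random_picks))

# Deterministic alternatives that compare the entries exactly
first_pick <- max.col(posterior, ties.method = "first")
exact_pick <- apply(posterior, 1, which.max)
print(first_pick)
print(exact_pick)
```

With ties.method = "first" or which.max, the second row always resolves to column 1, because 0.5000001 is strictly larger than 0.4999999.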

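Output of one call to the edited predict.lda:

after posterior:[1] -0.5604756
[1] -0.5604756
after cl:[1] 1.558708
[1] -0.5604756
[1] "virginica"  "virginica"  "versicolor" "virginica" 
[5] "versicolor" "versicolor"

The two values printed after "after posterior:" match, so no random numbers had been drawn up to that point. The two values printed after "after cl:" differ, so the random number generator was used while cl was being computed.

Great answer! Thanks very much.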