Applying MASS::fitdistr to multiple data by factor

My question is in bold at the end.

I know how to fit a beta distribution to some data. For example:

library(Lahman)
library(dplyr)

# clean up the data and calculate batting averages by playerID
batting_by_decade <- Batting %>%
  filter(AB > 0) %>%
  group_by(playerID, Decade = round(yearID - 5, -1)) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  ungroup() %>%
  filter(AB > 500) %>%
  mutate(average = H / AB)

# fit the beta distribution
library(MASS)
m <- MASS::fitdistr(batting_by_decade$average, dbeta,
                    start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

# plot the histogram of data and the beta distribution
library(ggplot2)
ggplot(batting_by_decade) +
  geom_histogram(aes(average, y = ..density..), binwidth = .005) +
  stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
                size = 1) +
  xlab("Batting average")

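I could hard-code this by filtering for each decade and passing that decade's data into the fitdistr function, repeating this for every decade. But is there a way to quickly and reproducibly calculate all the beta parameters for each decade, perhaps using one of the apply functions?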

You can use summarise together with two custom functions to achieve this:

getAlphaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[1]}

getBetaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[2]}

batting_by_decade %>%
  group_by(Decade) %>%
  summarise(alpha = getAlphaEstimate(average),
         beta = getBetaEstimate(average)) -> decadeParameters

However, according to a post by Hadley, you will not be able to plot it with stat_summary.
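Both helpers above pass the same fixed starting values (shape1 = 1, shape2 = 10) to every group. If a group turns out to be hard to fit from there, one option is to derive the starting values from the group's own mean and variance (method of moments). The sketch below is not from the original answers: beta_start is a hypothetical helper, and the data are simulated with rbeta() rather than taken from Lahman.

```r
library(MASS)

# Hypothetical helper (not part of the original answers): method-of-moments
# starting values for a beta distribution, computed from the sample itself.
beta_start <- function(x) {
  m <- mean(x)
  v <- var(x)
  k <- m * (1 - m) / v - 1   # equals shape1 + shape2 under the moment equations
  list(shape1 = m * k, shape2 = (1 - m) * k)
}

set.seed(1)
x <- rbeta(500, shape1 = 4, shape2 = 10)  # simulated data, parameters chosen arbitrarily

start <- beta_start(x)
fit <- fitdistr(x, dbeta, start = start)
fit$estimate
```

Inside summarise, such a helper could replace the fixed list, for example start = beta_start(x) in getAlphaEstimate, assuming every group has enough observations for a meaningful variance.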

Here is an apply solution, though I prefer @CMichael's dplyr solution:

calc_beta <- function(decade){
  dummy <- batting_by_decade %>% 
    dplyr::filter(Decade == decade) %>% 
    dplyr::select(average)

  m <- fitdistr(dummy$average, dbeta, start = list(shape1 = 1, shape2 = 10))

  alpha0 <- m$estimate[1]
  beta0 <- m$estimate[2]

  return(c(alpha0,beta0))
}

decade <- seq(1870, 2010, by =10)
params <- sapply(decade, calc_beta)
colnames(params) <- decade
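The same idea can be written with split() and sapply() from base R, so that the groups come from the data instead of a hard-coded seq(). This is a minimal sketch on simulated data; sim and fit_beta are made-up names, and the data frame only mimics the Decade and average columns of batting_by_decade.

```r
library(MASS)

set.seed(42)
# Simulated stand-in for batting_by_decade: three groups with known parameters
sim <- data.frame(
  Decade  = rep(c(1990, 2000, 2010), each = 300),
  average = c(rbeta(300, 4, 10), rbeta(300, 5, 12), rbeta(300, 6, 9))
)

# Fit once per group and return both estimates as a named vector
fit_beta <- function(x) {
  fitdistr(x, dbeta, start = list(shape1 = 1, shape2 = 10))$estimate
}

# split() builds one vector per decade that actually occurs in the data
by_decade <- split(sim$average, sim$Decade)

# sapply() returns a 2 x n matrix; transpose it to get one row per decade
params <- t(sapply(by_decade, fit_beta))
params_df <- data.frame(Decade = as.numeric(rownames(params)), params,
                        row.names = NULL)
params_df
```

Because the groups are taken from split(), a decade with no rows never reaches fitdistr, which a fixed sequence of decades cannot guarantee.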

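Now we just need to unlist the second column (the list column created by the getAlphaBeta version further down) in a nice way.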

Here is an example that goes all the way from generating dummy data to plotting:

temp.df <- data_frame(yr = 10*187:190,
                      al = rnorm(length(yr), mean = 4, sd = 2),
                      be = rnorm(length(yr), mean = 10, sd = 2)) %>% 
  group_by(yr, al, be) %>% 
  do(data_frame(dats = rbeta(100, .$al, .$be)))

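This is the bulk of your problem, and the way you solved it is very nice. If you step through the temp.ests snippet (further down) line by line, you will see a data frame with a column of type list, where each row holds a numeric vector of two estimates. unnest() splits those two numbers into separate rows; we then identify them by adding a column that reads "a, b, a, b, ..." and use spread() to separate them again, giving two columns with one row per year. Here you can also see how closely fitdistr matched the true population we sampled from by comparing a with al and b with be.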
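To make the comparison between estimate and truth concrete, here is a small sketch using only MASS and simulated data. The true parameters (4 and 10) and the sample size are arbitrary choices for illustration; the standard errors come from the sd component of the fitdistr result.

```r
library(MASS)

set.seed(123)
true_shape <- c(shape1 = 4, shape2 = 10)   # assumed "population" parameters
x <- rbeta(2000, true_shape["shape1"], true_shape["shape2"])

fit <- fitdistr(x, dbeta, start = list(shape1 = 1, shape2 = 1))

# Estimates, their standard errors, and the distance from the truth in SE units
comparison <- data.frame(
  truth    = true_shape,
  estimate = fit$estimate,
  se       = fit$sd
)
comparison$z <- (comparison$estimate - comparison$truth) / comparison$se
comparison
```

Values of z that are small in absolute terms suggest the estimate is consistent with the parameters the sample was drawn from.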


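Finally, we put it all together. We plot a histogram of the sample data, add one line from the curve data for the true density and another for our estimated density, then add labels from the parameter-estimate data to show the sample parameters, and facet by year.

I really like this answer. It is much more elegant than what I did (see below). Thank you, Michael! I also did not know you could end a pipe with an assignment. Very cool.

Thank you. I remember being startled the first time one of my students used an assignment at the end of a pipe; I find it really elegant. Also, I feel there should be a way to avoid the repeated fitdistr calls in the code, which could be costly in a big-data scenario, but I could not think of one ;) Although the Stack Overflow Documentation on pipes has been discontinued, it had a nice section on pipe variants.

I had an idea for avoiding the repeated fitdistr call and have just added it to my post. The only thing missing is unlisting the second column of the data.frame, which holds the list return values.

After that, you could look at the broom package for handling many models. Hadley's R4DS has a very good chapter on this: basically, you manage list columns all the way through.

Very nice. I am on chapter 5 right now, but I will come back to this post when I reach chapter 25.

For the unlisting, you can use tidyr::unnest().

The version that avoids the repeated fitdistr call:

getAlphaBeta = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate}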

batting_by_decade %>%
  group_by(Decade) %>%
  summarise(params = list(getAlphaBeta(average))) -> decadeParameters

decadeParameters$params[1] # it works!
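One way to flatten such a list column without extra packages is to rbind the per-group vectors into a matrix. The sketch below uses a hand-made stand-in called demo with invented numbers; with the real decadeParameters the same two lines at the end should apply, assuming every element of params is a named vector of length two.

```r
# Stand-in for decadeParameters: one row per decade plus a list column
# holding a named vector of two estimates (all values are made up).
demo <- data.frame(Decade = c(1990, 2000, 2010))
demo$params <- list(
  c(shape1 = 4.1, shape2 = 10.3),
  c(shape1 = 5.0, shape2 = 12.2),
  c(shape1 = 6.2, shape2 = 8.9)
)

# Stack the vectors into a matrix with one row per decade, then bind it
# back onto the grouping column
est  <- do.call(rbind, demo$params)
flat <- cbind(demo["Decade"], est)
flat
```

The result has one row per decade and separate shape1 and shape2 columns, which is the shape the tidyr::unnest() plus spread() route arrives at as well.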

temp.ests <- temp.df %>% 
  group_by(yr, al, be) %>% 
  summarise(ests = list(MASS::fitdistr(dats, dbeta, start = list(shape1 = 1, shape2 = 1))$estimate)) %>% 
  unnest %>% 
  mutate(param = rep(letters[1:2], length(ests)/2)) %>% 
  spread(key = param, value = ests)

temp.curves <- temp.ests %>% 
  group_by(yr, al, be, a, b) %>% 
  do(data_frame(prop = 1:99/100,
                trueden = dbeta(prop, .$al, .$be),
                estden = dbeta(prop, .$a, .$b)))

ggplot() +
  geom_histogram(data = temp.df, aes(dats, y = ..density..), colour = "black", fill = "white") +
  geom_line(data = temp.curves, aes(prop, trueden, color = "population"), size = 1) +
  geom_line(data = temp.curves, aes(prop, estden, color = "sample"), size = 1) +
  geom_text(data = temp.ests, 
            aes(1, 2, label = paste("hat(alpha)==", round(a, 2))), 
            parse = T, hjust = 1) +
  geom_text(data = temp.ests, 
            aes(1, 1, label = paste("hat(beta)==", round(b, 2))), 
            parse = T, hjust = 1) +
  facet_wrap(~yr)