Should days of the year with zero counts be added? (Crime analysis in R)

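I am analyzing crime in one district of Baltimore (5 years of data). I am creating line charts for specific crime types in specific neighborhoods within the district. However, not every crime type is reported in every neighborhood every day, so the data contain no days with a count of zero: a date appears in the data only if a crime was recorded on it. Visually, this means the line in the chart never drops down to zero. Does this negatively affect the trend line created by stat_smooth, which I am using to judge whether a crime type is increasing or decreasing?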

Reproducible code to generate the line chart:

#Read crime data from GitHub repo into a R dataframe
df = read.csv("https://raw.githubusercontent.com/brianthomasbaker/Baltimore-Crime-Analysis/master/Baltimore_SE_Reported_Crime_2010_to_2014.csv", stringsAsFactors=FALSE, sep=",")

#Format CrimeDate column
df$CrimeDate = as.Date(df$CrimeDate, "%m/%d/%Y")

#Create new dataframe of only Larceny From Auto crimes by Day of the Year in Canton (2010-2014)
library(dplyr)
df_cantonlarcauto = df %>%
  filter(Neighborhood == "Canton", Description == "LARCENY FROM AUTO") %>%
  group_by(CrimeDate) %>%
  summarize(crimes = n())

#Create Line Chart using ggplot
library(ggplot2)
ggplot(df_cantonlarcauto, aes(x = CrimeDate, y = crimes, group=1)) +
  geom_line() +
  scale_size_area() +
  stat_smooth(method = "gam") +
  xlab("Year") +
  ylab("Number of Crimes") +
  ylim(0,13) +
  theme(plot.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=32, hjust=0)) +
  theme(axis.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=22)) +
  ggtitle("Larceny From Auto\nCanton (2010-2014)")

head(df_cantonlarcauto)

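You can see in the head of the data frame that January 2 and 3 are missing. Should the missing days, with zero counts for those days, be added to the data? If so, how can that be done in R? Or does omitting those days not negatively affect an attempt to analyze the crime data?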

You can create a complete sequence of dates and add NAs for the dates on which no crime was recorded. Here is a quick-and-dirty way to do it:

xy <- data.frame(CrimeDate = seq(df_cantonlarcauto$CrimeDate[1], to = df_cantonlarcauto$CrimeDate[nrow(df_cantonlarcauto)], by = 1))
xy <- merge(xy, df_cantonlarcauto, all.x = TRUE)

ggplot(xy, aes(x = CrimeDate, y = crimes, group=1)) +
    geom_line() +
    scale_size_area() +
    stat_smooth(method = "gam") +
    xlab("Year") +
    ylab("Number of Crimes") +
    ylim(0,13) +
    theme(plot.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=32, hjust=0)) +
    theme(axis.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=22)) +
    ggtitle("Larceny From Auto\nCanton (2010-2014)")
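
To see concretely what the merge does, here is a minimal base-R sketch on a made-up five-day example. The `toy` data frame and its counts are hypothetical and are not taken from the Baltimore data. It shows the unrecorded days coming back as NA, and how they could be turned into explicit zeros if that is what the analysis calls for.

```r
# Hypothetical toy data: crimes were recorded on only 3 of 5 days
toy <- data.frame(
  CrimeDate = as.Date(c("2010-01-01", "2010-01-04", "2010-01-05")),
  crimes    = c(2L, 1L, 3L)
)

# Full daily sequence from the first to the last observed date
all_days <- data.frame(
  CrimeDate = seq(min(toy$CrimeDate), max(toy$CrimeDate), by = "day")
)

# Left join: days without a record get NA in the crimes column
filled <- merge(all_days, toy, by = "CrimeDate", all.x = TRUE)

# Optional step: treat "no record" as "zero crimes"
filled$crimes_zero <- ifelse(is.na(filled$crimes), 0L, filled$crimes)

print(filled)
```

Keeping both columns side by side makes it easy to compare how a plot or a smoother behaves under each treatment of the unrecorded days.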

You can add the missing dates this way:
library(dplyr)
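# Build the full daily date range, then left-join the observed counts (days without a record become NA)
df_cantonlarcauto_missing = tibble(CrimeDate = seq(min(df_cantonlarcauto$CrimeDate), max(df_cantonlarcauto$CrimeDate), by = "day")) %>% 
  left_join(df_cantonlarcauto, by = "CrimeDate")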

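If you plot with this data frame (ggplot(df_cantonlarcauto_missing, aes(x = CrimeDate, y = crimes, group = 1)) + ...), you should already see a better-looking plot.
I don't know these data, but my personal suggestion would be to force the missing dates to 0 and then apply some kind of aggregation (such as a weekly rolling average), because the values are very low and are frequently missing/0: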
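library(zoo)  # rollmean() comes from the zoo package
df_cantonlarcauto_missing = tibble(CrimeDate = seq(min(df_cantonlarcauto$CrimeDate), max(df_cantonlarcauto$CrimeDate), by = "day")) %>% 
  left_join(df_cantonlarcauto, by = "CrimeDate") %>% 
  mutate(crimes = ifelse(is.na(crimes), 0, crimes)) %>% 
  # 7-day trailing mean; the first 6 days lack a full week of history, so pad them with NA
  mutate(crimes = c(rep(NA, 6), rollmean(crimes, 7, align = "right")))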

ggplot(df_cantonlarcauto_missing, aes(x = CrimeDate, y = crimes, group=1)) +
  geom_line() +
  scale_size_area() +
  stat_smooth(method = "gam") +
  xlab("Year") +
  ylab("Number of Crimes") +
  # ylim(0,13) +
  theme(plot.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=32, hjust=0)) +
  theme(axis.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=22)) +
  ggtitle("Larceny From Auto\nCanton (2010-2014)")
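
Why the first six values are padded with NA can be seen in a small base-R sketch. It uses `stats::filter` rather than `zoo::rollmean` purely to avoid an extra dependency, and the `counts` vector is invented for illustration.

```r
# Hypothetical zero-filled daily counts for 10 consecutive days
counts <- c(0, 2, 0, 0, 1, 0, 4, 0, 0, 3)

# Trailing 7-day mean: each value averages the current day and the 6 days before it.
# stats:: is spelled out because dplyr masks filter() once it is loaded.
roll7 <- as.numeric(stats::filter(counts, rep(1 / 7, 7), sides = 1))

# The first 6 positions have fewer than 7 days of history, so they are NA
print(round(roll7, 3))
```

From the seventh position onward every value is the mean of a full week, which is the same alignment as `rollmean(crimes, 7, align = "right")` preceded by six NAs.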

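Great minds think alike, and yours was 26 seconds faster than mine :-)

With the missing dates added as NA, stat_smooth will drop the rows that contain missing values.

Thanks. Implementing the rolling weekly average produces (in this case) only 6 rows containing missing values.

@LorenzoRossi For some reason I thought mutate(crimes = c(rep(NA, 6), rollmean(crimes, 7, align = "right"))) would force the first 6 rows to be overwritten with 0. Is there a way around this?

A weekly rolling average is the average over 7 consecutive days. For the first 6 dates in the data frame, therefore, no valid rolling average exists, because there are not yet enough dates to compute one. From day 7 onward there is enough history to calculate the rolling average correctly. That is why, in my mutate, I explicitly force the first 6 values to NA.

I think this may be worse than you think: the line never drops to 0; instead it is filled in between the values of 1. If you set scale_y_continuous(breaks = 0:10), you will see what I mean. Using stat_smooth and predict(gam(…)), it seems that NA or missing values seriously affect the smoothing. From a data-analysis standpoint, I think you have to fill the missing values with 0, because they are not missing: they were observed with a value of 0.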
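
The effect described in the last comment can be checked on simulated data. The sketch below is an illustration under assumptions, not a reanalysis of the Baltimore file: the counts are drawn from a Poisson distribution with a made-up rate, and `mgcv::gam` is called directly with its default Gaussian family. A fit on the recorded days only never sees a zero, so its average fitted level cannot fall below 1, while the zero-filled fit tracks the actual daily average.

```r
library(mgcv)  # provides gam(); a recommended package shipped with standard R installs

set.seed(42)

# Simulated year of daily counts with a low, hypothetical rate
full <- data.frame(day = 1:365, crimes = rpois(365, lambda = 0.5))

# What the original data look like: only days with at least one crime
recorded <- full[full$crimes > 0, ]

# Same smoother fitted to both versions of the data
fit_recorded <- gam(crimes ~ s(day), data = recorded)
fit_full     <- gam(crimes ~ s(day), data = full)

# Average fitted level under each treatment of the zero days
print(c(recorded_only = mean(fitted(fit_recorded)),
        zero_filled   = mean(fitted(fit_full))))
```

For count data a Poisson family (`family = poisson`) may be a more natural choice; the default is used here only to keep the comparison simple.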