Using apply in R to compute statistics for specific rows of a large data frame within a given time window


r for-loop


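Edit:

I have edited the question to include more information and to explain why I cannot simply use a plain sum/mean function:

I need to compute a sum over a specific subset of rows within a certain time window. A for loop works fine, but my dataset is large and the for loop is not efficient, so I tried the apply function and could not get it to work.

Below is a working example with a simple dataset. For each row, I need the sum of value for that row's category, counting only the rows of that category that fall within the preceding hour, and I want to add it to the dataset as a new column. I am using Unix timestamps to keep things simple and purely numeric.

(Note: what I actually want to know is the average delay of all trains that travelled between station A and station B during the past hour, just to put this into context.)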

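Note that the first row of category a occurs at 12:05 pm, the second at 12:00 pm, and the third at 8 am. So when I compute the value for the preceding hour, the 12:00 pm row and the 8 am row have no other a row in their preceding hour, whereas the 12:05 pm row has itself and one other a row within the preceding 5 minutes, so its sum has to be computed from the first and second a rows.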
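The code that builds the example data did not survive in this copy of the question. A minimal reconstruction follows, assuming the inputs match the result table shown further down. The column names category, time and value are taken from that table; the clock variable is only an illustration and is not part of the question.

```r
# Reconstructed example data (an assumption: the values are read off the
# result table in this thread, not copied from the original question).
d <- data.frame(
  category = c("a", "b", "a", "c", "a"),
  time     = c(1444305900, 1444306587, 1444305600, 1444291200, 1444291900),
  value    = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE  # keep category as character, not factor
)

# Convert the Unix seconds to UTC clock times to check the window logic by eye.
clock <- format(as.POSIXct(d$time, origin = "1970-01-01", tz = "UTC"), "%H:%M:%S")
print(cbind(d, clock))
```

The time zone of the original data is not stated, so UTC is a guess here.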

I then run a for loop that computes the sum for each row's category and writes it into a sum column created beforehand:

# Run it as a loop (filter() comes from dplyr):
library(dplyr)
for (i in seq_len(nrow(d))) {
  pickcategory <- as.character(d$category[i])  # category of the current row
  pickunix <- d$time[i]                        # timestamp of the current row
  # keep only the rows of the data frame whose category contains this one
  filterrows <- filter(d, grepl(pickcategory, category, fixed = TRUE))
  # keep only the rows within the hour up to and including the current timestamp
  filterhour <- filterrows[filterrows$time <= pickunix & filterrows$time > (pickunix - 3600), ]
  d$sum[i] <- sum(filterhour$value)            # store the windowed sum in this row
}

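However, this is very slow for 100 million rows, so I tried apply, without success. All the tutorials on apply are very simple and do not use custom-written functions, so I could not figure it out.

It already goes wrong on the first line, pickcategory, because it selects the entire column rather than the value from the row I want to run the function on.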

testfunction = function(d){
  pickcategory = d$category   # intended: the category to filter on (here d is the whole data frame, so this is the entire column)
  pickunix = d$time
  filterrows = filter(d, grepl(pickcategory,category)) # subset the entire data frame to the rows containing this category
  filterhour = filterrows[filterrows$time <= pickunix & filterrows$time > (pickunix-3600),] # subset for the previous hour
  getsum = sum(filterhour$value)  # sum of value for that category
  d$sum = getsum # add that sum to the row
  return(d)
  }

output = apply(d, 1, function(x) testfunction(d))


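Can someone tell me how to turn my for loop into a working apply function?

Please note that my real case does not compute a sum but something more complex, so the solution needs to work for any kind of calculation and category selection I might want to do.

Any help would be much appreciated.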
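One way to read the failure described above: apply(d, 1, ...) hands each row to the anonymous function as x (coerced to a character vector, because the row mixes text and numbers), but the call ignores x and passes the whole data frame d to testfunction, so d$category is always the full column. A minimal sketch that iterates over row indices instead is shown below. The helper name window_stat, its fun and width arguments, and the sample data are hypothetical and not from the question; exact matching with == replaces grepl on the assumption that categories are matched whole.

```r
# Sample data (assumption: reconstructed from the result table in this thread).
d <- data.frame(
  category = c("a", "b", "a", "c", "a"),
  time     = c(1444305900, 1444306587, 1444305600, 1444291200, 1444291900),
  value    = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summary of `value` for row i's category over the
# window (time[i] - width, time[i]]. `fun` can be any summary function.
window_stat <- function(i, d, fun = sum, width = 3600) {
  in_window <- d$category == d$category[i] &
    d$time <= d$time[i] &
    d$time > d$time[i] - width
  fun(d$value[in_window])
}

# vapply over row indices replaces the for loop and checks the result type.
d$sum <- vapply(seq_len(nrow(d)), window_stat, numeric(1), d = d)
print(d)
```

Passing fun = mean gives the mean over the same window. This version still scans every row once per row, so it is tidier than the loop but is unlikely to be much faster on very large data.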

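Try filtering as you fill in d$sum; something along those lines should do it.

This is just a simple "sum/mean by group" problem. If you need it to be really fast, try data.table.

Just use d %>% group_by(category) %>% mutate(sum = sum(value))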
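Note that a plain grouped sum adds up every row of a category regardless of time, so it does not reproduce the one-hour window from the question. If the window matters, a non-equi self-join in data.table is one possible approach. The sketch below assumes the data.table package is installed and uses sample data reconstructed from the result table in this thread; the names dt, windows, start, end and total are illustrative.

```r
library(data.table)

# Sample data (assumption: reconstructed from the result table in this thread).
dt <- data.table(
  category = c("a", "b", "a", "c", "a"),
  time     = c(1444305900, 1444306587, 1444305600, 1444291200, 1444291900),
  value    = c(1, 2, 3, 4, 5)
)

# Plain grouped sum: ignores time, so every "a" row gets 1 + 3 + 5 = 9.
dt[, total := sum(value), by = category]

# One lookup row per original row, describing the window (time - 3600, time].
windows <- dt[, .(category, start = time - 3600, end = time)]

# Non-equi join: for each window, sum the values of same-category rows whose
# time falls inside it. by = .EACHI returns one result per row of `windows`.
res <- dt[windows,
          on = .(category, time > start, time <= end),
          .(sum = sum(value)),
          by = .EACHI]

dt[, sum := res$sum]
print(dt)
```

Replacing sum(value) in the join with another summary, such as mean(value), keeps the same window logic.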

Side note: to avoid having to drop the factors manually, add stringsAsFactors=FALSE to your data.frame call. By the way, it is unwise to give a column the same name as a function.

  category       time value sum
1        a 1444305900     1   4
2        b 1444306587     2   2
3        a 1444305600     3   3
4        c 1444291200     4   4
5        a 1444291900     5   5
