R: Select all unique entries that have at least n records within a time range

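I have the following data set of 32,000 water chemistry compound entries, organised by monitoring site and sampling year. A sample looks like this: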

data= data.frame(Site_ID=c(1, 1, 1, 2, 2, 2, 3, 3, 3), Year=c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005), AnnualMean=c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9))
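I want to select the data only for those monitoring sites that have at least n measurements between year1 and year2. Typically, I would like to select all data from monitoring sites that have at least 10 measurements between 1990 and 2005. So far I have tried the following, without success: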

data %>%
group_by(Site_ID) %>%
filter(n()>=n %in% between(Year, year1, year2))

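This base R code works with the sample data you provided. You can change the number in IDstoGet <- Site_IDs[CountBySite_IDs >= 3] to get only the site IDs that have at least as many data points as you need.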

# Keep only the rows whose Year falls inside the range
DataInRange <- data[data$Year >= 1990 & data$Year <= 2005, ]
Site_IDs <- unique(DataInRange$Site_ID)
# Count rows per site with exact matching (grep() would also match site 10 when looking for site 1)
CountBySite_IDs <- sapply(Site_IDs, function(x) sum(DataInRange$Site_ID == x))
IDstoGet <- Site_IDs[CountBySite_IDs >= 3]

DataInRange[DataInRange$Site_ID %in% IDstoGet, ]

I am not sure whether this is the result you expect, but you could try:

library(dplyr)

data %>%
  group_by(Site_ID) %>%
  filter(between(Year, 1990, 2005)) %>%
  filter(n() >= 10)
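
The pipeline above returns only the rows that fall inside the year range. If the goal is instead to keep every row of a qualifying site, including years outside the range, one possible variant counts the in-range rows per group inside a single filter(). This is a sketch that assumes dplyr is installed; n_min, year1 and year2 are illustrative names and values, not taken from the original post.

```r
library(dplyr)

data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Hypothetical parameters, chosen so that the small sample data returns something
n_min <- 3
year1 <- 1990
year2 <- 2005

all_rows_of_qualifying_sites <- data %>%
  group_by(Site_ID) %>%
  # between() is TRUE for in-range years; sum() counts them per site
  filter(sum(between(Year, year1, year2)) >= n_min) %>%
  ungroup()
```

The single logical value computed per group is recycled to every row of that group, so a site is kept or dropped as a whole.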
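
A base R alternative is the following, which keeps every row of the sites that have more than 2 observations between 1990 and 2005: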

subset(data,
       !!ave(ave(Year,
                 Site_ID,
                 FUN = function(x) x>=1990&x<=2005),
             Site_ID,
             FUN = function(x) sum(x)>2))
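
The nested ave() call is compact but hard to read. The following sketch unpacks it into separate steps on the sample data; the intermediate names in_range, n_in_range and keep are illustrative only.

```r
data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Step 1: flag each row whose Year lies inside the range
in_range <- data$Year >= 1990 & data$Year <= 2005

# Step 2: count the flagged rows per site; ave() repeats the
# per-site count on every row belonging to that site
n_in_range <- ave(as.numeric(in_range), data$Site_ID, FUN = sum)

# Step 3: keep all rows of sites with more than 2 in-range observations
keep <- n_in_range > 2
result <- data[keep, ]
```

As in the nested version, rows outside the range are retained for any site that qualifies.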

This selects all Site_ID groups that have at least PARM['n'] observations between PARM['yr1'] and PARM['yr2'].
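
A rough base R sketch of such a parameterised selection might look like the following. It assumes PARM is a named numeric vector with elements yr1, yr2 and n; that structure, the example values and the helper names in_range, counts and sites_ok are all assumptions made for illustration.

```r
data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Assumed structure: a named numeric vector with the range and the threshold
PARM <- c(yr1 = 1990, yr2 = 2005, n = 3)

# Rows whose Year lies inside the range ([[ ]] drops the element name)
in_range <- data$Year >= PARM[["yr1"]] & data$Year <= PARM[["yr2"]]

# Number of in-range rows per site, named by Site_ID
counts <- tapply(in_range, data$Site_ID, sum)

# Site IDs (as character names) that meet the threshold
sites_ok <- names(counts)[counts >= PARM[["n"]]]

# In-range rows of the qualifying sites
selected <- data[in_range & as.character(data$Site_ID) %in% sites_ok, ]
```

Dropping the in_range condition from the last line would return all years of the qualifying sites instead.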

Base R solution:

# Minimum number of observations a site must have within the range:
# n => numeric scalar
n <- 10

# Subset the rows in the year range and append a per-site row count: sites_counted_df => data.frame:
sites_counted_df <-
  within(data[which(data$Year >= 1980 & data$Year <= 2005), ],
         {
           count <- ave(Site_ID, Site_ID, FUN = length)
         }
  )

# Keep the records of sites with at least n observations:
# n_observation_sites => data.frame
n_observation_sites <- sites_counted_df[sites_counted_df$count >= n, ]

Data:

data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

This solution works great, thank you very much, and apologies for the late reply!