R: select all unique entries that have at least n records within a time range
I have the following dataset of 32,000 water-chemistry records, organized by monitoring site and sampling year. An example:
data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)
I want to select only the data from monitoring sites that have at least n measurements between year1 and year2. Typically, I want to select all data from sites that show at least 10 measurements between 1990 and 2005. So far I have tried the following, without success:
data %>%
group_by(Site_ID) %>%
filter(n()>=n %in% between(Year, year1, year2))
This base R code works with the sample data you provided. You can change the number in IDstoGet <- Site_IDs[CountBySite_IDs >= 3] to return only the site IDs that have at least as many data points as you need.
DataInRange <- data[data$Year >= 1990 & data$Year <= 2005, ]
Site_IDs <- unique(DataInRange$Site_ID)
# Count rows per site with exact matching (grep would also match ID 1 inside ID 10)
CountBySite_IDs <- sapply(Site_IDs, function(x) sum(DataInRange$Site_ID == x))
IDstoGet <- Site_IDs[CountBySite_IDs >= 3]
DataInRange[DataInRange$Site_ID %in% IDstoGet, ]
I'm not sure whether this is the result you expect, but you could try:
library(dplyr)
data %>%
  group_by(Site_ID) %>%
  filter(between(Year, 1990, 2005)) %>%
  filter(n() >= 10)
A base R alternative is:
subset(data,
       !!ave(ave(Year,
                 Site_ID,
                 FUN = function(x) x >= 1990 & x <= 2005),
             Site_ID,
             FUN = function(x) sum(x) > 2))
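The nested ave() call is compact but hard to read. As a rough sketch, the same logic can be unpacked into named steps on the sample data. The variable names in_range, n_in_range, and result are illustrative, and the threshold of 3 mirrors the sum(x) > 2 condition.

```r
data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Step 1: flag each row whose Year lies inside the range
in_range <- data$Year >= 1990 & data$Year <= 2005

# Step 2: count flagged rows per site; ave() repeats the count on every row of that site
n_in_range <- ave(as.numeric(in_range), data$Site_ID, FUN = sum)

# Step 3: keep every row of the sites with at least 3 in-range measurements
result <- data[n_in_range >= 3, ]
result
```

On the sample data only site 3 qualifies. Note that this variant keeps all rows of a qualifying site, including any rows that fall outside the year range; filter on in_range as well if only in-range rows are wanted.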
This selects all Site_ID groups that contain at least PARM['n'] observations between PARM['yr1'] and PARM['yr2'].
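The code that accompanied this sentence is not shown here. As a minimal sketch only, assuming PARM is a named list with elements n, yr1, and yr2 (that structure and the helper name select_sites are assumptions, not the original answer's code), a parameterized selection in base R might look like this:

```r
# Hypothetical parameter list; the element names come from the sentence above
PARM <- list(n = 3, yr1 = 1990, yr2 = 2005)

data <- data.frame(
  Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Hypothetical helper: return the in-range rows of sites that have
# at least parm$n rows between parm$yr1 and parm$yr2 (inclusive)
select_sites <- function(df, parm) {
  in_range <- df[df$Year >= parm[["yr1"]] & df$Year <= parm[["yr2"]], ]
  counts <- table(in_range$Site_ID)
  keep_ids <- names(counts)[counts >= parm[["n"]]]
  in_range[as.character(in_range$Site_ID) %in% keep_ids, ]
}

selected <- select_sites(data, PARM)
selected
```

For the real use case from the question, PARM would be list(n = 10, yr1 = 1990, yr2 = 2005).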
Base R solution:
# Minimum number of observations a site must have within the year range: n => numeric scalar
n <- 10
# Append a per-site row count to the in-range subset of the original df: sites_counted_df => data.frame
sites_counted_df <-
  within(data[which(data$Year >= 1990 & data$Year <= 2005), ],
    {
      count <- ave(Site_ID, Site_ID, FUN = length)
    }
  )
# Keep the records of sites with at least n observations:
# n_observation_sites => data.frame
n_observation_sites <- sites_counted_df[which(sites_counted_df$count >= n), ]
Data:
data <- data.frame(
Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)
This solution works very well. Thank you very much, and apologies for the late reply!