R: Subset a data frame after the first occurrence of a specific string

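I have a data frame in the following format, and I want to subset it so that, within each project, I keep only the activities that occur before the first funding activity: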

 project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
 activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')

 df<- data.frame(project,activity)

Any suggestions?

You can try cumsum to track, for each project, whether a row occurs before or after the first funding row:

library(dplyr)

df %>%
  group_by(project) %>%
  mutate(before.funding = cumsum(activity == "funding") == 0) %>%
  ungroup() %>%
  filter(before.funding) %>%
  select(-before.funding)

# A tibble: 5 x 2
  project activity
   <fctr>   <fctr>
1       A  kickoff
2       B  kickoff
3       B  kickoff
4       C  kickoff
5       C delivery
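
To see why the cumsum flag works, it can help to look at the running count directly. The following is a minimal base R sketch of the same idea; the names funding_seen, before_funding and result are made up for illustration, and ave() stands in for group_by() here.

```r
project  <- c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')
activity <- c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
              'funding', 'kickoff', 'kickoff', 'delivery')
df <- data.frame(project, activity, stringsAsFactors = FALSE)

# Running count of "funding" rows seen so far, restarted for each project.
funding_seen <- ave(as.integer(df$activity == "funding"), df$project, FUN = cumsum)

# A row is kept only while no "funding" row has been seen in its project.
before_funding <- funding_seen == 0

result <- df[before_funding, ]
print(result)
```

The count stays at 0 until the first funding row and is at least 1 from that row onwards, so comparing it with 0 drops the funding row itself and everything after it. Project C never reaches 1, so all of its rows are kept.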
Another dplyr option uses cummin:

df %>%
    group_by(project) %>%
    dplyr::filter(cummin(activity != "funding") == 1)
This yields:

# project activity
# <fctr>   <fctr>
# 1       A  kickoff
# 2       B  kickoff
# 3       B  kickoff
# 4       C  kickoff
# 5       C delivery
# project activity
# A       kickoff 
# B       kickoff 
# B       kickoff 
# C       kickoff 
# C       delivery

I hope this helps.

For the sake of completeness, here is also a data.table solution:

library(data.table)
setDT(df)[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
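Explanation

Within each project group, we look up the index of the first occurrence of funding in the activity column, together with the indices of all subsequent rows: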

df[, .I[.I >= first(.I[activity == 'funding'])], by = project]
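In data.table, .I is a special symbol that holds the row positions in df. The second subsetting, .I[.I >= first(.I[activity == 'funding'])], is required because which(.I >= first(.I[activity == 'funding'])) would return only the row positions within the group, not the row positions within df.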
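
The difference between positions in the whole table and positions within a group can be checked with a small sketch. The names dt, global_pos and local_pos are illustrative only and not part of the answer above.

```r
library(data.table)

dt <- data.table(
  project  = c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
  activity = c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
               'funding', 'kickoff', 'kickoff', 'delivery')
)

# Positions of the "funding" rows relative to the whole table, via .I
global_pos <- dt[, .(pos = .I[activity == 'funding']), by = project]

# Positions of the same rows relative to the start of each group, via which()
local_pos <- dt[, .(pos = which(activity == 'funding')), by = project]

print(global_pos)
print(local_pos)
```

For project B the funding row is the third row of its group but the sixth row of the table. Only the table-wide position can be used to drop rows from df afterwards, which is why the answer indexes into .I.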

We have now identified the rows that should not be shown, so we get the final result by excluding those row numbers:

df[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]

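If date information is available (and I bet there is a date column when dealing with projects and activities), we can use a non-equi anti-join on the date column, as suggested by @Frank: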

# create sample data with a date column
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
date <- (as.Date ("2017-10-02") + c(1,4,7,2,5,8,11,3,6))
df <- data.frame(project,activity, date, stringsAsFactors = FALSE)
df <- df[order(df$date), ]
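
A non-equi anti-join along those lines might look like the following sketch. It is an illustration of the technique under stated assumptions, not the code from the answer: the object names dt and first_funding and the column funding_date are made up here, and the sketch assumes that dates within a project are in activity order.

```r
library(data.table)

project  <- c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')
activity <- c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
              'funding', 'kickoff', 'kickoff', 'delivery')
date <- as.Date("2017-10-02") + c(1, 4, 7, 2, 5, 8, 11, 3, 6)
dt <- data.table(project, activity, date)

# Date of the first "funding" activity per project.
# Projects without any funding row simply do not appear here.
first_funding <- dt[activity == "funding", .(funding_date = min(date)), by = project]

# Anti-join (the "!" prefix): drop every row whose date is on or after
# the first funding date of its project.
result <- dt[!first_funding, on = .(project, date >= funding_date)]
print(result)
```

Because the join matches on dates rather than on row positions, this variant does not depend on the rows being sorted.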

Some other alternatives using the data.table package:

1) Using Reduce

library(data.table)
setDT(df)[df[, .I[!Reduce('+', activity == 'funding', accumulate = TRUE)], project]$V1]

2) Using cummax

library(data.table)
setDT(df)[df[, .I[!cummax(activity == 'funding')], project]$V1]

3) Using pmax

library(data.table)
setDT(df)[!df[, pmax(.I, .I[activity == 'funding']), by = project]$V1]
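
As a quick sanity check, the three variants can be run side by side on the sample data. This is only a sketch; the names dt, by_reduce, by_cummax and by_pmax are illustrative, and the data has at most one funding row per project.

```r
library(data.table)

dt <- data.table(
  project  = c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
  activity = c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
               'funding', 'kickoff', 'kickoff', 'delivery')
)

# 1) Keep rows while the running sum of "funding" hits is still 0
by_reduce <- dt[dt[, .I[!Reduce('+', activity == 'funding', accumulate = TRUE)], project]$V1]

# 2) Keep rows while the running maximum of the "funding" flag is still 0
by_cummax <- dt[dt[, .I[!cummax(activity == 'funding')], project]$V1]

# 3) Drop the funding row and every later row of the same project
by_pmax <- dt[!dt[, pmax(.I, .I[activity == 'funding']), by = project]$V1]

print(by_reduce)
```

The first two select the rows to keep, while the third computes the rows to exclude and negates the selection with the ! prefix.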