Keep rows up to the first occurrence using R
I have a data frame 'HLS' in R that holds page-level details of visitors to a website. Each row represents one page of a visit, from page 3 up to at most page 10; the page number is given by page_count, as shown below.
ID page_count purchase_flag prob hl_flag
V1 3 1 0.76 1
V1 4 1 0.65 1
V1 5 1 0.04 0
V1 6 1 0.86 1
V1 7 1 0.04 0
V1 8 1 0.65 1
V1 9 1 0.01 0
V1 10 1 0.00 0
V2 3 0 0.03 0
V2 4 0 0.01 0
V2 5 0 0.02 0
V2 6 0 0.00 0
V3 3 1 0.02 0
V3 4 1 0.001 0
V3 5 1 0.76 1
V3 6 1 0.03 0
V4 3 0 0.04 0
V4 4 0 0.65 1
V4 5 0 0.03 0
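For anyone who wants to reproduce this, the sample above could be read into R roughly as follows. This is only a sketch: it assumes the table is whitespace-separated exactly as printed, and it uses base R's read.table so that no packages are needed.

```r
# Read the sample table shown above into a data frame called HLS.
# The header line starts right after the opening quote so that the
# first line of text is the column names.
HLS <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "ID page_count purchase_flag prob hl_flag
V1 3 1 0.76 1
V1 4 1 0.65 1
V1 5 1 0.04 0
V1 6 1 0.86 1
V1 7 1 0.04 0
V1 8 1 0.65 1
V1 9 1 0.01 0
V1 10 1 0.00 0
V2 3 0 0.03 0
V2 4 0 0.01 0
V2 5 0 0.02 0
V2 6 0 0.00 0
V3 3 1 0.02 0
V3 4 1 0.001 0
V3 5 1 0.76 1
V3 6 1 0.03 0
V4 3 0 0.04 0
V4 4 0 0.65 1
V4 5 0 0.03 0")

# Inspect column types: ID is character, the flags are integer.
str(HLS)
```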
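I want to create a table that, for each ID, keeps the rows up to and including the first occurrence of hl_flag = 1; if hl_flag is 0 for every row of an ID, all rows for that ID should be kept. The output needs to look like this: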
ID page_count purchase_flag prob hl_flag
V1 3 1 0.76 1
V2 3 0 0.03 0
V2 4 0 0.01 0
V2 5 0 0.02 0
V2 6 0 0.00 0
V3 3 1 0.02 0
V3 4 1 0.001 0
V3 5 1 0.76 1
V4 3 0 0.04 0
V4 4 0 0.65 1
Thanks in advance for your help.
Update:
Adding the output of dput, as shown below:
structure(list(ung_id = c("00000f23-1019-4aff-8199-35bd0d032356/1",
"00000f23-1019-4aff-8199-35bd0d032356/1", "00000f23-1019-4aff-8199-35bd0d032356/1",
"00000f23-1019-4aff-8199-35bd0d032356/1", "00002b20-82d4-497b-a137-34e3bb4eaf74/1",
"00002b20-82d4-497b-a137-34e3bb4eaf74/1", "00002b20-82d4-497b-a137-34e3bb4eaf74/1",
"0000aeff-2d8b-4daa-a084-fb2980f1feed/1", "0000aeff-2d8b-4daa-a084-fb2980f1feed/1",
"0000b96e-566f-4b6e-925a-b7dcfd4a7208/1", "0000b96e-566f-4b6e-925a-b7dcfd4a7208/1",
"0000b96e-566f-4b6e-925a-b7dcfd4a7208/1", "0000b96e-566f-4b6e-925a-b7dcfd4a7208/1",
"0000b96e-566f-4b6e-925a-b7dcfd4a7208/1", "0000b96e-566f-4b6e-925a-b7dcfd4a7208/1",
"0000b96e-566f-4b6e-925a-b7dcfd4a7208/1", "0000b96e-566f-4b6e-925a-b7dcfd4a7208/1",
"0000d089-edda-4c8b-8b17-d9def3cae7cf/1", "0000d089-edda-4c8b-8b17-d9def3cae7cf/1",
"0000d089-edda-4c8b-8b17-d9def3cae7cf/1"), nop_count = c(3L,
4L, 5L, 6L, 3L, 4L, 5L, 3L, 4L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L), purchase_flag = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), prob = c(0.0777615841278747,
0.0738346887497272, 0.0741130887754292, 0.0785370078084892, 0.0619573259953132,
0.0516201527986966, 0.0562025814090338, 0.0837301511694211, 0.0579033581198143,
0.0364358545936557, 0.0329682922619259, 0.0420157964561273, 0.049855260762479,
0.0500481302257314, 0.0463893143028813, 0.049855260762479, 0.0391886960037603,
0.0683568422952682, 0.0570168506417919, 0.0661965354597502),
decile = structure(c(8L, 8L, 8L, 8L, 6L, 4L, 5L, 8L, 5L,
1L, 1L, 2L, 4L, 4L, 3L, 4L, 2L, 7L, 5L, 7L), .Label = c("(0.0257,0.0364]",
"(0.0364,0.0428]", "(0.0428,0.0482]", "(0.0482,0.0531]",
"(0.0531,0.0583]", "(0.0583,0.0645]", "(0.0645,0.0722]",
"(0.0722,0.0842]"), class = "factor"), hl_Flag = c(1L, 1L,
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L)), .Names = c("ung_id", "nop_count", "purchase_flag",
"prob", "decile", "hl_Flag"), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x00000000002b0788>, class = c("data.table",
"data.frame"))
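One option is to use data.table. We convert the 'data.frame' to a 'data.table' (setDT(HLS)) and group by 'ID'. Within each group we check whether any 'hl_flag' value is 1. If so, we use which.max to get the index of the first 1 in hl_flag, build the sequence (1:(which.max..)), and look up the matching row indices (.I); otherwise (else) we return all the row indices of the group (.I). Finally we extract the column that holds the row indices ($V1) and use it to subset the rows.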
library(data.table)
setDT(HLS)[HLS[, if(any(hl_flag==1)) .I[1:(which.max(hl_flag))]
else .I, ID]$V1]
# ID page_count purchase_flag prob hl_flag
# 1: V1 3 1 0.760 1
# 2: V2 3 0 0.030 0
# 3: V2 4 0 0.010 0
# 4: V2 5 0 0.020 0
# 5: V2 6 0 0.000 0
# 6: V3 3 1 0.020 0
# 7: V3 4 1 0.001 0
# 8: V3 5 1 0.760 1
# 9: V4 3 0 0.040 0
#10: V4 4 0 0.650 1
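To see why the any() check is needed alongside which.max, here is a small base R sketch on two made-up flag vectors (the vectors and the helper name keep_upto_first are only illustrative, not part of the answer above). which.max returns the position of the first maximum, so on an all-zero vector it returns 1, which would wrongly keep just the first row.

```r
# Two hypothetical flag vectors for a single ID.
flag_with_hit <- c(0L, 0L, 1L, 0L, 1L)
flag_all_zero <- c(0L, 0L, 0L)

# which.max gives the index of the FIRST maximum.
first_hit  <- which.max(flag_with_hit)  # 3: position of the first 1
misleading <- which.max(flag_all_zero)  # 1: the maximum is 0, first seen at position 1

# Illustrative helper: positions to keep within one group.
keep_upto_first <- function(flag) {
  if (any(flag == 1)) seq_len(which.max(flag)) else seq_along(flag)
}

keep_hit  <- keep_upto_first(flag_with_hit)  # 1 2 3
keep_zero <- keep_upto_first(flag_all_zero)  # 1 2 3 (all rows kept)
```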
Or, similar to the approach I showed for data.table, a base R option:
do.call(rbind, lapply(split(HLS, HLS$ID),
function(x) if(any(x$hl_flag==1))
x[seq(which.max(x$hl_flag)), ]
else x))
Or using dplyr:
library(dplyr)
HLS %>%
group_by(ID) %>%
filter(all(!hl_flag)| row_number() %in% seq(which.max(hl_flag)))
# ID page_count purchase_flag prob hl_flag
# (chr) (int) (int) (dbl) (int)
#1 V1 3 1 0.760 1
#2 V2 3 0 0.030 0
#3 V2 4 0 0.010 0
#4 V2 5 0 0.020 0
#5 V2 6 0 0.000 0
#6 V3 3 1 0.020 0
#7 V3 4 1 0.001 0
#8 V3 5 1 0.760 1
#9 V4 3 0 0.040 0
#10 V4 4 0 0.650 1
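A different way to express the same rule, not taken from the answers here, is a cumulative sum: a row is kept while no earlier row of the same ID has had hl_flag equal to 1. The sketch below uses base R's ave and assumes the rows are already ordered by page_count within each ID; the data is a reduced copy of the sample (ID, page_count and hl_flag only).

```r
# Reduced copy of the sample data from the question.
HLS <- data.frame(
  ID         = rep(c("V1", "V2", "V3", "V4"), times = c(8, 4, 4, 3)),
  page_count = c(3:10, 3:6, 3:6, 3:5),
  hl_flag    = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L,  # V1
                 0L, 0L, 0L, 0L,                  # V2
                 0L, 0L, 1L, 0L,                  # V3
                 0L, 1L, 0L),                     # V4
  stringsAsFactors = FALSE
)

# For each ID, count how many 1s occurred BEFORE the current row
# (shift the flags down by one, then take the running total).
hits_before <- ave(HLS$hl_flag, HLS$ID,
                   FUN = function(f) cumsum(c(0L, head(f, -1))))

# Keep rows with no earlier hit; the first 1 itself is therefore included.
res <- HLS[hits_before == 0, ]
res
```

Because the shift happens inside each group, no separate any() check is needed: an ID without a single 1 has a running total of 0 throughout and keeps all of its rows.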
You can try:
l <- lapply(split(df, df$ID), function(x) {if(any(x[5] == 1)) x[1:which.max(x[5] == 1),] else x})
To get the expected result, you can use bind_rows from the dplyr package:
library(dplyr)
bind_rows(l)
#ID page_count purchase_flag prob hl_flag
#(fctr) (int) (int) (dbl) (int)
#1 V1 3 1 0.760 1
#2 V2 3 0 0.030 0
#3 V2 4 0 0.010 0
#4 V2 5 0 0.020 0
#5 V2 6 0 0.000 0
#6 V3 3 1 0.020 0
#7 V3 4 1 0.001 0
#8 V3 5 1 0.760 1
#9 V4 3 0 0.040 0
#10 V4 4 0 0.650 1
Thanks for the reply, akrun. I just wanted to clarify: my IDs are in the format "00000f23-1019-4aff-8199-35bd0d032356/1" rather than V1, so what changes should be made?
@rahuliggu We don't need to change anything, because we only use 'ID' for grouping. Did you try this on the original dataset?
Yes, I tried it on the original dataset, but it gave me the error "Error: unexpected numeric constant in:", so I wondered whether it has to do with the ID used in the "else .I, ID]$V1]" part at the end of the code being different from V1. Please forgive me if I have misunderstood something very basic about the code. R is very new to me, hence the confusion.
@rahuliggu Can you check str(HLS) and the class of the 'ID' column?
The data type of ID is Factor. Should it be changed to character?
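One thing that stands out from the dput output in the question: the columns there are named ung_id and hl_Flag rather than ID and hl_flag, so the grouping and flag names in the answers would have to be adapted to the real data. The following is only a base R sketch on a reduced copy of that data (three columns, with the ID deliberately stored as a factor); the cause of the parse error is not established in this thread, so this is an illustration rather than a confirmed fix.

```r
# Reduced copy of the dput data: same IDs, page counts and flags,
# with ung_id stored as a factor as reported in the comments.
ids <- c("00000f23-1019-4aff-8199-35bd0d032356/1",
         "00002b20-82d4-497b-a137-34e3bb4eaf74/1",
         "0000aeff-2d8b-4daa-a084-fb2980f1feed/1",
         "0000b96e-566f-4b6e-925a-b7dcfd4a7208/1",
         "0000d089-edda-4c8b-8b17-d9def3cae7cf/1")
HLS <- data.frame(
  ung_id    = factor(rep(ids, times = c(4, 3, 2, 8, 3))),
  nop_count = c(3:6, 3:5, 3:4, 3:10, 3:5),
  hl_Flag   = c(1L, 1L, 1L, 1L,  1L, 0L, 1L,  1L, 1L,
                rep(0L, 8),  1L, 1L, 1L)
)

# Same logic as the base R answer, with the real column names.
# split() accepts a factor as the grouping variable, so no conversion
# to character is needed for this step.
res <- do.call(rbind, lapply(split(HLS, HLS$ung_id), function(x) {
  if (any(x$hl_Flag == 1)) x[seq_len(which.max(x$hl_Flag)), ] else x
}))
res
```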