Counting appearances of elements in a string in R
I have the following dataset:
structure(list(ID = c(5L, 6L, 7L, 8L, 10L), chain = c("x49",
"x43", "x32 > x42 > x49 > x45 > x20 > x50 > x38", "x54 > x44",
"x38 > x38")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
   ID                                   chain
1:  5                                     x49
2:  6                                     x43
3:  7 x32 > x42 > x49 > x45 > x20 > x50 > x38
4:  8                               x54 > x44
5: 10                               x38 > x38
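The chain column represents the purchase path of a product, and some information is missing from it (the start and the purchase). The goal is to count each value in the chain twice: once as a source (from) and once as a destination (to). To do that, I need to restructure the dataset.
For example, the restructured chain x54 > x44 should look like this: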
from to
1 start x54
2 x54 x44
3 x44 buy
The whole result should look like this:
from to
1 start x49
2 x49 buy
3 start x43
4 x43 buy
5 start x32
6 x32 x42
7 x42 x49
8 x49 x45
9 x45 x20
10 x20 x50
11 x50 x38
12 x38 buy
13 start x54
14 x54 x44
15 x44 buy
16 start x38
17 x38 x38
18 x38 buy
I have already tried the following, but I'm not sure whether it is a good idea (and I don't know how to continue from there):
df
A base R approach is to split the strings on " > " and build a data frame that combines all the values:
do.call(rbind, lapply(strsplit(df$chain, " > "), function(x)
data.frame(from = c("start",x), to = c(x, "buy"))))
# from to
#1 start x49
#2 x49 buy
#3 start x43
#4 x43 buy
#5 start x32
#6 x32 x42
#7 x42 x49
#8 x49 x45
#9 x45 x20
#10 x20 x50
#11 x50 x38
#12 x38 buy
#13 start x54
#14 x54 x44
#15 x44 buy
#16 start x38
#17 x38 x38
#18 x38 buy
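The question's stated goal is to count each value once as a source and once as a destination. As a minimal sketch of that follow-up step, assuming the from/to pairs are built exactly as above, base R's table() and aggregate() can do the counting. The names edges, from_counts, to_counts and transitions are illustrative and do not appear in the original thread, and the sample data is rebuilt here as a plain data.frame so the snippet runs on its own.

```r
# Sample data from the question, rebuilt as a plain data.frame (assumption:
# the chain column is character, not factor)
df <- data.frame(
  ID = c(5L, 6L, 7L, 8L, 10L),
  chain = c("x49", "x43", "x32 > x42 > x49 > x45 > x20 > x50 > x38",
            "x54 > x44", "x38 > x38"),
  stringsAsFactors = FALSE
)

# Build the from/to pairs: prepend "start" and append "buy" to every chain
edges <- do.call(rbind, lapply(strsplit(df$chain, " > "), function(x)
  data.frame(from = c("start", x), to = c(x, "buy"),
             stringsAsFactors = FALSE)))

# How often each value appears as a source and as a destination
from_counts <- table(edges$from)
to_counts <- table(edges$to)

# How often each distinct from -> to transition occurs
transitions <- aggregate(n ~ from + to,
                         data = transform(edges, n = 1L),
                         FUN = sum)
```

With this sample, "start" occurs five times as a source and "buy" five times as a destination, one per chain.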
Using a similar approach, a tidyverse way would be:
library(tidyverse)
map_dfr(str_split(df$chain, " > "), ~tibble(from = c("start",.), to = c(., "buy")))
We can use str_c to paste strings onto the beginning and the end of each chain, then use separate_rows to expand the dataset, all within the tidyverse:
library(tidyverse)
dt %>%
mutate(chain = str_c("start > ", chain, " > buy")) %>%
separate_rows(chain) %>% group_by(ID) %>%
transmute(from = chain, to = lead(chain)) %>%
na.omit %>%
ungroup %>%
select(-ID)
# A tibble: 18 x 2
# from to
# <chr> <chr>
# 1 start x49
# 2 x49 buy
# 3 start x43
# 4 x43 buy
# 5 start x32
# 6 x32 x42
# 7 x42 x49
# 8 x49 x45
# 9 x45 x20
#10 x20 x50
#11 x50 x38
#12 x38 buy
#13 start x54
#14 x54 x44
#15 x44 buy
#16 start x38
#17 x38 x38
#18 x38 buy
Since you already have a data.table, and you write that performance might be an issue, check this data.table alternative: d[,{x
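The comment above is cut off, so the commenter's actual code is not recoverable. The following is only a sketch of what a data.table version could look like, under the assumptions that the table is named d, that each ID occurs exactly once, and that the separator is always the literal " > ". The name res is illustrative.

```r
library(data.table)

# Sample data from the question
d <- data.table(
  ID = c(5L, 6L, 7L, 8L, 10L),
  chain = c("x49", "x43", "x32 > x42 > x49 > x45 > x20 > x50 > x38",
            "x54 > x44", "x38 > x38")
)

# For each ID, split the chain once and pair every step with its successor.
# Assumes one row per ID, so `chain` is a length-1 vector inside each group.
res <- d[, {
  x <- strsplit(chain, " > ", fixed = TRUE)[[1]]
  .(from = c("start", x), to = c(x, "buy"))
}, by = ID]

# Drop the grouping column to match the requested from/to layout
res[, ID := NULL]
```

Using fixed = TRUE avoids regular-expression matching for the separator; whether that makes a measurable difference would need to be checked on the real data.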