R: Convert a list-type column to long format by separating its items
Tags: r, dataframe, split
I have a table with two columns of interest, as shown below:
status_id | hashtags
947306525726527488 | NEWYEARSEVEPARTY919
947306316959281153 | MakeItALifestyle
947306315952611330 | c("Ejuice", "vape", "vaping")
947306265520328704 | c("vapefam", "vapenation", "vapefamily")
947305941522771968 | nowplaying
Data
structure(list(status_id = c("947306525726527488", "947306316959281153",
"947306315952611330", "947306265520328704", "947305941522771968"
), hashtags = list("NEWYEARSEVEPARTY919", "MakeItALifestyle",
c("Ejuice", "vape", "vaping", "eliquid", "ecigjuice", "ecig",
"vapejuice"), c("vapefam", "vapenation", "vapefamily", "vapelife",
"vapelyfe", "vapeon", "positivity"), "nowplaying")), .Names = c("status_id",
"hashtags"), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
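Expected result
I would like the following two tables (the real df has more columns, which I removed here because they are not relevant to the question):
df1
status_id
947306525726527488
947306316959281153
947306315952611330
947306265520328704
947305941522771968
and df2
status_id | hashtags
947306525726527488 | NEWYEARSEVEPARTY919
947306316959281153 | MakeItALifestyle
947306315952611330 | Ejuice
947306315952611330 | vape
947306315952611330 | vaping
947306265520328704 | vapefam
947306265520328704 | vapenation
947306265520328704 | vapefamily
947305941522771968 | nowplaying
The original data has one row per status_id, and every entry with more than one hashtag is stored as c(...), so the column is of type "list". df2 puts each individual hashtag on its own row.
Although I had never come across list-type columns before, googling taught me a lot about converting lists into columns, but not about columns of type "list" (data.table).
library(dplyr)
rm(list=ls())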
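Here is one possible solution. I called your data mydf. The hashtags column contains lists. You can use unlist() and paste() to build a single comma-separated string for each row of hashtags; if you prefer, toString() works in place of paste(). Once hashtags holds one string per row, you split it: rows 3 and 4 contain several hashtags, and you want each on its own row. cSplit() from the splitstackshape package does this with direction = "long". The result is the df2 you want. With df2 in hand, create df1 by selecting status_id and keeping the unique values.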
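library(dplyr)
library(splitstackshape)
# Collapse each row's hashtags into one comma-separated string,
# then split that string into one row per hashtag.
df2 <- mydf %>%
  rowwise() %>%
  mutate(hashtags = paste(unlist(hashtags), collapse = ",")) %>%
  cSplit(splitCols = "hashtags", sep = ",", direction = "long")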
status_id hashtags
1: 947306525726527488 NEWYEARSEVEPARTY919
2: 947306316959281153 MakeItALifestyle
3: 947306315952611330 Ejuice
4: 947306315952611330 vape
5: 947306315952611330 vaping
6: 947306315952611330 eliquid
7: 947306315952611330 ecigjuice
8: 947306315952611330 ecig
9: 947306315952611330 vapejuice
10: 947306265520328704 vapefam
11: 947306265520328704 vapenation
12: 947306265520328704 vapefamily
13: 947306265520328704 vapelife
14: 947306265520328704 vapelyfe
15: 947306265520328704 vapeon
16: 947306265520328704 positivity
17: 947305941522771968 nowplaying
df1 <- unique(df2[, 1, with = FALSE])
status_id
1: 947306525726527488
2: 947306316959281153
3: 947306315952611330
4: 947306265520328704
5: 947305941522771968
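The answer above mentions that toString() can stand in for paste(). As a small base-R sketch of what the collapsing step produces for one element of the list column (the vector tags is a made-up stand-in, not part of the original data):

```r
# Hypothetical example vector standing in for one element of the hashtags list column
tags <- c("Ejuice", "vape", "vaping")

# paste() with collapse joins the elements using exactly the separator given
collapsed_paste <- paste(unlist(tags), collapse = ",")

# toString() joins with ", " (a comma followed by a space)
collapsed_tostring <- toString(tags)

print(collapsed_paste)     # [1] "Ejuice,vape,vaping"
print(collapsed_tostring)  # [1] "Ejuice, vape, vaping"
```

Because toString() adds a space after each comma, the later split has to tolerate that whitespace. As far as I know, cSplit() handles this through its stripWhite argument, which defaults to TRUE.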
For the sake of completeness, here is also a data.table solution (the data is called juice here):
library(data.table)
df2 <- setDT(juice)[, .(hashtag = unlist(hashtags)), by = status_id]
df1 <- unique(juice[, .(status_id)])
df2
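Maybe something like this
df1-original is the topmost table, above the str() code. df1 and df2 are the desired results.
What is the difference between the original data and df2?
The original data has one row per status_id, with all hashtags stored as c(...) and classed as type "list". df2 separates the individual hashtags into separate rows.
That works perfectly, for df1 too. Please post it as an answer: best and simplest.
@SaleemKhan Glad I could help. :)
@SaleemKhan, since you are already using the "tidyverse", can't you just do unnest(juice)? Or, with "splitstackshape", listCol_l(juice, "hashtags")[] :-)
@A5C1D2H2I1M1N2O1R2T1 I see. That is neat, isn't it? :) If you don't mind, I would be happy to update my answer with it. A great idea for a solution.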
Update to the first answer, using the listCol_l() suggestion from the comments:
df2 <- listCol_l(mydf, "hashtags")
Output of the data.table solution:
df2
status_id hashtag
1: 947306525726527488 NEWYEARSEVEPARTY919
2: 947306316959281153 MakeItALifestyle
3: 947306315952611330 Ejuice
4: 947306315952611330 vape
5: 947306315952611330 vaping
6: 947306315952611330 eliquid
7: 947306315952611330 ecigjuice
8: 947306315952611330 ecig
9: 947306315952611330 vapejuice
10: 947306265520328704 vapefam
11: 947306265520328704 vapenation
12: 947306265520328704 vapefamily
13: 947306265520328704 vapelife
14: 947306265520328704 vapelyfe
15: 947306265520328704 vapeon
16: 947306265520328704 positivity
17: 947305941522771968 nowplaying
df1
status_id
1: 947306525726527488
2: 947306316959281153
3: 947306315952611330
4: 947306265520328704
5: 947305941522771968
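If neither package is available, the same reshaping can be sketched in base R by repeating each status_id once per hashtag. This is only an illustration, assuming the data look like the dput() output in the question; the shortened mydf below is rebuilt by hand with fewer rows and hashtags.

```r
# Shortened copy of the question's data (assumption: a list column named hashtags)
mydf <- data.frame(
  status_id = c("947306525726527488", "947306315952611330", "947305941522771968"),
  stringsAsFactors = FALSE
)
mydf$hashtags <- list("NEWYEARSEVEPARTY919", c("Ejuice", "vape", "vaping"), "nowplaying")

# lengths() counts the hashtags per row; rep() repeats each id that many times
df2 <- data.frame(
  status_id = rep(mydf$status_id, times = lengths(mydf$hashtags)),
  hashtags  = unlist(mydf$hashtags),
  stringsAsFactors = FALSE
)

# One row per unique status_id
df1 <- unique(df2["status_id"])

print(df2)
print(df1)
```

Keeping status_id as character, as in the question's data, avoids losing precision on these 18-digit ids, which are too large to be represented exactly as doubles.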