Flattening and aggregating a data table in R


One of my datasets looks like this:

name  alias (list of alias)
x     c("R","V","Q")
y     "Z"
q     c("A", "R", "M")
w     c("C","A","R")
I would first like to flatten the table as follows:

name alias 
x  "R"
x  "V"
x  "Q"
y  "Z"
q  "A"
q  "R"
q  "M"
w  "C"
w  "A"
w  "R"
and then reshape the data to get:

alias name 
"R"   c(x,q,w)
"V"   x
"Q"   x
"Z"   y
"A"  c(q,w)
"M"  q
"C"  w
How can I do this in R?

Here is the actual dataset:

> dput(head(cases))
structure(list(caseid = c(7703415, 7758128, 7858259, 8802954, 
8829620, 8847200), tcount = c(2L, 2L, 3L, 10L, 4L, 2L), helplinks = c("character(0", 
"c(\"60107\", \"56085\", \"57587\", \"3000020\"", "character(0", 
"character(0", "c(\"60107\", \"3000023\", \"3000020\", \"60107\", \"56085\", \"57587\"", 
"character(0")), .Names = c("caseid", "tcount", "helplinks"), row.names = c(NA, 
6L), class = "data.frame")

> head(cases)
   caseid tcount                                                  helplinks
1 7703415      2                                                character(0
2 7758128      2                     c("60107", "56085", "57587", "3000020"
3 7858259      3                                                character(0
4 8802954     10                                                character(0
5 8829620      4 c("60107", "3000023", "3000020", "60107", "56085", "57587"
6 8847200      2                                                character(0
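New answer: using cSplit from my splitstackshape package:

library(splitstackshape)  # provides cSplit(), which returns a data.table
# Split helplinks on commas into long format, strip the leftover
# 'character(0', 'c(' and quote characters, then collect caseid per helplink
cSplit(cases, "helplinks", ",", "long")[, helplinks := gsub(
  'character\\(0|c\\(|\\"', "", helplinks)][, list(
    caseid = list(caseid)), by = helplinks]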
#    helplinks                          caseid
# 1:           7703415,7858259,8802954,8847200
# 2:     60107         7758128,8829620,8829620
# 3:     56085                 7758128,8829620
# 4:     57587                 7758128,8829620
# 5:   3000020                 7758128,8829620
# 6:   3000023                         8829620
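For readers who would rather not install a package, the same idea can be sketched in base R. This is only an illustration under stated assumptions: `cases` is rebuilt from the dput() output in the question, the names `quoted`, `links`, `long` and `by_link` are made up for the example, and every help link is assumed to sit between double quotes.

```r
# Stand-in for head(cases), copied from the dput() output in the question
cases <- data.frame(
  caseid = c(7703415, 7758128, 7858259, 8802954, 8829620, 8847200),
  tcount = c(2L, 2L, 3L, 10L, 4L, 2L),
  helplinks = c(
    "character(0",
    "c(\"60107\", \"56085\", \"57587\", \"3000020\"",
    "character(0",
    "character(0",
    "c(\"60107\", \"3000023\", \"3000020\", \"60107\", \"56085\", \"57587\"",
    "character(0"
  ),
  stringsAsFactors = FALSE
)

# Extract every double-quoted token; rows holding "character(0" yield nothing
quoted <- regmatches(cases$helplinks, gregexpr('"[^"]*"', cases$helplinks))
links  <- lapply(quoted, function(x) gsub('"', "", x, fixed = TRUE))

# Step 1: one row per (caseid, helplink) pair
long <- data.frame(
  caseid    = rep(cases$caseid, lengths(links)),
  helplinks = unlist(links),
  stringsAsFactors = FALSE
)

# Step 2: collect the case ids that belong to each help link
by_link <- split(long$caseid, long$helplinks)
```

Unlike the cSplit() result, rows without any help link simply drop out here instead of forming an empty group, and repeated case ids are kept as they are; wrapping the last step in `lapply(..., unique)` would remove them.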
Old answer: I assumed you were starting with something like this:

df <- data.frame(
  name = c("x", "y", "q", "w"),
  alias = I(list(c("R","V","Q"), "Z", c("A", "R", "M"), c("C","A","R")))
)
df
#   name   alias
# 1    x R, V, Q
# 2    y       Z
# 3    q A, R, M
# 4    w C, A, R
You don't really need splitstackshape for this, so if you want to drop the self-promotion part of my answer, you can use data.table on its own:

library(data.table)
as.data.table(df)[, list(
  alias = unlist(alias)), by = name][, list(
  name = list(name)), by = alias]
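The two chained steps above (unlist the aliases per name, then gather the names per alias) can also be sketched without data.table. This is a rough base R equivalent, assuming the same list-column `df` as above; `flat` and `by_alias` are names invented for the example.

```r
# Same assumed starting point: alias is a list column
df <- data.frame(
  name  = c("x", "y", "q", "w"),
  alias = I(list(c("R", "V", "Q"), "Z", c("A", "R", "M"), c("C", "A", "R"))),
  stringsAsFactors = FALSE
)

# Step 1: flatten, repeating each name once per alias it carries
flat <- data.frame(
  name  = rep(df$name, lengths(df$alias)),
  alias = unlist(df$alias),
  stringsAsFactors = FALSE
)

# Step 2: invert, collecting the names that share each alias
by_alias <- split(flat$name, flat$alias)
by_alias[["R"]]
```

The result is a named list rather than a table with a list column, which is usually enough for lookups such as `by_alias[["A"]]`.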
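First, we clear out the "character(0" entries. Then we read in the character values that used to be lists but now need to be scanned. Then we apply a function that builds a data frame from each row:

# Keep only the rows whose helplinks value starts a c(...) call
good.case <- cases[ grepl("c\\(", cases$helplinks),]
# One data frame per remaining row
 lapply( split(good.case, row.names(good.case) ), function(d){
   # Strip "c(", commas and spaces, then let scan() read the quoted values
   vec <- scan(text=gsub("c\\(|[, ]", "", d$helplinks) ,what="")
   # caseid has length 1 and is recycled to the length of vec
   do.call( data.frame, list(caseid=d$caseid, alias=vec) )
 }
 )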
#-------
#Read 4 items
#Read 6 items
$`2`
   caseid   alias
1 7758128   60107
2 7758128   56085
3 7758128   57587
4 7758128 3000020

$`5`
   caseid   alias
1 8829620   60107
2 8829620 3000023
3 8829620 3000020
4 8829620   60107
5 8829620   56085
6 8829620   57587

# Same as above, but repeating caseid explicitly and storing the result
 expanded <- lapply( split(good.case, row.names(good.case) ), function(d){
    vec <- scan(text=gsub("c\\(|[, ]", "", d$helplinks) ,what="")
    do.call( data.frame, list(caseid=rep(d$caseid, length(vec)), alias=vec) )
  }
  )
#Read 4 items
#Read 6 items
But I guess that only gets you halfway. With Ananda's answer sitting there with 5 upvotes, there is no need to take this any further.
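For completeness, the remaining half could be sketched roughly as follows: stack the per-case data frames and then gather the case ids per alias. To keep the example self-contained, `expanded` is rebuilt by hand from the printed output above rather than through scan(); `long` and `by_alias` are illustrative names.

```r
# Stand-in for `expanded`, typed in from the output printed above
expanded <- list(
  `2` = data.frame(
    caseid = rep(7758128, 4),
    alias  = c("60107", "56085", "57587", "3000020"),
    stringsAsFactors = FALSE
  ),
  `5` = data.frame(
    caseid = rep(8829620, 6),
    alias  = c("60107", "3000023", "3000020", "60107", "56085", "57587"),
    stringsAsFactors = FALSE
  )
)

# Stack the list of data frames into one long table
long <- do.call(rbind, expanded)

# Collect the distinct case ids for each alias
by_alias <- lapply(split(long$caseid, long$alias), unique)
by_alias[["60107"]]
```

The `unique()` call is a design choice: case 8829620 lists 60107 twice, and without it that id would appear twice under the same alias.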

Does the data in the first table really look like that, with the c function calls? It would probably be best to post dput(head(data)).

Hello Richard, yes, the data looks like the first table. Basically, alias is a list of the unique words in a text.

@RichardScriven, I think it's a list column.

That's a terrible dataset, where did you get it? Fire them right away.

@RichardScriven, OK, I take that back. Not a list column. And yes, it is horrible!

Better answer than mine. But it doesn't really answer the question.

Wow. I can't believe you did both steps at once.

Wow!