Restructuring team data into individual-level data in R (while retaining team-level information)

My data currently looks like this:

Person  Team
  10    100
  11    100
  12    100
  10    200
  11    200
  14    200
  15    200
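I want to infer who knows whom from the teams they have been on together. I also want to count how many times each pair has been on a team together, and I want to keep track of the team ID codes that link each pair. In other words, I want to create a data set that looks like this: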

Person1 Person2 Count   Team1   Team2   Team3
   10      11     2      100     200     NA
   10      12     1      100     NA      NA
   11      12     1      100     NA      NA
   10      14     1      200     NA      NA
   10      15     1      200     NA      NA
   11      14     1      200     NA      NA
   11      15     1      200     NA      NA

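The resulting data set captures the relationships that can be inferred from the teams listed in the original data set. The "Count" variable reflects the number of times a pair of people were on a team together. The "Team1", "Team2" and "Team3" variables list the team IDs that link each pair of people. It makes no difference which person or team ID is listed first and which second. Teams range in size from 2 to 8 members.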

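The counts are easy to get with a self-join, which I think is easiest to do with sqldf. (Note that I may find sqldf the easiest only because I'm not very good with data.table.) Edited to include @G.Grothendieck's suggestion:


I'll leave renaming the columns to you.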
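
As a rough illustration of the self-join idea, here is a minimal sketch in base R using merge instead of sqldf. It is not the query from this answer. The data frame name dd follows the answers in this thread; the intermediate names (pairs, counts, teams, result) are assumptions made for the example.

```r
# Example data from the question (the name dd follows the answers).
dd <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)

# Self-join on Team: every person is paired with every member of the same team.
# The suffixes turn the two Person columns into Person1 and Person2.
pairs <- merge(dd, dd, by = "Team", suffixes = c("1", "2"))

# Keep each unordered pair once (and drop self-pairs).
pairs <- pairs[pairs$Person1 < pairs$Person2, ]

# Number of shared teams per pair.
counts <- aggregate(Team ~ Person1 + Person2, data = pairs, FUN = length)
names(counts)[3] <- "Count"

# Comma-separated list of the shared team IDs per pair.
teams <- aggregate(Team ~ Person1 + Person2, data = pairs,
                   FUN = function(x) paste(x, collapse = ","))
names(teams)[3] <- "Teams"

result <- merge(counts, teams, by = c("Person1", "Person2"))
result
```

The condition Person1 < Person2 is what keeps the pair (10, 11) from also appearing as (11, 10).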

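Following @Gregor, and using Gregor's data, I tried to add the team columns. I can't give you exactly what you asked for, but this may be useful. I used full_join from the development version of dplyr (dplyr 0.4) and did the following. With combn I created a data frame containing all combinations of people for each team and saved it as the object a. Then I split a by team and used full_join. In this way I tried to create the team columns, at least for teams 100 and 200. I used rename to change the column names and select to order the columns the way you wanted.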

library(dplyr)

group_by(dd, Team) %>%
do(data.frame(t(combn(.$Person, 2)))) %>%
data.frame() ->a;
full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y) %>%
select(Person1, Person2, Team1, Team2)

#  Person1 Person2 Team1 Team2
#1      10      11   100   200
#2      10      12   100    NA
#3      11      12   100    NA
#4      10      14    NA   200
#5      10      15    NA   200
#6      11      14    NA   200
#7      11      15    NA   200
#8      14      15    NA   200
Edit

I'm sure there are better ways to do this, but this is the closest I could get. In this version I tried to add the count using another join.

group_by(dd, Team) %>%
do(data.frame(t(combn(.$Person, 2)))) %>%
data.frame() ->a;
full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
full_join(count(a, X1, X2), by = c("X1", "X2")) %>%
rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y, Count = n) %>%
select(Person1, Person2, Count, Team1, Team2)

#  Person1 Person2 Count Team1 Team2
#1      10      11     2   100   200
#2      10      12     1   100    NA
#3      11      12     1   100    NA
#4      10      14     1    NA   200
#5      10      15     1    NA   200
#6      11      14     1    NA   200
#7      11      15     1    NA   200
#8      14      15     1    NA   200
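
The join above is written for teams 100 and 200 only. As a hedged sketch of how the same wide layout could be produced for any number of teams, the following base R code numbers the shared teams within each pair and then reshapes to wide form. It is not part of the dplyr answer. The names long, wide and k are assumptions, and the code assumes every team has at least two members, as the question states.

```r
dd <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)

# Long table: one row per (pair, team). Assumes each team has >= 2 members,
# because combn() treats a single number x as the sequence 1:x.
long <- do.call(rbind, lapply(split(dd, dd$Team), function(g) {
  p <- t(combn(sort(g$Person), 2))
  data.frame(Person1 = p[, 1], Person2 = p[, 2], Team = g$Team[1])
}))

# k numbers the teams within each pair: 1 for the first shared team, 2 for the next, ...
long$k <- ave(long$Team, long$Person1, long$Person2, FUN = seq_along)

# Count is the number of teams each pair shares.
long$Count <- ave(long$Team, long$Person1, long$Person2, FUN = length)

# Spread the team IDs into Team1, Team2, ... columns (one column per value of k).
wide <- reshape(long,
                idvar = c("Person1", "Person2", "Count"),
                timevar = "k", v.names = "Team",
                direction = "wide", sep = "")
wide
```

With more teams per pair, additional columns (Team3 and so on) appear without changing the code.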

Here is a general solution:

library(dplyr)
library(reshape2)

find.friends <- function(d,n=2) {
    d$exist <- T

    z <- dcast(d,Person~Team,value.var='exist')
    #       Person  100  200
    #     1     10 TRUE TRUE
    #     2     11 TRUE TRUE
    #     3     12 TRUE   NA
    #     4     14   NA TRUE
    #     5     15   NA TRUE


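    # lapply (rather than sapply) always returns a list, which do.call(rbind, ...) expects
    pairs.per.team <- lapply(
                        sort(unique(d$Team)),
                        function(team) {
                            # look the team column up by name; z[,100] would mean column number 100
                            non.na <- !is.na(z[[as.character(team)]])
                            if (sum(non.na)<n) return()
                            combns <- t(combn(z$Person[non.na],n))
                            cbind(combns,team)
                        }
    )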
    df <- as.data.frame(do.call(rbind,pairs.per.team))    
    if (nrow(df)==0) return()

    persons <- sprintf('Person%i',1:n)
    colnames(df)[1:n] <- persons
    #       Person1 Person2 team
    #     1      10      11  100
    #     2      10      12  100
    #     3      11      12  100
    #     4      10      11  200
    #     5      10      14  200
    #     6      10      15  200
    #     7      11      14  200
    #     8      11      15  200
    #     9      14      15  200
    # Personally, I find the data frame above most suitable for further analysis.
    # The following code is needed only to make the output compatible with the author's one


    df2 <- df %>% 
            grouped_df(as.list(persons)) %>% 
            mutate(i.team=paste0('team',seq_along(team))) 
    #       Person1 Person2 team i.team
    #     1      10      11  100  team1
    #     2      10      12  100  team1
    #     3      11      12  100  team1
    #     4      10      11  200  team2
    #     5      10      14  200  team1
    #     6      10      15  200  team1
    #     7      11      14  200  team1
    #     8      11      15  200  team1
    #     9      14      15  200  team1

    # count number of teams per pair
    df2.count <- df %>% 
                    grouped_df(as.list(persons)) %>% 
                    summarize(cnt=length(team))

    # reshape the data
    df3 <- dcast(df2,
         as.formula(sprintf('%s~i.team',paste(persons,collapse='+'))),
         value.var='team'
         )

    df3$count <- df2.count$cnt
    df3
}
You should get the desired output.

By changing n, you can also find triples, quadruples, and so on.
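
To illustrate what changing n means, here is a small base R sketch that lists every n-person group within each team. It is not the find.friends function above. The helper name groups_per_team is hypothetical, and the Team column it returns is a character vector because it is taken from the names produced by split.

```r
d <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)

# Hypothetical helper: all groups of n people within each team,
# one row per (group, team). Intended for n >= 2.
groups_per_team <- function(d, n = 2) {
  per_team <- lapply(split(d$Person, d$Team), function(p) {
    if (length(p) < n) return(NULL)   # team too small for a group of size n
    t(combn(sort(p), n))              # one row per combination
  })
  # Drop the teams that were too small.
  per_team <- per_team[!vapply(per_team, is.null, logical(1))]
  if (length(per_team) == 0) return(NULL)

  out <- data.frame(
    do.call(rbind, per_team),
    Team = rep(names(per_team), vapply(per_team, nrow, integer(1)))
  )
  names(out)[seq_len(n)] <- paste0("Person", seq_len(n))
  out
}

triples <- groups_per_team(d, n = 3)
triples
```

For the example data, team 100 contributes one triple and team 200 contributes four.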

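Here is a "data.table" solution that seems to do what you're after (albeit with a lot of code):


Update: a step-by-step version of the above

To understand what is happening above, here is a step-by-step approach: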

## The following would be a long data.table with 4 columns:
##   Team, V1, ind, and time
step1 <- as.data.table(d)[
  , combn(Person, 2), by = Team][
    , ind := paste0("Person", c(1, 2))][
      , time := sequence(.N), by = list(Team, ind)]
head(step1)
#    Team V1     ind time
# 1:  100 10 Person1    1
# 2:  100 11 Person2    1
# 3:  100 10 Person1    2
# 4:  100 12 Person2    2
# 5:  100 11 Person1    3
# 6:  100 12 Person2    3

## Here, we make the data "wide"
step2 <- dcast.data.table(step1, time + Team ~ ind, value.var = "V1")
step2
#    time Team Person1 Person2
# 1:    1  100      10      11
# 2:    1  200      10      11
# 3:    2  100      10      12
# 4:    2  200      10      14
# 5:    3  100      11      12
# 6:    3  200      10      15
# 7:    4  200      11      14
# 8:    5  200      11      15
# 9:    6  200      14      15

## Create a "count" column and a "time" column,
##   grouped by "Person1" and "Person2".
##   Count is for the count column.
##   Time is for going to a wide format
step3 <- step2[, c("count", "time") := list(.N, sequence(.N)), 
               by = list(Person1, Person2)]
step3
#    time Team Person1 Person2 count
# 1:    1  100      10      11     2
# 2:    2  200      10      11     2
# 3:    1  100      10      12     1
# 4:    1  200      10      14     1
# 5:    1  100      11      12     1
# 6:    1  200      10      15     1
# 7:    1  200      11      14     1
# 8:    1  200      11      15     1
# 9:    1  200      14      15     1

## The final step of going wide
out <- dcast.data.table(step3, Person1 + Person2 + count ~ time, 
                        value.var = "Team")
out
#    Person1 Person2 count   1   2
# 1:      10      11     2 100 200
# 2:      10      12     1 100  NA
# 3:      10      14     1 200  NA
# 4:      10      15     1 200  NA
# 5:      11      12     1 100  NA
# 6:      11      14     1 200  NA
# 7:      11      15     1 200  NA
# 8:      14      15     1 200  NA
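It's really annoying when people say "my data looks like ..." and then post an example that doesn't allow the results of an analysis to be compared with the expected result. Learn to post the output of part of the data and the exact expected output, perhaps the output of dput(head(dat,10)).

@bondedust, the desired output corresponds exactly to the data provided.

Almost exactly... one row is missing. Persons 14 and 15 are both on team 200.

It looks like your answer and mine are both join.

@Gregor I think so too. I know very little about SQL, but is there a way to do something like full_join? If there is, it would be a good suggestion for the OP.

I believe FULL JOIN is the command, but I'm going to try Gabor's suggestion now.

@Gregor Thanks for the suggestion. Your answer with Gabor's suggestion looks good now.

Well explained. :) This is my study material for learning data.table today. +1

@AnandaMahto OK. I tried crossprod, but it only gives the counts.

@AnandaMahto, your answer has inspired me to learn data.table.

@MaratTalipov, if you want more inspiration, and if you haven't done so already, read .

I guess that since you're using %>%, you meant library(dplyr) rather than library(plyr)? Also, is there any reason to use "reshape" rather than "reshape2" or "tidyr" (whichever is more efficient)?

This modification would add a Teams column containing a comma-separated list of teams: sqldf("select dd1.Person Person1, dd2.Person Person2, count(*) count, group_concat(dd1.Team) Teams from dd dd1 inner join dd dd2 on dd1.Team = dd2.Team and dd1.Person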
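
The crossprod idea mentioned in the comments can be sketched as follows. This is an illustration under my own naming (incidence, shared), not code posted by any commenter. It shows why the approach yields counts only: the team IDs are summed away in the matrix product.

```r
dd <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)

# Team-by-person incidence matrix: 1 where a person belongs to a team.
incidence <- unclass(table(dd$Team, dd$Person))

# t(incidence) %*% incidence: entry [i, j] is the number of teams
# that persons i and j have in common (the diagonal is teams per person).
shared <- crossprod(incidence)
shared
```

Here shared["10", "11"] is 2 because persons 10 and 11 are together on teams 100 and 200, but the matrix does not say which teams those are.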
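
Call for the general solution (find.friends) above:

find.friends(d, n = 2)

Code for the data.table solution above, as a single expression:

library(data.table)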
dcast.data.table(
  dcast.data.table(
    as.data.table(d)[, combn(Person, 2), by = Team][
      , ind := paste0("Person", c(1, 2))][
        , time := sequence(.N), by = list(Team, ind)], 
    time + Team ~ ind, value.var = "V1")[
      , c("count", "time") := list(.N, sequence(.N)), by = list(Person1, Person2)],
  Person1 + Person2 + count ~ time, value.var = "Team")
#    Person1 Person2 count   1   2
# 1:      10      11     2 100 200
# 2:      10      12     1 100  NA
# 3:      10      14     1 200  NA
# 4:      10      15     1 200  NA
# 5:      11      12     1 100  NA
# 6:      11      14     1 200  NA
# 7:      11      15     1 200  NA
# 8:      14      15     1 200  NA