Restructuring team-level data into person-level data in R (while retaining team-level information)
My current data looks like this:
Person Team
10 100
11 100
12 100
10 200
11 200
14 200
15 200
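I want to infer who knows whom based on the teams they have been on together. I also want to count how many times each pair of people has been on a team together, and I want to keep track of the team IDs that link each pair. In other words, I want to create a dataset that looks like this: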
Person1 Person2 Count Team1 Team2 Team3
10 11 2 100 200 NA
10 12 1 100 NA NA
11 12 1 100 NA NA
10 14 1 200 NA NA
10 15 1 200 NA NA
11 14 1 200 NA NA
11 15 1 200 NA NA
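The resulting dataset captures the relationships that can be inferred from the teams listed in the original dataset. The "Count" variable reflects the number of times a pair of people were on a team together. The "Team1", "Team2", and "Team3" variables list the team IDs that link each pair of people to one another. It makes no difference which person or team ID is listed first versus second. Teams range in size from 2 to 8 members.

The counts are easy to get with a self-join, which I find easiest to do with sqldf. (Note that I may find sqldf easiest only because I am not very good with data.table.) Edited to include @G.Grothendieck's suggestion: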
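The sqldf query is not shown at this point. As a rough stand-in, the self-join idea can be sketched in base R with merge() instead of sqldf. This is a substitution for illustration, not the code from this answer; the data frame name dd and the output column names (Person1, Person2, Count, Teams) are assumptions.

```r
# Question's data; the name `dd` is an assumption borrowed from the code further down.
dd <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)

# Self-join on Team: every person paired with every teammate (including themselves).
pairs <- merge(dd, dd, by = "Team", suffixes = c("1", "2"))

# Keep each unordered pair once by requiring Person1 < Person2.
pairs <- pairs[pairs$Person1 < pairs$Person2, ]

# Number of shared teams per pair.
counts <- aggregate(Team ~ Person1 + Person2, data = pairs, FUN = length)
names(counts)[3] <- "Count"

# Comma-separated list of the shared team IDs per pair.
teams <- aggregate(Team ~ Person1 + Person2, data = pairs,
                   FUN = function(x) paste(sort(x), collapse = ","))
names(teams)[3] <- "Teams"

result <- merge(counts, teams, by = c("Person1", "Person2"))
result
```

The Person1 < Person2 filter is what drops self-pairs and mirrored duplicates; it relies on the person IDs being comparable, which holds for the numeric IDs in the example.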
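I will leave renaming the columns to you.

Following @Gregor, and using Gregor's data, I tried to add the team columns. I cannot deliver exactly what you asked for, but this may be useful. I did the following using full_join in the development version of dplyr (dplyr 0.4). I used combn to create a data frame with all combinations of people for each team, and saved it as the object a. Then I subset a by team and used full_join. This way I tried to create the team columns, at least for teams 100 and 200. I used rename to change the column names and select to put the columns in your order.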
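library(dplyr)

# dd is the Person/Team data frame (Gregor's data).
# All within-team pairs, one row per pair (columns Team, X1, X2).
a <- group_by(dd, Team) %>%
  do(data.frame(t(combn(.$Person, 2)))) %>%
  data.frame()

# Join the pairs of team 100 with the pairs of team 200.
full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
  rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y) %>%
  select(Person1, Person2, Team1, Team2)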
# Person1 Person2 Team1 Team2
#1 10 11 100 200
#2 10 12 100 NA
#3 11 12 100 NA
#4 10 14 NA 200
#5 10 15 NA 200
#6 11 14 NA 200
#7 11 15 NA 200
#8 14 15 NA 200
EDIT

I am sure there are better ways to do this, but this is the closest I could get. In this version I tried to add the count with another join.
# All within-team pairs, one row per pair (columns Team, X1, X2).
a <- group_by(dd, Team) %>%
  do(data.frame(t(combn(.$Person, 2)))) %>%
  data.frame()

# Same join as before, plus a second join that attaches the per-pair count.
full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
  full_join(count(a, X1, X2), by = c("X1", "X2")) %>%
  rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y, Count = n) %>%
  select(Person1, Person2, Count, Team1, Team2)
# Person1 Person2 Count Team1 Team2
#1 10 11 2 100 200
#2 10 12 1 100 NA
#3 11 12 1 100 NA
#4 10 14 1 NA 200
#5 10 15 1 NA 200
#6 11 14 1 NA 200
#7 11 15 1 NA 200
#8 14 15 1 NA 200
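The joins above are hard-coded for teams 100 and 200. As a sketch of how Team1, Team2, ... columns could be filled for any number of teams, here is a base R variant built on combn and reshape. It is only an illustration under assumptions: the names dd, pair_list, long, slot and wide are made up for this example, and the smaller person ID is always placed first.

```r
# Question's data; the name `dd` is an assumption.
dd <- data.frame(
  Person = c(10, 11, 12, 10, 11, 14, 15),
  Team   = c(100, 100, 100, 200, 200, 200, 200)
)

# One two-column matrix of unordered pairs per team (smaller ID first).
pair_list <- lapply(split(dd$Person, dd$Team), function(p) t(combn(sort(p), 2)))

# Stack the pairs into a long table with one row per (pair, team).
long <- data.frame(
  Person1 = unlist(lapply(pair_list, function(m) m[, 1]), use.names = FALSE),
  Person2 = unlist(lapply(pair_list, function(m) m[, 2]), use.names = FALSE),
  Team    = rep(as.numeric(names(pair_list)),
                vapply(pair_list, nrow, integer(1)))
)

# Per pair: how many teams they share, and a running index 1, 2, ... of those teams.
key <- paste(long$Person1, long$Person2)
long$Count <- ave(long$Team, key, FUN = length)
long$slot  <- ave(long$Team, key, FUN = seq_along)

# Spread the running index into Team1, Team2, ... columns.
wide <- reshape(long,
                idvar     = c("Person1", "Person2", "Count"),
                timevar   = "slot",
                direction = "wide",
                sep       = "")
wide
```

Unlike the joined version, a pair that only shares team 200 gets that ID in Team1 rather than Team2, which matches the layout asked for in the question.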
Here is a general solution:
library(dplyr)
library(reshape2)

find.friends <- function(d, n = 2) {
  d$exist <- TRUE
  z <- dcast(d, Person ~ Team, value.var = 'exist')
  #   Person  100  200
  # 1     10 TRUE TRUE
  # 2     11 TRUE TRUE
  # 3     12 TRUE   NA
  # 4     14   NA TRUE
  # 5     15   NA TRUE

  # lapply (not sapply) so the result is always a list, even when every
  # team yields the same number of combinations
  pairs.per.team <- lapply(
    sort(unique(d$Team)),
    function(team) {
      # index the column by name so that numeric team IDs work as well
      non.na <- !is.na(z[, as.character(team)])
      if (sum(non.na) < n) return(NULL)
      combns <- t(combn(z$Person[non.na], n))
      cbind(combns, team)
    }
  )
  df <- as.data.frame(do.call(rbind, pairs.per.team))
  if (nrow(df) == 0) return(NULL)
  persons <- sprintf('Person%i', 1:n)
  colnames(df)[1:n] <- persons
  #   Person1 Person2 team
  # 1      10      11  100
  # 2      10      12  100
  # 3      11      12  100
  # 4      10      11  200
  # 5      10      14  200
  # 6      10      15  200
  # 7      11      14  200
  # 8      11      15  200
  # 9      14      15  200

  # Personally, I find the data frame above most suitable for further analysis.
  # The following code is needed only to make the output compatible with the author's one

  # grouped_df takes a character vector of grouping columns in current dplyr
  df2 <- df %>%
    grouped_df(persons) %>%
    mutate(i.team = paste0('team', seq_along(team)))
  #   Person1 Person2 team i.team
  # 1      10      11  100  team1
  # 2      10      12  100  team1
  # 3      11      12  100  team1
  # 4      10      11  200  team2
  # 5      10      14  200  team1
  # 6      10      15  200  team1
  # 7      11      14  200  team1
  # 8      11      15  200  team1
  # 9      14      15  200  team1

  # count number of teams per pair
  df2.count <- df %>%
    grouped_df(persons) %>%
    summarize(cnt = length(team))

  # reshape the data
  df3 <- dcast(as.data.frame(df2),
               as.formula(sprintf('%s~i.team', paste(persons, collapse = '+'))),
               value.var = 'team')
  df3$count <- df2.count$cnt
  df3
}
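Calling find.friends(d, n = 2) should give the desired output. By changing n, you can also find triads, tetrads, and so on.

Here is a "data.table" solution that seems to do what you want (albeit with a lot of code):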
Update: a step-by-step version of the above

To understand what is going on above, here is a step-by-step approach:
## The following would be a long data.table with 4 columns:
## Team, V1, ind, and time
step1 <- as.data.table(d)[
  , combn(Person, 2), by = Team][
  , ind := paste0("Person", c(1, 2))][
  , time := sequence(.N), by = list(Team, ind)]
head(step1)
# Team V1 ind time
# 1: 100 10 Person1 1
# 2: 100 11 Person2 1
# 3: 100 10 Person1 2
# 4: 100 12 Person2 2
# 5: 100 11 Person1 3
# 6: 100 12 Person2 3
## Here, we make the data "wide"
step2 <- dcast.data.table(step1, time + Team ~ ind, value.var = "V1")
step2
# time Team Person1 Person2
# 1: 1 100 10 11
# 2: 1 200 10 11
# 3: 2 100 10 12
# 4: 2 200 10 14
# 5: 3 100 11 12
# 6: 3 200 10 15
# 7: 4 200 11 14
# 8: 5 200 11 15
# 9: 6 200 14 15
## Create a "count" column and a "time" column,
## grouped by "Person1" and "Person2".
## Count is for the count column.
## Time is for going to a wide format
step3 <- step2[, c("count", "time") := list(.N, sequence(.N)),
               by = list(Person1, Person2)]
step3
# time Team Person1 Person2 count
# 1: 1 100 10 11 2
# 2: 2 200 10 11 2
# 3: 1 100 10 12 1
# 4: 1 200 10 14 1
# 5: 1 100 11 12 1
# 6: 1 200 10 15 1
# 7: 1 200 11 14 1
# 8: 1 200 11 15 1
# 9: 1 200 14 15 1
## The final step of going wide
out <- dcast.data.table(step3, Person1 + Person2 + count ~ time,
                        value.var = "Team")
out
# Person1 Person2 count 1 2
# 1: 10 11 2 100 200
# 2: 10 12 1 100 NA
# 3: 10 14 1 200 NA
# 4: 10 15 1 200 NA
# 5: 11 12 1 100 NA
# 6: 11 14 1 200 NA
# 7: 11 15 1 200 NA
# 8: 14 15 1 200 NA
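When people say "my data looks like ..." and then post an example that does not allow the results of an analysis to be compared with the expected result, it is really annoying. Learn to post the output for part of the data together with the exact expected output. Possibly the output of dput(head(dat, 10)).

@bondedust, the desired output corresponds exactly to the data provided.

Almost exactly... one row is missing. Persons 14 and 15 are both on team 200. It looks like your answer and mine are both a join.

@Gregor I think so too. I know little about SQL, but is there a way to do something like full_join? If there is, that would be a good suggestion for the OP.

I believe FULL JOIN is the command, but I will try Gabor's suggestion now.

@Gregor Thanks for your suggestion. Your answer with Gabor's suggestion looks good now. Well explained. :)

This is my material for learning data.table today. +1

@AnandaMahto Okay. I tried crossprod, but it only gives the counts.

@AnandaMahto, your answer has inspired me to learn data.table.

@MaratTalipov, if you want more inspiration, and have not done so already, please have a read.

I think that since you are using %>%, you meant library(dplyr) rather than library(plyr)? Also, is there any reason to use "reshape" rather than "reshape2" or "tidyr" (whichever is more efficient)?

This modification adds a Teams column containing a comma-separated list of teams: sqldf("select dd1.Person Person1, dd2.Person Person2, count(*) count, group_concat(dd1.Team) Teams from dd dd1 inner join dd dd2 on dd1.Team = dd2.Team and dd1.Person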
Call for the general solution:
find.friends(d, n = 2)
The data.table solution:
library(data.table)
dcast.data.table(
  dcast.data.table(
    as.data.table(d)[, combn(Person, 2), by = Team][
      , ind := paste0("Person", c(1, 2))][
      , time := sequence(.N), by = list(Team, ind)],
    time + Team ~ ind, value.var = "V1")[
    , c("count", "time") := list(.N, sequence(.N)), by = list(Person1, Person2)],
  Person1 + Person2 + count ~ time, value.var = "Team")
# Person1 Person2 count 1 2
# 1: 10 11 2 100 200
# 2: 10 12 1 100 NA
# 3: 10 14 1 200 NA
# 4: 10 15 1 200 NA
# 5: 11 12 1 100 NA
# 6: 11 14 1 200 NA
# 7: 11 15 1 200 NA
# 8: 14 15 1 200 NA