基于行折叠数据帧并计算加权平均值r
我想折叠以下数据帧基于行折叠数据帧并计算加权平均值r,r,bioinformatics,collapse,R,Bioinformatics,Collapse,我想折叠以下数据帧 df Chromosome Start End lengthMB imba log2 Cn mCn Cn_ chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99 chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2 chr1 9168284 9598904 0.431 NA -
df
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92
通过仅匹配具有相同Cn和mCn值的连续行,我希望折叠这些行。例如,对于前4行,我们有以下内容:
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
我想折叠具有相同Cn和mCn分数的连续行,因此对于前三行,Cn和mCn列上的每一行分别有一个“2”和“1”,并更改结束列以反映此折叠
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99
但是我还想更改Cn\u列
,使其成为该行lengthMB
分数的加权平均值。因此,对于前三行,计算如下:
((8.023/9.599) * 1.99) + ((1.145/9.599) * 2) + ((0.431/9.599) * 1.91) = 1.987
前四条唯一染色体的输出:
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9598904 1.145 0.030441784 0.000601976 2 1 1.79
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9598904 8.185 0.033499045 -0.031338811 1 0 0.836
chr3 9598904 35402000 0.883 0.049003807 -0.021413391 2 1 2.02
chr4 0 9598904 11.065 0.035092332 -0.022844471 2 1 2
尝试过这样的方法,但我也不知道如何计算
squish_segments <- function(sample) {
setDT(sample)[, .ind:= cumsum(c(TRUE,Start[-1]!=End[-.N])),
list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_)][,
list(Chr=Chromosome[1], Start=Start[1], End=End[.N]),
list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_, .ind)][,.ind:=NULL][]
}
squish_segments首先,请提供数据集的dput
输出,使您的问题更具再现性
我认为这是你在低水平上想要的
setkey(df, Chromosome, Cn, mCn, Start)
df[, list(
Start=min(Start),
End=max(End),
lengthMB=lengthMB[1],
imba=imba[1],
log2=log2[1],
Cn_=weighted.mean(Cn_, lengthMB)
), keyby=list(Chromosome, Cn , mCn)]
可以识别唯一的“事件”(具有相同Cn和mCn分数的连续行),然后简单地循环这些事件并相应地修改行。不是最有效率的,但应该做这项工作
txt <- "Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92"
df <- read.table(text=txt, header=T)
#identify each unique event
df$eventid <- with(df, cumsum(c(1,diff(as.numeric(factor(Chromosome)))!=0 | diff(Cn)!=0 | diff(mCn)!=0)))
#loop through events
for(i in 1:max(df$eventid)){
#identify rows in df with ith event
rows.i <- which(df$eventid == i)
df[rows.i,] <- within(df[rows.i,],{
#calculate values of interest and assign to first row of event
Start[1] <- min(Start)
End[1] <- max(End)
Cn_[1] <- sum((lengthMB/sum(lengthMB))*Cn_)
lengthMB[1] <- sum(lengthMB)
})
#drop all but first row
if(length(rows.i) > 1) df <- df[-rows.i[-1],]
} #end i
这是一种dplyr
方法
library(dplyr)
df = read.table(text=
"Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92", header=T)
df %>%
mutate(Consec = ifelse(Chromosome == dplyr::lag(Chromosome, default = Chromosome[1]) & ## flag consecutive matching chromosomes
Cn == dplyr::lag(Cn, default = Cn[1]) &
mCn == dplyr::lag(mCn, default = mCn[1]), 0, 1),
Consec = cumsum(Consec)) %>% ## create an id for consecutive matching chromosomes
group_by(Chromosome, Cn, mCn, Consec) %>%
summarize(Cn_ = sum(lengthMB * Cn_)/sum(lengthMB),
Start = min(Start),
End = max(End),
lengthMB = first(lengthMB),
imba= first(imba),
log2= first(log2)) %>%
ungroup() %>% ## only if you want to ungroup
select(Chromosome,Start,End, lengthMB,imba,log2,Cn,mCn,Cn_) %>% ## to re arrange column order
arrange(Chromosome, Start)
# Chromosome Start End lengthMB imba log2 Cn mCn Cn_
# (fctr) (int) (int) (dbl) (dbl) (dbl) (int) (int) (dbl)
# 1 chr1 0 9598904 8.023 0.02690512 -0.001671481 2 1 1.9876008
# 2 chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.0100000
# 3 chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.8900000
# 4 chr2 8022930 9598904 1.145 0.03044178 0.000601976 2 1 1.7879569
# 5 chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.8700000
# 6 chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.0100000
# 7 chr3 0 8040000 1.479 NA 0.000969256 2 1 2.0000000
# 8 chr3 8040000 9598904 8.185 0.03349904 -0.031338811 1 0 0.8541823
# 9 chr3 9598904 35402000 0.883 0.04900381 -0.021413391 2 1 2.0187143
# 10 chr4 0 9598904 11.065 0.03509233 -0.022844471 2 1 1.9956599
请注意,lag
是一个dplyr
函数,但也是一个stats
包函数。我必须编写dplyr::lag
,否则当我试图在lag
中指定default=
时会发生冲突。我不知道您或其他任何人是否可以复制此问题。如果我正确理解了您的问题,您可以使用数据在一行中完成。表
快速分组
library(data.table)
dt[, Cn_dependent := sum((lengthMB/sum(lengthMB)) * Cn_),
by = .(Chromosome, Cn, mCn)]
要获得此信息:
> dt
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ Cn_dependent
1: chr1 0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99 1.987601
2: chr1 8022945 9168284 1.145 0.03044178 0.000601976 2 1 2.00 1.987601
3: chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 1.987601
4: chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01 3.010000
5: chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89 2.890000
6: chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87 1.882285
7: chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57 1.882285
8: chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87 1.870000
9: chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.01 1.882285
要通过染色体
、Cn
和mCn
折叠,可以使用键和唯一
> setkey(dt, "Chromosome", "Cn", "mCn")
> unique(dt)
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ Cn_dependent
1: chr1 0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99 1.987601
2: chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01 3.010000
3: chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87 1.870000
4: chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87 1.882285
5: chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89 2.890000
以下是数据的dput
> dput(dt)
structure(list(Chromosome = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("chr1", "chr2"), class = "factor"), Start = c(0L,
8022945L, 9168284L, 9598904L, 0L, 8022930L, 9168284L, 9598904L,
31392788L), End = c(8022945L, 9168284L, 9598904L, 31392788L,
8022930L, 9168284L, 9598904L, 31392788L, 35402000L), lengthMB = c(8.023,
1.145, 0.431, 21.794, 8.023, 1.145, 0.431, 21.794, 1.164), imba = c(0.026905119,
0.030441784, NA, 0.036011994, 0.026905119, 0.030441784, NA, 0.036011994,
0.029733771), log2 = c(-0.001671481, 0.000601976, -0.024952441,
0.002151497, -0.001671481, 0.000601976, -0.024952441, 0.002151497,
0.003149921), Cn = c(2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L), mCn = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L), Cn_ = c(1.99, 2, 1.91, 3.01,
2.89, 1.87, 1.57, 1.87, 2.01)), .Names = c("Chromosome", "Start",
"End", "lengthMB", "imba", "log2", "Cn", "mCn", "Cn_"), class = c("data.table",
"data.frame"), row.names = c(NA, -9L), .internal.selfref = <pointer: 0x26abf68>)
>dput(dt)
结构(列表)染色体=结构(c(1L,1L,1L,1L,2L,2L,
2L,2L,2L),.Label=c(“chr1”,“chr2”),class=“factor”),Start=c(0L,
8022945L、9168284L、9598904L、0L、8022930L、9168284L、9598904L、,
31392788L),末端=c(8022945L,9168284L,9598904L,31392788L,
8022930L、9168284L、9598904L、31392788L、35402000L),长度B=c(8.023,
1.145,0.431,21.794,8.023,1.145,0.431,21.794,1.164),imba=c(0.026905119,
0.030441784,北美,0.036011994,0.026905119,0.030441784,北美,0.036011994,
0.029733771),log2=c(-0.001671481,0.000601976,-0.024952441,
0.002151497, -0.001671481, 0.000601976, -0.024952441, 0.002151497,
0.003149921),Cn=c(2L,2L,2L,3L,3L,2L,2L,2L,2L,2L),mCn=c(1L,
1L,1L,1L,1L,1L,1L,0L,1L),Cn=c(1.99,2,1.91,3.01,
2.89,1.87,1.57,1.87,2.01),名称=c(“染色体”,“起始”,
“End”、“lengthMB”、“imba”、“log2”、“Cn”、“mCn”、“Cn_389;”),class=c(“data.table”,
“data.frame”),row.names=c(NA,-9L),.internal.selfref=)
如何((8.023/9.599)*1.99)+((1.145/9.599)*2)+((0.431/9.599)*1.91)=1.9435
?我发现1.987601
。此外,当您折叠这些行时,您希望为包含不同信息的列保留哪些值?例如,Start
,End
,imba
,log2
。我想保留所有列我知道你想保留它们,但你没有指定你想要的值。例如,带有Cn=2
和mCn=1
的chr1
最初有3行,因此在这些列中有3个不同的值。此外,在您想要的输出中,您有一条染色体多次具有相同的Cn
和mCn
。检查第4行和第6行的chr2
,检查第7行和第9行的chr3
。看起来你没有因为某种原因折叠这些事件。是的,那是因为我只想折叠连续事件,而不仅仅是具有相同染色体的类似Cn和mCn唯一事件。这就是为什么第4行、第6行、第7行和第9行没有被折叠的原因。我很快就会看一看。嘿,我已经包括了接下来两个唯一染色体值的输出,因为这并没有给我我想要的输出!谢谢,我将根据您提供的新数据再次检查我的流程。但是,不管我怎么努力,我都不会让((8.023/9.599)*1.99)+((1.145/9.599)*2)+((0.431/9.599)*1.91)
等于1.9435
。
> dput(dt)
structure(list(Chromosome = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("chr1", "chr2"), class = "factor"), Start = c(0L,
8022945L, 9168284L, 9598904L, 0L, 8022930L, 9168284L, 9598904L,
31392788L), End = c(8022945L, 9168284L, 9598904L, 31392788L,
8022930L, 9168284L, 9598904L, 31392788L, 35402000L), lengthMB = c(8.023,
1.145, 0.431, 21.794, 8.023, 1.145, 0.431, 21.794, 1.164), imba = c(0.026905119,
0.030441784, NA, 0.036011994, 0.026905119, 0.030441784, NA, 0.036011994,
0.029733771), log2 = c(-0.001671481, 0.000601976, -0.024952441,
0.002151497, -0.001671481, 0.000601976, -0.024952441, 0.002151497,
0.003149921), Cn = c(2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L), mCn = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L), Cn_ = c(1.99, 2, 1.91, 3.01,
2.89, 1.87, 1.57, 1.87, 2.01)), .Names = c("Chromosome", "Start",
"End", "lengthMB", "imba", "log2", "Cn", "mCn", "Cn_"), class = c("data.table",
"data.frame"), row.names = c(NA, -9L), .internal.selfref = <pointer: 0x26abf68>)