Subsetting a data frame in R based on gaps between values in a column
I have a data frame (df) with several columns and rows, for example:
A    B   C
0.6  a.  b
0.9  c.  d
1.1  e.  f
1.2  g.  h
1.4  I   l
1.5  m.  n
5.0  o.  p
5.3  q.  r
5.6  s.  t
6.1  u   v
6.5  w.  z
6.9  y   a
7.0  b.  c
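The code I'm looking for should compute the difference between each pair of consecutive values in column A (0.9 - 0.6 = 0.3, 1.1 - 0.9 = 0.2, and so on). If a difference is greater than some threshold (here set to 3, but it may vary), the code should keep a certain number of rows (here 3, but this may also vary) before and after the gap where the difference exceeds the threshold.
So in this case 5.0 - 1.5 = 3.5, which is greater than 3, so the 3 rows before 1.5 and the 3 rows after 5.0 are kept (along with the 1.5 and 5.0 rows themselves), and the rest are dropped.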
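For concreteness, the differences can be computed with diff(). This is only a sketch of the arithmetic, with the A column typed in by hand from the table above (the names A, d and threshold are just for illustration), not a full solution:

```r
# Column A from the example table, entered by hand (assumption: values as shown above)
A <- c(0.6, 0.9, 1.1, 1.2, 1.4, 1.5, 5.0, 5.3, 5.6, 6.1, 6.5, 6.9, 7.0)

threshold <- 3

# d[i] is A[i + 1] - A[i], so d has one element fewer than A
d <- diff(A)

# Position of the last row before each gap larger than the threshold
gap_start <- which(d > threshold)
gap_start        # 6, i.e. the gap lies between rows 6 (1.5) and 7 (5.0)
d[gap_start]     # 3.5
```

The index returned by which() points at the row just before the gap, so the row just after it is that index plus one.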
Do you know how to write such code?
Desired output:
A    B   C
1.1  e.  f
1.2  g.  h
1.4  I   l
1.5  m.  n
5.0  o.  p
5.3  q.  r
5.6  s.  t
6.1  u   v
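I have multiple data frames, so the values in column A differ from one to the next. The code should go through each data frame in turn and find the gaps in column A based on the chosen threshold.
The data are in dput format.
Input: data.frame df1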
df1 <-
structure(list(A = c(0.6, 0.9, 1.1, 1.2, 1.4,
1.5, 4, 4.3, 4.6, 5.1, 5.5, 5.9, 6),
B = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 2L), .Label = c("a.",
"b.", "c.", "e.", "g.", "I", "m.", "o.",
"q.", "s.", "u", "w.", "y"), class = "factor"),
C = structure(c(2L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 1L, 3L), .Label = c("a",
"b", "c", "d", "f", "h", "l", "n", "p",
"r", "t", "v", "z"), class = "factor")),
row.names = c(NA, -13L), class = "data.frame")
This is my df:
structure(list(POS = c(207687374L, 207689227L, 207690871L, 207691563L,
207693563L, 207694165L, 207694357L, 207738077L, 207739127L, 207740272L,
207740868L, 207747296L, 207747984L, 207748107L), SNP = c("rs12130494",
"rs4844601", "rs10863358", "rs77357299", "rs12043913", "rs61822967",
"rs11117991", "rs7515905", "rs3886100", "rs12038575", "rs34883952",
"rs1752684", "rs17046851", "rs10127904"), Std_iHS = c(-1.52176,
-1.51905, -1.50286, 0.656487, -1.45251, 0.84325, -1.06089, -1.41041,
1.29513, 1.21325, 0.456717, -1.00933, -1.71468, 0.265969)), row.names =
21:34, class = "data.frame")
Desired output:
structure(list(POS = c(207691563L, 207693563L, 207694165L, 207694357L,
207738077L, 207739127L, 207740272L, 207740868L), SNP = c("rs77357299",
"rs12043913", "rs61822967", "rs11117991", "rs7515905", "rs3886100",
"rs12038575", "rs34883952"), Std_iHS = c(0.656487, -1.45251, 0.84325,
-1.06089, -1.41041, 1.29513, 1.21325, 0.456717)), row.names = 24:31,
class = "data.frame")
It looks like your example data frame (df1) has no jump larger than 3.0, so the limit below is set to 2.0, but the following code should work. As written, it keeps two rows on either side of the row where the jump occurs; add more lag()/lead() terms to widen the window:
library(dplyr)

limit <- 2.0
df1 %>%   # df1 as defined in the question
  mutate(diffA = A - lag(A, 1)) %>%          # difference from the previous row
  mutate(over_limit = diffA > limit) %>%     # TRUE on the first row after a gap
  # before_limit: a jump occurred 1 or 2 rows earlier
  # after_limit:  a jump occurs 1 or 2 rows later
  mutate(before_limit = lag(over_limit, 1) | lag(over_limit, 2),
         after_limit = lead(over_limit, 1) | lead(over_limit, 2)) %>%
  rowwise() %>%
  mutate(subset_filter = any(over_limit, after_limit, before_limit)) %>%
  ungroup() %>%
  filter(subset_filter) %>%                  # rows where the flag is NA are dropped too
  select(-c(subset_filter, diffA, over_limit, before_limit, after_limit))
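Using base R, you can do the following:
limit <- 2
gap <- which(diff(df1$A) > limit)  # last row before each gap
idx <- unique(unlist(lapply(gap, function(x) (x - 3):(x + 4))))  # 3 rows before it up to 3 rows after the next row
df1[idx[idx >= 1 & idx <= nrow(df1)], ]  # ignore indices that fall outside the data frame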
A B C
3 1.1 e. f
4 1.2 g. h
5 1.4 I l
6 1.5 m. n
7 4.0 o. p
8 4.3 q. r
9 4.6 s. t
10 5.1 u v
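Since the question mentions several data frames whose threshold and window size may vary, the same idea can be wrapped in a function and applied with lapply(). This is a minimal sketch, not part of the original answers: the helper name subset_around_gaps, its arguments, and the two toy data frames are all hypothetical.

```r
# Hypothetical helper: for every gap larger than `threshold` in column `col`,
# keep the row before the gap, the row after it, and `n` extra rows on each side.
subset_around_gaps <- function(df, col = "A", threshold = 3, n = 3) {
  gap <- which(diff(df[[col]]) > threshold)   # last row before each gap
  if (length(gap) == 0) {
    return(df[0, , drop = FALSE])             # no gap found: return an empty data frame
  }
  idx <- unlist(lapply(gap, function(g) (g - n):(g + 1 + n)))
  idx <- sort(unique(idx[idx >= 1 & idx <= nrow(df)]))  # clip to valid rows
  df[idx, , drop = FALSE]
}

# Two toy data frames (assumed data) with gaps in different places
dfs <- list(
  first  = data.frame(A = c(0.6, 0.9, 1.1, 1.2, 1.4, 1.5, 5.0,
                            5.3, 5.6, 6.1, 6.5, 6.9, 7.0)),
  second = data.frame(A = c(1, 2, 10, 11, 12, 13, 14))
)

res <- lapply(dfs, subset_around_gaps, col = "A", threshold = 3, n = 3)
res$first$A    # 1.1 1.2 1.4 1.5 5.0 5.3 5.6 6.1
res$second$A   # 1 2 10 11 12 13 (window clipped at the first row)
```

Because the gap positions are found per data frame with diff(), the values on either side of the gap do not need to be known in advance. For the real data, `col` would be "POS" and the threshold would have to be chosen in POS units.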
Also, why is the first row (e. f) included in the desired result? That gives 4 rows before the big jump.
Sure, I'll try to use dput and add a reproducible dataset. Yes, the row 1.1 e. f should be included: the rows to keep are the 3 rows before the first value of the gap (1.5) and the 3 rows after the second value of the gap (4.0).
Take a look at these questions; they might help you.
Yes, exactly. I can get the `which` index of the rows where the gap is by setting a specific value. I tried that and it works for one data frame, but my problem is that the two values before and after the gap are always different depending on the data frame, so I don't know how to do it. Thanks anyway.
Thank you very much, I'm trying this code and it's very useful.