R merge function fails to find shared matches between data frames
Hi, I have the following two data frames:
# dataframe 1 --> clst1_trimmed
> head(clst1_trimmed)
# A tibble: 6 x 2
GeneName Clst.1
<fct> <dbl>
1 Cd74 1.20
2 Lyz2 1.02
3 Malat1 0.196
4 Ftl1 0.577
5 H2-Ab1 1.04
6 B2m 0.639
# dataframe2 --> immgen_trimmed
> head(immgen_trimmed)
# A tibble: 6 x 6
ProbeSetID GeneName Description Cell.A Cell.B Cell.C
<int> <fct> <fct> <dbl> <dbl> <dbl>
1 10344620 Cd74 " predicted gene 10568" 15.6 15.3 17.2
2 10344622 Cd74 " predicted gene 10568" 240. 255. 224.
3 10344624 Lyz2 " lysophospholipase 1" 421. 474. 349.
4 10344633 Malat1 " transcription elongation factor A (SII) 1" 802. 950. 864.
5 10344637 Flt1 " ATPase H+ transporting lysosomal V1 subunit H" 199. 262. 167.
6 10344653 Cd3e " opioid receptor kappa 1" 14.8 12.8 18.0
However, merging the two large data frames with the same approach fails:
> dim(sel_clst)
[1] 984 2
> dim(immgen_log2)
[1] 24922 212
merge2 <- merge(sel_clst, immgen_log2)
str(merge2)
'data.frame': 0 obs. of 213 variables:
$ GeneName : Factor w/ 984 levels "0610012G03Rik",..:
$ Cluster.1.Log2.Fold.Change : num
$ ProbeSetID : int
$ Description : Factor w/ 21246 levels " "," 1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase alpha)",..:
$ X.proB_CLP_BM. : num
$ X.proB_CLP_FL. : num
$ X.proB_FrA_BM. : num
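Any idea why this fails?
Try this (after backing up these data frames):
There is an options parameter, accessible via default.stringsAsFactors(), that can spare newcomers a lot of confusion over factor creation, but there is no adjustable default setting for strip.white.
Look at this transcript: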
> dat <- read.csv(text= "hd1 , hd2, hd3\n 1, a , c\n1,b,d\n")
> dat
hd1 hd2 hd3
1 1 a c
2 1 b d
> dput(dat)
structure(list(hd1 = c(1L, 1L), hd2 = structure(1:2, .Label = c(" a ",
"b"), class = "factor"), hd3 = structure(1:2, .Label = c(" c",
"d"), class = "factor")), .Names = c("hd1", "hd2", "hd3"), class = "data.frame", row.names = c(NA,
-2L))
> dat <- data.frame(
lapply(read.csv(text= "hd1 , hd2, hd3\n 1, a , c\n1,b,d\n"),
trimws)
)
# could also have used a two step process starting with the original `dat`
# dat[] <- lapply(dat, trimws) .... the `[]` preserves structure
> dat
hd1 hd2 hd3
1 1 a c
2 1 b d
> dput(dat)
structure(list(hd1 = structure(c(1L, 1L), .Label = "1", class = "factor"),
hd2 = structure(1:2, .Label = c("a", "b"), class = "factor"),
hd3 = structure(1:2, .Label = c("c", "d"), class = "factor")), .Names = c("hd1",
"hd2", "hd3"), row.names = c(NA, -2L), class = "data.frame")
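Note that in the transcript above trimws converted hd1 from integer to a text column. A minimal sketch of one way to avoid that, assuming the same sample input, is to trim only the character columns; the variable name is_text is introduced here for illustration, and stringsAsFactors is set explicitly so the result does not depend on the R version's default.

```r
# Same sample text as in the transcript; read text columns as character
dat <- read.csv(text = "hd1 , hd2, hd3\n 1, a , c\n1,b,d\n",
                stringsAsFactors = FALSE)

# Flag the character columns (is_text is an illustrative name)
is_text <- vapply(dat, is.character, logical(1))

# Trim only those columns; numeric columns keep their type
dat[is_text] <- lapply(dat[is_text], trimws)

str(dat)
```

Indexing with dat[is_text] on the left-hand side keeps the data.frame structure, in the same way the dat[] form mentioned in the transcript comments does.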
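Did you notice the leading whitespace in the values of your variable? Neither " Cd74" nor "Cd74 " will match "Cd74". There is a function named trimws for removing leading and trailing whitespace. I suggest first coercing all key columns to "character" and then trimming your values before retrying the match. It may also be worth looking at the data import command for an upstream fix.
Or use levels(df$var)
@Moody_Mudskipper: I'm always a bit wary of using levels
Yep, the whitespace was the problem. I didn't catch that. I guess with that realization this question is moot.
Can you pass trimws as an argument in my code here: immgen_dat
Maybe: immgen_dat
Can you explain in a few words why the data frame has to be constructed this way for it to work?
lapply just returns a list without any other class attribute. Use data.frame or as_tibble to restore the "data.frame" class attribute.
I suggest you look into fread. It's faster and safer.
You can use the argument strip.white=TRUE in read.table, and it can also be used in read.csv.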
> "Cd74" %in% sel_clst$GeneName
[1] TRUE
> "Cd74" %in% immgen_log2$GeneName
[1] FALSE
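The %in% checks above test one gene at a time. As a rough sketch, the same idea can be applied to a whole key column by comparing each value with its trimmed form. The vectors a_keys and b_keys and the helper has_padding are hypothetical stand-ins, not the asker's data.

```r
# Hypothetical key columns standing in for the two GeneName columns
a_keys <- c("Cd74", "Lyz2")
b_keys <- factor(c(" Cd74", " Lyz2", "Cd3e"))

# TRUE where a key carries leading or trailing whitespace
has_padding <- function(x) {
  x <- as.character(x)   # works for factors as well as character vectors
  x != trimws(x)
}

n_padded_a <- sum(has_padding(a_keys))
n_padded_b <- sum(has_padding(b_keys))

# Number of shared keys before and after trimming
shared_raw     <- length(intersect(a_keys, as.character(b_keys)))
shared_trimmed <- length(intersect(trimws(a_keys),
                                   trimws(as.character(b_keys))))
```

If shared_raw is zero while shared_trimmed is not, stray whitespace in the keys is a likely cause of the empty merge.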
levels(sel_clst$GeneName) <- trimws( levels( sel_clst$GeneName ))
levels(immgen_log2$GeneName) <- trimws( levels( immgen_log2$GeneName ))
merge2 <- merge(sel_clst, immgen_log2)
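For comparison, here is a small self-contained sketch of the alternative suggested in the answer text: coerce the key columns to character, trim them, and then merge. The data frames left and right are made-up miniatures of the two tables, and passing by = "GeneName" explicitly is a choice made here rather than something taken from the original code.

```r
# Made-up miniature tables; the keys in `right` carry a leading space
left  <- data.frame(GeneName = factor(c("Cd74", "Lyz2", "B2m")),
                    Clst.1   = c(1.20, 1.02, 0.639))
right <- data.frame(GeneName = factor(c(" Cd74", " Lyz2", " Cd3e")),
                    Cell.A   = c(15.6, 421, 14.8))

# Padded keys never match, so the merge comes back empty
n_before <- nrow(merge(left, right, by = "GeneName"))

# Coerce the keys to character and strip surrounding whitespace
left$GeneName  <- trimws(as.character(left$GeneName))
right$GeneName <- trimws(as.character(right$GeneName))

# Merge again on the cleaned key
merged <- merge(left, right, by = "GeneName")
n_after <- nrow(merged)
```

Naming the key with by guards against the merge silently using every shared column name, which matters once the two tables share more than one column.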
read.csv <-
function ( ...){ utils::read.csv(..., strip.white=TRUE) }
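The wrapper above masks read.csv for the rest of the session. A less invasive sketch, using the same sample text as the transcript, passes strip.white = TRUE in the call itself; stringsAsFactors = FALSE is added here only so the result is the same across R versions.

```r
# Strip surrounding whitespace from unquoted fields at import time
dat <- utils::read.csv(text = "hd1 , hd2, hd3\n 1, a , c\n1,b,d\n",
                       strip.white = TRUE,
                       stringsAsFactors = FALSE)

dput(dat$hd2)
dput(dat$hd3)
```

With the whitespace removed at import, no trimming step is needed before the merge.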