R: Create a data frame from two different data frames with a specific condition

I want to create one data frame from two different kinds of data frames, subject to a condition, while keeping the extra columns. My first data frame is:

    sample_id      motif    chromosome position
        1         CT-G.A      chr1        7300
        1         TA-C.C      chr1        1000
        1         TC-G.C      chr2        1200
        1         TC-G.C      chr2        3000
        2         CG-A.T      chr2        12898
        2         CA-G.T      chr2        234235

The second data frame is:

    geneID    chromosome   start     end
      E1          chr1      100      10300
      E2          chr1      1100     20122
      E3          chr2      1200     2000
      E4          chr2      400      234236
      E5          chr2      12000    20000

Then I want to create a data frame using the following condition:

if (first$chromosome == second$chromosome & second$start <= first$position & first$position <= second$end)
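To make the condition concrete, here is a minimal base R sketch, assuming the two data frames are named first and second as in the condition above (the helper names pairs and hits are only illustrative). It pairs rows on the same chromosome and keeps those whose position falls inside the gene interval; it does not yet reshape anything.

```r
# Toy data copied from the question
first <- data.frame(
  sample_id  = c(1L, 1L, 1L, 1L, 2L, 2L),
  motif      = c("CT-G.A", "TA-C.C", "TC-G.C", "TC-G.C", "CG-A.T", "CA-G.T"),
  chromosome = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  position   = c(7300L, 1000L, 1200L, 3000L, 12898L, 234235L),
  stringsAsFactors = FALSE
)
second <- data.frame(
  geneID     = c("E1", "E2", "E3", "E4", "E5"),
  chromosome = c("chr1", "chr1", "chr2", "chr2", "chr2"),
  start      = c(100L, 1100L, 1200L, 400L, 12000L),
  end        = c(10300L, 20122L, 2000L, 234236L, 20000L),
  stringsAsFactors = FALSE
)

# Pair every position with every gene on the same chromosome ...
pairs <- merge(first, second, by = "chromosome")

# ... then keep only the pairs whose position lies inside [start, end]
hits <- pairs[pairs$position >= pairs$start & pairs$position <= pairs$end, ]

# Show the matches, ordered by sample and gene
hits[order(hits$sample_id, hits$geneID),
     c("sample_id", "geneID", "motif", "position")]
```

Note that a chained comparison such as start <= position <= end is not valid R, which is why the test is written as two comparisons joined with &.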

This will work. However, you may want to reconsider the column headers if you do it this way.

library(dplyr)
library(tidyr)

df1 %>% inner_join(df2, "chromosome") %>% 
  mutate(geneID_motif = paste(geneID, motif, sep = ","),
         n = if_else(position >= start & position <= end, 1, 0)) %>% 
  select(sample_id, geneID_motif, n) %>%
  group_by(sample_id, geneID_motif) %>% 
  summarise(n = sum(n)) %>%
  spread(key = geneID_motif, value = n, fill = 0)

# A tibble: 2 x 14
# Groups:   sample_id [2]
  sample_id `E1,CT-G.A` `E1,TA-C.C` `E2,CT-G.A` `E2,TA-C.C` `E3,CA-G.T` `E3,CG-A.T` `E3,TC-G.C` `E4,CA-G.T` `E4,CG-A.T` `E4,TC-G.C`
      <int>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1         1        1.00        1.00        1.00           0           0           0        1.00        0           0           2.00
2         2        0           0           0              0           0           0        0           1.00        1.00        0   
# ... with 3 more variables: `E5,CA-G.T` <dbl>, `E5,CG-A.T` <dbl>, `E5,TC-G.C` <dbl>
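In current tidyr, spread() is superseded by pivot_wider(). As a rough sketch of the same pipeline in the newer style, assuming dplyr >= 1.0.0 and tidyr >= 1.1.0 are installed (the name wide is only illustrative, and newer dplyr versions may warn about a many-to-many join here):

```r
library(dplyr)
library(tidyr)

# Same toy data as in this answer
df1 <- data.frame(
  sample_id  = c(1L, 1L, 1L, 1L, 2L, 2L),
  motif      = c("CT-G.A", "TA-C.C", "TC-G.C", "TC-G.C", "CG-A.T", "CA-G.T"),
  chromosome = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  position   = c(7300L, 1000L, 1200L, 3000L, 12898L, 234235L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  geneID     = c("E1", "E2", "E3", "E4", "E5"),
  chromosome = c("chr1", "chr1", "chr2", "chr2", "chr2"),
  start      = c(100L, 1100L, 1200L, 400L, 12000L),
  end        = c(10300L, 20122L, 2000L, 234236L, 20000L),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # pair each position with the genes on the same chromosome
  inner_join(df2, by = "chromosome") %>%
  mutate(geneID_motif = paste(geneID, motif, sep = ","),
         # 1 if the position lies inside the gene, 0 otherwise
         n = as.integer(position >= start & position <= end)) %>%
  group_by(sample_id, geneID_motif) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  # one column per geneID,motif pair; absent combinations become 0
  pivot_wider(names_from = geneID_motif, values_from = n, values_fill = 0L)

wide
```

The result should have the same 14 columns as the tibble above, with integer counts instead of doubles.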

Data:

  df1 <-
  structure(
    list(
      sample_id = c(1L, 1L, 1L, 1L, 2L, 2L),
      motif = c("CT-G.A", "TA-C.C", "TC-G.C", "TC-G.C", "CG-A.T", "CA-G.T"),
      chromosome = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
      position = c(7300L, 1000L, 1200L, 3000L, 12898L, 234235L)
    ),
    .Names = c("sample_id", "motif", "chromosome", "position"),
    class = "data.frame",
    row.names = c(NA,-6L)
  )
df2 <-
  structure(
    list(
      geneID = c("E1", "E2", "E3", "E4", "E5"),
      chromosome = c("chr1", "chr1", "chr2", "chr2", "chr2"),
      start = c(100L, 1100L, 1200L,400L, 12000L),
      end = c(10300L, 20122L, 2000L, 234236L, 20000L)
    ),
    .Names = c("geneID", "chromosome", "start", "end"),
    class = "data.frame",
    row.names = c(NA,-5L)
  )
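Hope this helps!

library(dplyr)
library(tidyr)

df1 %>%
  crossing(df2) %>%
  mutate(geneID_motif = paste(geneID, motif, sep = ","),
         # flag is 1 when the position lies inside the gene on the same chromosome
         flag = ifelse(start <= position & position <= end & chromosome1 == chromosome2, 1, 0)) %>%
  group_by(sample_id, geneID_motif) %>%
  summarise(flag = as.integer(sum(flag))) %>%
  spread(geneID_motif, flag) %>%
  replace(is.na(.), 0) %>%
  data.frame(check.names = FALSE)

Output: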

  sample_id E1,CA-G.T E1,CG-A.T E1,CT-G.A E1,TA-C.C E1,TC-G.C E2,CA-G.T E2,CG-A.T E2,CT-G.A E2,TA-C.C E2,TC-G.C
1         1         0         0         1         1         0         0         0         1         0         0
2         2         0         0         0         0         0         0         0         0         0         0
  E3,CA-G.T E3,CG-A.T E3,CT-G.A E3,TA-C.C E3,TC-G.C E4,CA-G.T E4,CG-A.T E4,CT-G.A E4,TA-C.C E4,TC-G.C E5,CA-G.T
1         0         0         0         0         1         0         0         0         0         2         0
2         0         0         0         0         0         1         1         0         0         0         0
  E5,CG-A.T E5,CT-G.A E5,TA-C.C E5,TC-G.C
1         0         0         0         0
2         1         0         0         0
Sample data:

df1 <- structure(list(sample_id = c(1L, 1L, 1L, 1L, 2L, 2L), motif = c("CT-G.A", 
"TA-C.C", "TC-G.C", "TC-G.C", "CG-A.T", "CA-G.T"), chromosome1 = c("chr1", 
"chr1", "chr2", "chr2", "chr2", "chr2"), position = c(7300L, 
1000L, 1200L, 3000L, 12898L, 234235L)), .Names = c("sample_id", 
"motif", "chromosome1", "position"), class = "data.frame", row.names = c(NA, 
-6L))

df2 <- structure(list(geneID = c("E1", "E2", "E3", "E4", "E5"), chromosome2 = c("chr1", 
"chr1", "chr2", "chr2", "chr2"), start = c(100L, 1100L, 1200L, 
400L, 12000L), end = c(10300L, 20122L, 2000L, 234236L, 20000L
)), .Names = c("geneID", "chromosome2", "start", "end"), class = "data.frame", row.names = c(NA, 
-5L))

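Comments:

inner_join does not produce output in the form of the desired output you posted. The number of columns shown in your sample output can only arise when a "cartesian join" is performed.

When I run it, it shows: Error in make.unique(as.character(rows)) : long vectors not supported yet: unique.c:1575

Can you paste the complete code that produces the above error? FYI, in my answer I renamed the chromosome columns to chromosome1 & chromosome2, but this code will not give an error even if you do not rename the columns (if the column names are the same in both datasets, just adjust the chromosome1 condition in mutate accordingly). BTW, the option stringsAsFactors=F may help when reading the original data into df1 and df2.

I have renamed the chromosome columns in df1 and df2, but it says: Error in make.unique(as.character(rows)) : long vectors not supported yet: unique.c:1575. I think this is because of the length of my data frames.

It seems so. I think this might help if you have not come across it already.

My data creates a big matrix (e.g. 238 * (52000 * 96)). So how can I store it in a .csv file?

@user3585775, write.csv maybe?

@user3585775 Because of the column limit in Excel, you should try write.csv(t(your_matrix), "path")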
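The last suggestion can be tried on a small stand-in before running it on the real data. This is only a sketch: your_matrix here is a made-up 3 x 5 matrix with hypothetical row and column names, and the file goes to a temporary path.

```r
# Small stand-in for the real result: 3 samples x 5 "geneID,motif" columns
your_matrix <- matrix(0L, nrow = 3, ncol = 5,
                      dimnames = list(paste0("sample_", 1:3),
                                      paste0("E", 1:5, ",motif")))
your_matrix["sample_1", 2] <- 2L

# Transpose so that the many geneID,motif combinations become rows
path <- tempfile(fileext = ".csv")
write.csv(t(your_matrix), path)

# Read it back to check the layout: one row per combination, one column per sample
back <- read.csv(path, row.names = 1, check.names = FALSE)
dim(back)
```

write.csv quotes the row names, so the commas inside names such as "E1,motif" do not break the file. For a matrix of the size mentioned above, the transposed file would have 52000 * 96 = 4,992,000 rows, which is still more than Excel's 1,048,576-row limit, so the file would probably need to be read back with R or another tool rather than opened in Excel.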