R: How to extract column names based on values and get counts in an output column

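I have a question about data frame manipulation in R: how to extract column names based on cell values, collect them as comma-separated values in an output column, and get their count.

I have an input file that contains genes in column A and literature IDs in the other columns. A sample of the input file is shown below. What I want is to collect, in an Output column, all literature IDs whose value is 1, and to count the number of IDs in a Counts column. A sample of the output file is shown below. After this, I will merge this output data frame with a list of genes of interest using the merge function. Please help me do this.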

Input_data <- read.csv(file = "./Input.csv", stringsAsFactors = FALSE, check.names = FALSE)
Output_data <- read.csv(file = "./Output.csv", stringsAsFactors = FALSE, check.names = FALSE)
Genes <- read.csv(file = "./Genes.csv", stringsAsFactors = FALSE, check.names = FALSE)

Merge_data <- merge(Output_data, Genes, by = "Genes")
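
A side note on the merge step: with its default arguments, merge() performs an inner join, so genes that appear only in the gene list would be dropped from the result. The following is a minimal sketch with made-up toy frames (out_toy and genes_toy are hypothetical names, not part of the question) showing how all.y = TRUE keeps every gene from the list:

```r
# Toy stand-ins for Output_data and Genes (hypothetical values)
out_toy <- data.frame(Genes = c("Gene_A", "Gene_B"),
                      Counts = c(5L, 7L),
                      stringsAsFactors = FALSE)
genes_toy <- data.frame(Genes = c("Gene_A", "Gene_B", "Gene_N"),
                        stringsAsFactors = FALSE)

# Default behaviour: inner join, so Gene_N is dropped
inner <- merge(out_toy, genes_toy, by = "Genes")

# all.y = TRUE keeps every row of genes_toy; unmatched Counts become NA
kept <- merge(out_toy, genes_toy, by = "Genes", all.y = TRUE)
```

Whether the inner join or the all.y variant is appropriate depends on whether genes without any literature ID should appear in the final table.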


Input_data

dput(Input_data)
structure(list(Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", 
"Gene_E", "Gene_F", "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", 
"Gene_L", "Gene_M"), `20706538` = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 
1L, 0L, 0L, 0L, 0L, 0L), `14557386` = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), `22999554` = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), `21906313` = c(1L, 1L, 1L, 1L, 
0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L), `25229268` = c(1L, 1L, 1L, 
0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), `22633082` = c(0L, 1L, 
1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), `19228761` = c(1L, 
1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), `19543402` = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), `26955776` = c(1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), `21126355` = c(1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, 
-13L))


Output_data

dput(Output_data)
structure(list(Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", 
"Gene_E", "Gene_F", "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", 
"Gene_L", "Gene_M"), Output = c("21906313, 25229268, 19228761, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355", 
"20706538, 21906313, 22633082, 19228761, 26955776, 21126355", 
"", "20706538, 21906313, 25229268, 22633082, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355", 
"20706538, 21906313, 25229268, 22633082, 26955776, 21126355", 
"", "", "", "", "21906313, 21126355"), Counts = c(5L, 7L, 7L, 
6L, 0L, 6L, 7L, 6L, 0L, 0L, 0L, 0L, 2L)), class = "data.frame", row.names = c(NA, 
-13L))

Genes
dput(Genes)
structure(list(Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", 
"Gene_E", "Gene_F", "Gene_G", "Gene_H", "Gene_I", "Gene_J", "Gene_K", 
"Gene_L", "Gene_M", "Gene_N", "Gene_O", "Gene_P", "Gene_Q", "Gene_R", 
"Gene_S", "Gene_T", "Gene_U", "Gene_V", "Gene_W")), class = "data.frame", row.names = c(NA, 
-23L))

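Here is a possible solution using the tidyr and dplyr packages.

Basically, we first bring your data into a workable form, that is, a long format that is easier to handle, using the pivot_longer function; we then apply fairly standard dplyr verbs to create the desired output. If you are not familiar with them, I suggest running the pipeline one step at a time to understand what each step does.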

library(tidyr)
library(dplyr)

Input_data %>% 
  pivot_longer(-Genes, names_to = "num", values_to = "value") %>%
  group_by(Genes) %>% 
  mutate(
    Output = paste(num[value == 1], collapse = ", "),
    Counts = sum(value == 1)
    ) %>% 
  select(-c(num, value)) %>% 
  distinct() %>% 
  right_join(Genes, by = "Genes")
Output

# A tibble: 23 x 3
# Groups:   Genes [23]
#    Genes  Output                                                                 Counts
#    <chr>  <chr>                                                                  <int>
#  1 Gene_A "21906313, 25229268, 19228761, 26955776, 21126355"                         5
#  2 Gene_B "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"     7
#  3 Gene_C "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"     7
#  4 Gene_D "20706538, 21906313, 22633082, 19228761, 26955776, 21126355"               6
#  5 Gene_E ""                                                                         0
#  6 Gene_F "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"               6
#  7 Gene_G "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"     7
#  8 Gene_H "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"               6
#  9 Gene_I ""                                                                         0
# 10 Gene_J ""                                                                         0
# ... with 13 more rows
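
To follow the advice of running the pipeline one step at a time, the sketch below splits the same verbs into named intermediate results. It uses a tiny toy data frame with made-up IDs (toy, long, per_gene and result are hypothetical names) so that each intermediate table is small enough to inspect by eye:

```r
library(tidyr)
library(dplyr)

# Toy input with the same layout as Input_data (hypothetical values)
toy <- data.frame(
  Genes = c("Gene_A", "Gene_B"),
  `111` = c(1L, 0L),
  `222` = c(1L, 1L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Step 1: reshape to long format, one row per gene/ID pair
long <- toy %>%
  pivot_longer(-Genes, names_to = "num", values_to = "value")

# Step 2: per gene, paste the IDs flagged with 1 and count them
per_gene <- long %>%
  group_by(Genes) %>%
  mutate(Output = paste(num[value == 1], collapse = ", "),
         Counts = sum(value == 1)) %>%
  ungroup()

# Step 3: drop the helper columns and collapse the repeated rows
result <- per_gene %>%
  select(-c(num, value)) %>%
  distinct()
```

Printing long, per_gene and result in turn shows why distinct() is needed here: mutate() keeps one row per gene/ID pair, so each gene's Output and Counts are repeated until the duplicates are removed.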

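Your data is in wide format, meaning that one row/observation holds multiple values. It is easier to work with data in long format, where each row holds a single value.

My solution is very similar to @Ric S's; I use summarise instead of mutate, which suits cases like this where you want exactly one entry per level of the grouping variable: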

Input_data <- structure(list(
  Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E", "Gene_F", "Gene_G",
            "Gene_H", "Gene_I", "Gene_J", "Gene_K", "Gene_L", "Gene_M"),
  `20706538` = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
  `14557386` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  `22999554` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  `21906313` = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L),
  `25229268` = c(1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
  `22633082` = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
  `19228761` = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
  `19543402` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  `26955776` = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
  `21126355` = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
), class = "data.frame", row.names = c(NA, -13L))

Genes <- structure(list(
  Genes = c("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E", "Gene_F", "Gene_G",
            "Gene_H", "Gene_I", "Gene_J", "Gene_K", "Gene_L", "Gene_M", "Gene_N",
            "Gene_O", "Gene_P", "Gene_Q", "Gene_R", "Gene_S", "Gene_T", "Gene_U",
            "Gene_V", "Gene_W")
), class = "data.frame", row.names = c(NA, -23L))

library(dplyr)
library(tidyr)

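summary_data <- Input_data %>% 
  # long format: one row per gene/literature ID pair
  pivot_longer(-Genes, values_to = "is_contained", names_to = "literature_id") %>% 
  group_by(Genes) %>% 
  # keep only the IDs flagged with 1
  filter(is_contained == 1) %>% 
  summarise(Output = paste0(literature_id, collapse = ", "),
            Counts = n()) %>% 
  # bring back every gene of interest, including those with no ID
  right_join(Genes, by = "Genes") %>% 
  # genes without any ID come back as NA: use "" and 0 instead
  mutate(Output = if_else(is.na(Output),
                          "",
                          Output),
         Counts = if_else(is.na(Counts),
                          0L,
                          Counts))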

summary_data
# A tibble: 23 x 3
   Genes  Output                                                                 Counts
   <chr>  <chr>                                                                   <int>
 1 Gene_A "21906313, 25229268, 19228761, 26955776, 21126355"                          5
 2 Gene_B "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"      7
 3 Gene_C "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"      7
 4 Gene_D "20706538, 21906313, 22633082, 19228761, 26955776, 21126355"                6
 5 Gene_E ""                                                                          0
 6 Gene_F "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"                6
 7 Gene_G "20706538, 21906313, 25229268, 22633082, 19228761, 26955776, 21126355"      7
 8 Gene_H "20706538, 21906313, 25229268, 22633082, 26955776, 21126355"                6
 9 Gene_I ""                                                                          0
10 Gene_J ""                                                                          0
# ... with 13 more rows

Using data.table:
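
The code that belonged to this answer is not shown above. As a rough illustration only, and not the original answerer's code, a data.table approach could follow the same reshape-then-aggregate idea. The sketch assumes a toy table with made-up IDs (dt, long and res are hypothetical names):

```r
library(data.table)

# Toy stand-in for Input_data (hypothetical values, same layout)
dt <- data.table(
  Genes = c("Gene_A", "Gene_B", "Gene_C"),
  `111` = c(1L, 0L, 0L),
  `222` = c(1L, 1L, 0L)
)

# Reshape to long format: one row per gene/literature ID pair
long <- melt(dt, id.vars = "Genes",
             variable.name = "literature_id",
             value.name = "is_contained",
             variable.factor = FALSE)

# Per gene: paste the IDs flagged with 1 and count them
res <- long[, .(Output = paste(literature_id[is_contained == 1], collapse = ", "),
                Counts = sum(is_contained == 1)),
            by = Genes]
```

Because the aggregation runs over all rows rather than a filtered subset, genes with no flagged ID (Gene_C here) keep a row with an empty Output and a count of 0.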

Comments:

With your solution I better understand the advantage of using summarise instead of mutate in this case, thanks @starja, +1.

@starja, thank you for your solution, it is very helpful.

@starja, this was helpful for my large-scale data. I found that there are hundreds of duplicated genes with the same information. How can I extract only the unique genes in my column rather than the duplicates? Thank you, Toufi

Try summary_data %>% distinct(Genes, .keep_all = TRUE)
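
The distinct() suggestion from the comments can be tried on a small example. This is a minimal sketch with made-up rows (dup and unique_genes are hypothetical names); it assumes the duplicated genes carry identical information, as described in the comment:

```r
library(dplyr)

# Toy table with a duplicated gene (hypothetical values)
dup <- data.frame(
  Genes = c("Gene_A", "Gene_A", "Gene_B"),
  Counts = c(5L, 5L, 7L),
  stringsAsFactors = FALSE
)

# Keep the first row for each gene; .keep_all = TRUE retains the other columns
unique_genes <- dup %>%
  distinct(Genes, .keep_all = TRUE)
```

If duplicated genes could carry different values, distinct() would silently keep only the first row per gene, so it may be worth checking for conflicts before relying on it.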