R: Using spread with duplicate identifiers for rows gives an error


My data looks like this:

df <- read.table(header = T, text =
        "GeneID    Gene_Name   Species    Paralogues    Domains   Functional_Diversity
         1234      DDR1        hsapiens   14            2         8.597482
         5678      CSNK1E      celegans   70            4         8.154788
         9104      FGF1        Chicken    3             0         5.455874
         4575      FGF1        hsapiens   4             6         6.745845")
I have tried using:

library(tidyverse)
df %>% 
    select(Gene_Name, Species, Functional_Diversity) %>% 
    spread(Species, Functional_Diversity)
My actual data has about 130,000 rows (many gene names, roughly 14,000 of them unique) covering 9 species.

When I apply this approach to my actual data, I get:

Error: Duplicate identifiers for rows (16691, 19988), (20938, 21033), (1232, 21150), (2763, 21465), (1911, 20844), (17274, 17657, 18293, 18652, 18726, 19006, 19025), (496, 22555), (17227, 17608, 18211, 18605, 18676, 18967, 19002), (13569, 21807), (10261, 21014, 21607), (20816, 21553), (2244, 22025), (6194, 21910), (12217, 21555), (2936, 21078), (16484, 20911), (12216, 21851), (9289, 21791), (10340, 21752), (1714, 22077), (13216, 22618), (6076, 22371), (14731, 21717), (160, 22472), (11553, 22635), (17183, 17583, 18510, 18608, 18661, 18896, 19108), (138, 20028), (17185, 17584, 18330, 18415, 18500, 18981, 19063), (9726, 22440), (17238, 17617, 18905, 18960, 18996, 19134), (1638, 21645), (4631, 20821), (9162, 22463), (319, 20900), (13600, 22227), (9312, 20011), (14825, 21711, 21764), (3381, 21134), (505, 21133), (5954, 20013), (5948, 21313), (17233, 17612, 18187, 18311, 18411, 18708, 18980), (16953, 20902, 21845), (20710, 22477), (20519, 20973), (10204, 21197, 21213), (2933, 20707), (4302,

To see only the rows that have "duplicate identifiers", you can use

df %>% 
  group_by(Gene_Name, Species) %>% 
  mutate(n = n()) %>% 
  filter(n > 1)
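
With 130,000 rows it may be easier to look at one line per duplicated pair instead of every offending row. A minimal sketch, assuming a hypothetical toy data frame (toy is not from the question) in which FGF1/hsapiens appears twice:

```r
library(dplyr)

# Hypothetical toy data: the FGF1/hsapiens pair is duplicated on purpose
toy <- data.frame(
  Gene_Name = c("DDR1", "FGF1", "FGF1", "FGF1"),
  Species = c("hsapiens", "Chicken", "hsapiens", "hsapiens"),
  Functional_Diversity = c(8.6, 5.5, 6.7, 7.1),
  stringsAsFactors = FALSE
)

# One row per (Gene_Name, Species) pair that occurs more than once,
# with n giving the number of occurrences
dups <- toy %>%
  count(Gene_Name, Species) %>%
  filter(n > 1)

print(dups)
```

The same two lines should apply to the real df as long as it has the Gene_Name and Species columns.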
To make sure spread works even when you have rows with duplicate identifiers, you can add a row-number column so that every row is unique

df %>% 
  select(Gene_Name, Species, Functional_Diversity) %>% 
  mutate(row = row_number()) %>% 
  spread(Species, Functional_Diversity)
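
It may help to see what the row-number approach produces. The sketch below assumes a hypothetical toy data frame (toy and wide_global are illustrative names, not from the question) with one duplicated Gene_Name/Species pair:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the FGF1/hsapiens pair is duplicated on purpose
toy <- data.frame(
  Gene_Name = c("DDR1", "FGF1", "FGF1", "FGF1"),
  Species = c("hsapiens", "Chicken", "hsapiens", "hsapiens"),
  Functional_Diversity = c(8.6, 5.5, 6.7, 7.1),
  stringsAsFactors = FALSE
)

# A global row number makes every (Gene_Name, row) combination unique,
# so spread no longer reports duplicate identifiers
wide_global <- toy %>%
  mutate(row = row_number()) %>%
  spread(Species, Functional_Diversity)

print(wide_global)
```

Because every input row gets its own identifier, the wide result keeps one row per original row and fills the other species columns with NA. Whether that shape is useful depends on what you want to do with the duplicates.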

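This is one of the complications of spread: each cell in the result must hold either one value or none. The solution is to add an index column, but what that column should look like needs some thought. You may also have reused the wrong df, and there is a typo in the Functional_Diversity column name. It should be ddf and Functional_Diversity. After fixing the related/possible duplicates, it works: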
df %>% 
  select(Gene_Name, Species, Functional_Diversity) %>% 
  mutate(row = row_number()) %>% 
  spread(Species, Functional_Diversity)
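
One possible reading of "what the index column should look like" is a counter that restarts within each Gene_Name and Species pair, instead of a global row number. A minimal sketch, assuming a hypothetical toy data frame and an illustrative column name idx (neither comes from the question):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the FGF1/hsapiens pair is duplicated on purpose
toy <- data.frame(
  Gene_Name = c("DDR1", "FGF1", "FGF1", "FGF1"),
  Species = c("hsapiens", "Chicken", "hsapiens", "hsapiens"),
  Functional_Diversity = c(8.6, 5.5, 6.7, 7.1),
  stringsAsFactors = FALSE
)

# idx counts occurrences within each (Gene_Name, Species) pair, so the
# first FGF1/Chicken and the first FGF1/hsapiens value share idx == 1
wide_grouped <- toy %>%
  group_by(Gene_Name, Species) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  spread(Species, Functional_Diversity)

print(wide_grouped)
```

With this index, first occurrences across species line up on the same row and only the extra duplicates spill onto additional rows. Note that pairing values by order of appearance is itself an assumption. If the duplicates should be collapsed instead, they would need to be aggregated before spreading.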