Splitting mixed strings into columns in R
I'm new to text-mining analysis and R coding. I have 200 mixed gene names. I want to separate them: plain words, such as cadherins or orphan receptors, go in one column, and numbers such as 2/3 and number+letter tokens such as 7D or 7TM go in another column. I used strsplit to split the words. Any suggestions on how to parse them would be helpful.
Example:
Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs RNA28S", "45S pre-ribosomal RNAs RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")
Expected result (2nd and 3rd columns):
7D cadherins                       cadherins                      7D
7TM orphan receptors               orphan receptors               7TM
18S ribosomal RNAs RNA18S          ribosomal RNAs RNA18S          18S RNA18S
28S ribosomal RNAs RNA28S          ribosomal RNAs RNA28S          28S RNA28S
45S pre-ribosomal RNAs RNA45S      pre-ribosomal RNAs             45S RNA45S
5.8S ribosomal RNAs                ribosomal RNAs                 5.8S
Actin related protein 2/3 complex  Actin related protein complex  2/3
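Use strsplit to split the names, grep to detect the words with or without digits, and paste to collapse the words back together. Put everything in a function to avoid repetition: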
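wordS <- function(x, invert = TRUE) {
  # invert = TRUE keeps words without digits; invert = FALSE keeps words containing a digit
  clean <- gsub('[[:space:]]+', ' ', x)   # collapse repeated whitespace into single spaces
  split <- strsplit(clean, ' ')           # one character vector of words per gene name
  detec <- lapply(split, function(y) grep('[0-9]', y, invert = invert, value = TRUE))
  words <- sapply(detec, paste, collapse = ' ')   # glue the kept words back together
  return(words)
}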
data.frame(
  Gene    = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)
                               Gene                       column2    column3
1                      7D cadherins                     cadherins         7D
2              7TM orphan receptors              orphan receptors        7TM
3       7TM orphan receptors RNA18S              orphan receptors 7TM RNA18S
4         28S ribosomal RNAs RNA28S                ribosomal RNAs 28S RNA28S
5     45S pre-ribosomal RNAs RNA45S            pre-ribosomal RNAs 45S RNA45S
6               5.8S ribosomal RNAs                ribosomal RNAs       5.8S
7 Actin related protein 2/3 complex Actin related protein complex        2/3
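To make the three steps described above (split, detect, collapse) easier to follow, here is a minimal trace on a single gene name. It is only a sketch of the same idea, not a replacement for the function: the variable names and the extra double space in the input are made up for illustration, and grepl is used in place of grep so that one logical vector can serve both columns.

```r
# Trace the split / detect / collapse steps on one gene name.
# The double space after "7TM" is added here only for illustration.
x <- "7TM  orphan receptors RNA18S"

clean     <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
tokens    <- strsplit(clean, " ")[[1]]      # split into individual words
has_digit <- grepl("[0-9]", tokens)         # TRUE for words containing a digit

words_only  <- paste(tokens[!has_digit], collapse = " ")  # would go in column2
with_digits <- paste(tokens[has_digit],  collapse = " ")  # would go in column3

print(tokens)
print(words_only)
print(with_digits)
```

One thing to watch for: a gene name with leading whitespace would produce an empty first token after the split, which ends up as a stray leading space in the word column. Wrapping the input in trimws() before splitting is one possible way around that.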
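Please show the expected output.
See whether this helps.
@snoram: I have now edited the data to include the expected output.
@Rui Barradas: Thanks for the link, but that is not the kind of split I am after.
Hi, try: str_extract_all(Genes, '[A-Za-z]*[0-9]+[A-Za-z]*')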
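The pattern in the last comment can also be tried without extra packages. The sketch below assumes the comment refers to stringr's str_extract_all and swaps in base R's regmatches with gregexpr, which extracts all matches in a similar way; the shortened Genes vector and the variable names are chosen here for illustration.

```r
# Base R sketch of the regex suggested in the comment.
# Assumption: regmatches() + gregexpr() stand in for stringr::str_extract_all().
Genes <- c("7D cadherins", "7TM orphan receptors RNA18S",
           "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")

pattern <- "[A-Za-z]*[0-9]+[A-Za-z]*"                # letters, digits, letters
hits    <- regmatches(Genes, gregexpr(pattern, Genes))  # list of matches per gene
column3 <- sapply(hits, paste, collapse = " ")       # one string per gene

print(column3)
```

Because the pattern only allows letters and digits, tokens such as 5.8S and 2/3 are broken at the dot and the slash ("5 8S" and "2 3"), so this approach seems to need a wider character class before it reproduces the expected third column. The whitespace-based split in the answer keeps those tokens intact.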