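Optimizing R code to write records when a substring is present in a string
I have:
A data frame test_var, containing ODB6_OG_ID, Gene_ID and a few other columns.
A data frame arranged_data, with architecture_number, Details and Sequence. (The Sequence data is not relevant.)
I want to know whether arranged_data$Details contains any of the values in test_var$ODB6_OG_ID, that is, whether a string in arranged_data$Details has any value of test_var$ODB6_OG_ID as a substring. If it does, the test_var$Gene_ID for the corresponding ODB6_OG_ID should be appended to a file.
I have to do this for every architecture number.
test_var looks like: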
ODB6_OG_ID start Gene_ID
EOG60024F chrXR_group6 FBgn0247618
EOG60024H chr4_group3 FBgn0070413
EOG60024K chr2 FBgn0078093
EOG60024M chr2 FBgn0243975
EOG60024V chr4_group5 FBgn0247694
EOG60025C chrXL_group1a FBgn0247949
EOG60025F chr3 FBgn0245234
EOG602XCD chr4_group3 FBgn0080574
EOG602XCQ chr4_group3 FBgn0078791
arranged_data contains:
architecture_number Details
1 chr317678741767875EOG6HQF5814.8092+47
1 chr325176942517695EOG6NKCGX23.1869-87
1 chr391494069149407EOG6NZVDZ2.96183+105
1 chr246642624664263EOG6Z638J1.52323+138
1 chr4_group3231407231408EOG6QRHQP4.65431-721
1 chr311648221164823EOG6X3HNJ2.28484+96
1 chr333466933346694EOG66WZW582.1698+678
1 chrXR_group854636745463675EOG6XH0KP1.86172+57
1 chr283746518374652EOG6V17MG2.45409-68
1 chr31338293913382940EOG63XVQR1.60785+105
Desired output:
FBgn0247618
FBgn0070413
FBgn0078093 etc.
(These do not correspond to the sample data shown.)
Other information:
OS: Ubuntu Xenial Xerus 16.04
R version: 3.3.0
RStudio version: 0.99.902
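# Here df is test_var and df1 is arranged_data.
# Reduce each start value to its chromosome prefix (the text before the first underscore).
df$new <- gsub('_.*', '', df$start)
# Do the same for Details; without an underscore, keep the first four characters (e.g. "chr3").
df1$new <- ifelse(grepl('_', df1$Details), gsub('_.*', '', df1$Details),
                  substring(df1$Details, 1, 4))
# Take the Gene_ID of the first row with the same prefix; match() is already vectorised.
df1$Gene_ID <- df$Gene_ID[match(df1$new, df$new)]
df1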
# architecture_number Details new Gene_ID
#1 1 chr317678741767875EOG6HQF5814.8092+47 chr3 FBgn0245234
#2 1 chr325176942517695EOG6NKCGX23.1869-87 chr3 FBgn0245234
#3 1 chr391494069149407EOG6NZVDZ2.96183+105 chr3 FBgn0245234
#4 1 chr246642624664263EOG6Z638J1.52323+138 chr2 FBgn0078093
#5 1 chr4_group3231407231408EOG6QRHQP4.65431-721 chr4 FBgn0070413
#6 1 chr311648221164823EOG6X3HNJ2.28484+96 chr3 FBgn0245234
#7 1 chr333466933346694EOG66WZW582.1698+678 chr3 FBgn0245234
#8 1 chrXR_group854636745463675EOG6XH0KP1.86172+57 chrXR FBgn0247618
#9 1 chr283746518374652EOG6V17MG2.45409-68 chr2 FBgn0078093
#10 1 chr31338293913382940EOG63XVQR1.60785+105 chr3 FBgn0245234
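The code above pairs rows by chromosome prefix. The question instead asks whether an ODB6_OG_ID occurs inside Details, and a minimal sketch of that variant follows. It assumes every ID is "EOG" followed by exactly six letters or digits, as in the sample tables. The last Details row is invented so that one ID matches, and og_id and id_pattern are illustrative names.

```r
# Toy data modelled on the tables in the question.
test_var <- data.frame(
  ODB6_OG_ID = c("EOG60024F", "EOG60024H", "EOG60024K"),
  Gene_ID    = c("FBgn0247618", "FBgn0070413", "FBgn0078093"),
  stringsAsFactors = FALSE
)
arranged_data <- data.frame(
  architecture_number = c(1, 1, 2),
  Details = c("chr317678741767875EOG6HQF5814.8092+47",
              "chr246642624664263EOG6Z638J1.52323+138",
              "chr2100200EOG60024F1.5+10"),  # hypothetical row, not from the question
  stringsAsFactors = FALSE
)

# Assumption: an ID is "EOG" plus exactly six upper-case letters or digits.
id_pattern <- "EOG[0-9A-Z]{6}"

# regexpr() gives the position of the first match per string, or -1 if none.
hit <- regexpr(id_pattern, arranged_data$Details)
arranged_data$og_id <- NA_character_
arranged_data$og_id[hit > 0] <- regmatches(arranged_data$Details, hit)

# One vectorised lookup replaces a loop over IDs; NA means "no such ID in test_var".
arranged_data$Gene_ID <-
  test_var$Gene_ID[match(arranged_data$og_id, test_var$ODB6_OG_ID)]
```

If the IDs do not have a fixed length, this extraction will not work as written. Calling grepl(id, arranged_data$Details, fixed = TRUE) once per ID is a slower fallback that needs no format assumption.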
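Please provide one. Also, keeping the data in a data frame and writing it to disk afterwards is probably faster than writing to disk on every iteration.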
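As a rough sketch of that suggestion, the matched gene IDs can be collected first and then written with one call per architecture number. The matched data frame, the temporary output directory and the file names are all placeholders, not part of the original question.

```r
# Hypothetical result of the matching step: one Gene_ID (or NA) per Details row.
matched <- data.frame(
  architecture_number = c(1, 1, 2, 2),
  Gene_ID = c("FBgn0247618", NA, "FBgn0070413", "FBgn0078093"),
  stringsAsFactors = FALSE
)

# Drop rows without a match, then group the gene IDs by architecture number.
matched <- matched[!is.na(matched$Gene_ID), ]
by_arch <- split(matched$Gene_ID, matched$architecture_number)

# One write per architecture number instead of one append per matching row.
out_dir <- tempdir()  # placeholder location
for (arch in names(by_arch)) {
  out_file <- file.path(out_dir, paste0("architecture_", arch, ".txt"))
  writeLines(by_arch[[arch]], out_file)
}
```

Each file is opened once, so the cost of disk access grows with the number of architecture numbers, not with the number of matching rows.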