Regex: How can I extract a specific pattern from character strings very efficiently?


I have a large data set like this:

> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5        
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5   
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5      
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5     
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5        
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
For each row, I want to take the name that follows mature= and the name that follows Gene=, and then join them with

paste(a,b, sep="-")
For example, the expected output for the first two rows is:

hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
So my final implementation looks like this:

for(i in 1:nrow(Data)){
    Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
    Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
    Data[i,4] <- as.numeric(sub("pvalue=","",Name))
    print(i)
}
This works, but it is very slow: the data is huge, with 200,000,000 rows. How can I speed it up?

Here is one approach:

Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5        
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5   
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5      
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5     
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5        
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5"     "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"      "hsa-miR-659-3p-OR4F5"   "hsa-miR-5197-3p-OR4F5"  "hsa-miR-5093-OR4F5"    
# [7] "hsa-miR-650-OR4F5"

Perhaps not more elegant, but you could try:

sapply(Data[,1],function(x){
                   parts<-strsplit(x,"\\|")[[1]]
                   y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
                   return(y)
                })

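If the format is guaranteed to be exactly as specified, a regular expression can capture everything from the equals sign after mature up to the pipe symbol, and everything from Gene= to the end of the string, and join the two with a minus sign: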

sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
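
Because sub is vectorized, the call above can replace the whole for loop from the question rather than run inside it. The sketch below shows this on two sample rows. The layout of the second column is an assumption: the question never shows it, only that its second |-separated field starts with pvalue=. The column names V1, V2, name and pvalue are likewise made up for illustration.

```r
# Two sample rows copied from the question; `Data` here is a small stand-in
# for the real data frame.
Data <- data.frame(
  V1 = c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
         "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),
  # Hypothetical second column: assumed to carry "pvalue=<number>" as its
  # second |-separated field, as the loop in the question implies.
  V2 = c("score=1|pvalue=0.01|x=a", "score=2|pvalue=0.5|x=b"),
  stringsAsFactors = FALSE
)

# One vectorized call processes every row at once.
Data$name <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data$V1)

# Keep only the value of the second |-separated field, without "pvalue=".
Data$pvalue <- as.numeric(sub("^[^|]*\\|pvalue=([^|]*).*$", "\\1", Data$V2))

print(Data[, c("name", "pvalue")])
```

Writing to whole columns also avoids the repeated Data[i, 3] <- ... assignments, which are likely a large part of the loop's cost.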
Another option is to use read.table with = as the separator and then paste two of the resulting columns:

## txt is a character vector holding the raw lines
res = read.table(text=txt,sep='=')
paste(sub('[|].*','',res$V2),              ## drop everything from the first | onward
      gsub('^ +| +$','',res$V4),sep='-')   ## strip leading and trailing spaces

[1] "hsa-miR-5087-OR4F5"     "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"      "hsa-miR-659-3p-OR4F5"  
[5] "hsa-miR-5197-3p-OR4F5"  "hsa-miR-5093-OR4F5"     "hsa-miR-650-OR4F5"   
The simple sub solution already given looks quite good, but just in case, here are some other approaches:

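1) read.pattern. Using read.pattern from the gsubfn package, we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question: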

library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")
The resulting intermediate data frame DF looks like this:

> DF
      V1               V2         V3      V4   V5    V6
1 mature     hsa-miR-5087 mir_Family       - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family  mir-26 Gene OR4F9
3 mature      hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature   hsa-miR-659-3p mir_Family       - Gene OR4F5
5 mature  hsa-miR-5197-3p mir_Family       - Gene OR4F5
6 mature     hsa-miR-5093 mir_Family       - Gene OR4F5
7 mature      hsa-miR-650 mir_Family mir-650 Gene OR4F5
Here is the regular expression we used:

(\w+)=([^|]*)
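1a) names. We can make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement: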

DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))

paste(DF$mature, DF$Gene, sep = "-") # same answer as above
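2) strapplyc

Another approach using the same package. This extracts the fields that follow an = and contain no |, producing a list. We then paste the first and third fields of each element together: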

sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))
This gives the same result.

Here is the regular expression used:

=([^|]*)
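
For readers who prefer not to depend on gsubfn, the same extraction can be sketched in base R with gregexpr and regmatches. This is an illustrative equivalent, not part of the original answer. It still applies a function to each element, so it is not necessarily faster than the single vectorized sub call.

```r
x <- c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
       "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9")

# gregexpr() locates every "=value" run; regmatches() extracts them as a list
# with one character vector per input string.
fields <- regmatches(x, gregexpr("=[^|]*", x))

# Drop the leading "=" from each field, then paste the first and third fields.
result <- vapply(fields, function(f) {
  f <- sub("^=", "", f)
  paste(f[1], f[3], sep = "-")
}, character(1))

print(result)
```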

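We don't have your data, so when asking a question like this it is best to include a sample of it.

Your edit makes this a bit of a moving target: it was not clear at first that you needed a computationally efficient solution. I encourage you to post your own answer to this question showing a benchmark on a reasonably large data set, for example by trying all the answers given here on the first 100,000 rows of your data, in the format the answers use. You could also look at packages designed for fast string processing, as well as the data.table and/or dplyr packages.

With the names assigned as in 1a, DF looks like this:

> DF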
            mature mir_Family  Gene
1     hsa-miR-5087          - OR4F5
2 hsa-miR-26a-1-3p     mir-26 OR4F9
3      hsa-miR-448    mir-448 OR4F5
4   hsa-miR-659-3p          - OR4F5
5  hsa-miR-5197-3p          - OR4F5
6     hsa-miR-5093          - OR4F5
7      hsa-miR-650    mir-650 OR4F5
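
The benchmark suggested in the comments could be sketched as follows, using base R's system.time. The test vector just repeats three sample rows from the question, and the size n is an arbitrary choice to keep the run short. The function names loop_version and vec_version are made up for this illustration. The loop here fills a plain character vector, so it is a milder version of the loop in the question, which assigns into a data frame row by row.

```r
# Build a test vector by repeating sample rows from the question.
rows <- c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
          "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9",
          "mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5")
n <- 30000                       # arbitrary size, chosen to keep the run short
x <- rep(rows, length.out = n)
pattern <- "mature=([^|]*).*Gene=(.*)"

# Row-by-row loop: one sub() call per element.
loop_version <- function(x) {
  out <- character(length(x))
  for (i in seq_along(x)) out[i] <- sub(pattern, "\\1-\\2", x[i])
  out
}

# Single vectorized call over the whole vector.
vec_version <- function(x) sub(pattern, "\\1-\\2", x)

t_loop <- system.time(res_loop <- loop_version(x))
t_vec  <- system.time(res_vec  <- vec_version(x))

# Both versions must agree before the timings mean anything.
stopifnot(identical(res_loop, res_vec))
print(rbind(loop = t_loop[1:3], vectorized = t_vec[1:3]))
```

Timings depend on the machine and R version, so none are shown here. Running the same comparison on a slice of the real data would be more informative.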