R: regex pattern matching error when extracting text into two columns of a data frame


Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)
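
Desired output for 'y' (because no ':' is found within the first three sentences):

As with the result for 'y' above, the desired output for 'z' should be: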

  Col1    Col2
  NA      all of the text from 'z'

What I have tried:

resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))

resX

You can try this negative-lookahead regex:

^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$

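Update:

If your condition is satisfied, the regex matches and you get two parts:

Group 1 contains the text before the first ':' and group 2 contains the text after it.

If the condition is not satisfied, copy the whole string into column 2 and put NA in column 1.

The updated sample code contains a function named processData that does this. If the condition is met, it splits the text and puts the parts into Col1 and Col2. If it is not met, as for y and z in your input, it puts NA into Col1 and the whole value into Col2.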
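
The branching described above can also be sketched in base R for a whole vector at once. This is only an illustration under stated assumptions, not the answer's own code: `split_on_colon` is a hypothetical helper name, the pattern is the one shown above, and `regexec(..., perl = TRUE)` (available in recent R versions) stands in for stringr. A real `NA` is used instead of the string "NA".

```r
# Pattern from the answer: the negative lookahead rejects strings in which
# three or more periods occur before the first colon.
patt <- "^(?s)(?!(?:[^:]*?\\.){3,})(.*?):(.*)$"

# Hypothetical helper: returns a two-column data frame for a character vector.
split_on_colon <- function(text) {
  m <- regmatches(text, regexec(patt, text, perl = TRUE))
  matched <- lengths(m) == 3L            # full match + 2 groups when it matches
  col1 <- rep(NA_character_, length(text))
  col2 <- text                           # default: whole string goes to Col2
  col1[matched] <- vapply(m[matched], `[`, character(1), 2L)
  col2[matched] <- vapply(m[matched], `[`, character(1), 3L)
  data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
}

split_on_colon(c("One. : Two", "A. B. C. D : E"))
```

Note that this pattern counts every period, including the one inside `row.names`, and that the captured parts keep their surrounding whitespace; `trimws()` could be applied afterwards if that matters.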


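Brief

I took my inspiration from Rizwan's answer, so you will see that my answer refines his. What I did not like about it is that it breaks on periods that do not start a new sentence (e.g. row.names, although the sample text provided by the OP contains no case where row.names appears 3 times within the first two sentences to showcase this). I also made sure that the capture group/column numbering is exactly what the OP expects and that there is always a match. My answer really is just an improvement on Rizwan's.

Note 1: I am assuming that a "sentence" is defined by a period/dot followed by at least one horizontal whitespace character.

Note 2: This works in the PCRE regex flavour. It is untested in other flavours and may need adapting to work there (i.e. the if/else conditional and the vertical and horizontal whitespace tokens).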


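Code

^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$

Results

Input

The three sample texts x, y and z from the question, each collapsed onto a single line.

Output

Match 1

  • Group 1:
    There is a horror movie running in the iNox theater.
  • Group 2:
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 2

  • Group 1: empty, did not participate in the match
  • Group 2:
    There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 3

  • Group 1: empty, did not participate in the match
  • Group 2:
    There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please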

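Explanation
  • ^
    Assert position at the start of the string
  • (?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)
    • (?(?!...)x|y)
      If statement that uses the negative lookahead
      (?!...)
      as its condition
      • (?:[^:\v]*?\.\h){3,}
        Match the following at least 3 times
      • [^:\v]*?
        Match any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible
      • \.\h
        Match a literal dot followed by a horizontal whitespace character (space or tab)
      • If statement true (the negative lookahead holds, so the group above cannot be matched 3 or more times): match the following
      • ([^:\v]*?)\s*:\s*
        • ([^:\v]*?)
          Capture into group 1: any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible
        • \s*:\s*
          Match any number of whitespace characters, followed by a colon, followed by any number of whitespace characters (note that, depending on what a "sentence" may contain, the
          *
          can be changed to
          +
          to tighten the match)
      • If statement false (the preceding condition is not met): match nothing
  • (.*)
    Capture into group 2: any character (excluding newlines while the
    s
    flag is off) any number of times
  • $
    Assert position at the end of the string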
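
To try this conditional pattern from R, a minimal sketch could look like the following. It assumes single-line input (the pattern excludes vertical whitespace and does not set the `s` flag), and that R's PCRE engine is used via `perl = TRUE`. `split_conditional` and the two demo strings are made up for illustration. It relies on `sub()` substituting an empty string for a capture group that did not participate in the match, which is then turned into `NA`.

```r
# Conditional pattern from the answer, with backslashes doubled for R strings.
patt <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical helper: group 1 becomes Col1 (NA when empty), group 2 becomes Col2.
split_conditional <- function(text) {
  col1 <- sub(patt, "\\1", text, perl = TRUE)
  col2 <- sub(patt, "\\2", text, perl = TRUE)
  col1[!nzchar(col1)] <- NA_character_   # empty group 1 means "no split"
  data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
}

# Made-up single-line inputs: colon after one sentence, colon after three.
demo <- c("One. : Two. Three : end", "One. Two. Three. Four : end")
split_conditional(demo)
```

Because `\s*:\s*` consumes the whitespace around the first colon, neither column needs trimming in this variant.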

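Split the text into sentences, use grep to find the position of the first sentence containing ':', and use that condition to split the original text: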

sp <- strsplit(x, '(?<=\\.)(?=\\s+\\S)', perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] < 3L)
  sub(':\\s+', '$', x) else paste0('$', x)
sp <- gsub('\\v', '', sp, perl = TRUE)

str(read.table(text = sp, sep = '$', col.names = paste0('Col', 1:2), as.is = TRUE))

# 'data.frame': 1 obs. of  2 variables:
#   $ Col1: chr "There is a horror movie running in the iNox theater. "
#   $ Col2: chr "If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names an"| __truncated__

Negative lookaheads are expensive and hard to read. Here is a simpler solution:

library(stringr)

# throw out everything after first :, and count the number of sentences
split = str_count(sub(':.*', '', df$Text), fixed('. ')) < 3

# assemble the required data (you could also avoid ifelse if really needed)
data.frame(col1 = ifelse(split, sub(':.*', '', df$Text), NA),
           col2 = ifelse(split, sub('.*?:', '', df$Text), df$Text))
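
If stringr is not available, the same counting idea can be sketched in base R. This is an assumption-laden variant rather than the answer's code: `split_by_count` is a hypothetical name, `n = 3L` mirrors the threshold above, and the extra `grepl()` guard for strings without any colon is an addition that the original does not have.

```r
# Hypothetical base-R variant: count ". " occurrences before the first colon.
split_by_count <- function(text, n = 3L) {
  before <- sub(":.*", "", text)                     # text before the first colon
  hits <- gregexpr(". ", before, fixed = TRUE)       # literal ". " positions
  n_sent <- lengths(regmatches(before, hits))        # 0 when there is no hit
  ok <- grepl(":", text, fixed = TRUE) & n_sent < n  # split only when both hold
  data.frame(col1 = ifelse(ok, before, NA_character_),
             col2 = ifelse(ok, sub(".*?:", "", text), text),
             stringsAsFactors = FALSE)
}

split_by_count(c("One. : Two. Three : end", "One. Two. Three. Four : end"))
```

As in the stringr version, the pieces keep the whitespace next to the colon, so `trimws()` may be wanted on both columns.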

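Do you mean something like this? resZ

@MadhuSareen I have updated the answer with some sample source, please take a look.

Why does it not work for Y and Z? Both come out as NA and NA. The first column should be NA and t
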
library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)

resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors=FALSE)

   processData <- function(a) {
        patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"    
        if(grepl(patt,a,perl=TRUE))
        {
            result<-str_match(a,patt)    
            col1<-result[2]
            col2<-result[3]
        }
        else
        {
            col1<-"NA"
            col2<-a
        }
       return(c(col1,col2))

    }



for (i in 1:nrow(df)){
tmp <- df[i, ]
resDF[nrow(resDF) + 1, ] <- processData(tmp)
}    


print(resDF)
                                                   Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$
There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
f <- function(text, end_of_sentence = '.', n = 3L, sep = '$') {
  p <- sprintf('(?<=[%s])(?=\\s+\\S)', end_of_sentence)

  sp <- strsplit(text, p, perl = TRUE)[[1L]]
  sp <- if (grep(':', sp)[1L] <= n)
    sub(':\\s+', sep, text) else paste0(sep, text)
  sp <- trimws(gsub('\\v', '', sp, perl = TRUE))

  read.table(text = sp, sep = sep, col.names = paste0('Col', 1:2),
             stringsAsFactors = FALSE)
}

## test
f(x); f(y); f(z)

## vectorize it to work on more than one string
f <- Vectorize(f, SIMPLIFY = FALSE, USE.NAMES = FALSE)

do.call('rbind', f(df$Text))

#   Col1
# 1 There is a horror movie running in the iNox theater. 
# 2                                                  <NA>
# 3                                                  <NA>
#   Col2
# 1 If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please