R: regex pattern matching error when retrieving text into two columns of a data frame

r regex perl dataframe

Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)

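Desired output for 'y' (because ':' is not found within the first three sentences):

As with the result for 'y' above, the desired output for 'z' should be: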

  Col1    Col2
  NA      all of the text from 'z'

What I tried:

resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))

resX

You can try using this negative lookahead regex:
^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$
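
As a quick way to see what this pattern does, here is a minimal sketch that applies it to two short made-up strings with base R's regexec() and regmatches(). The input strings and the names patt, short and m are illustrative assumptions, not part of the original answer.

```r
# Negative-lookahead pattern from the answer above (PCRE syntax, so perl = TRUE).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical inputs: one dot before the first colon vs. three dots before it.
short <- c("One. : two", "A. B. C. : d")

# regexec() reports the full match plus the capture groups for each string.
m <- regmatches(short, regexec(patt, short, perl = TRUE))

m[[1]]          # full match, then group 1 ("One. ") and group 2 (" two")
length(m[[2]])  # 0: the lookahead rejects the second string, so nothing matches
```

Note that this pattern counts every dot before the first colon, not only dots that end a sentence.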


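Update:
If your condition is met, the regex matches and you get two parts:
group 1 contains the text before the first ":" and group 2 contains the text after it.
If the condition is not met, copy the whole string into column 2 and put whatever you need into column 1.
The updated sample snippet contains a function named processData that does this for you. If the condition is met, it splits the data and puts it into Col1 and Col2. If the input does not meet the condition, as with y and z, it puts NA into Col1 and the whole value into Col2.
Run the sample source:
In brief
I drew my inspiration from Rizwan's answer, so you will see that my answer refines his. What I did not like is that his pattern breaks on dots that do not end a sentence (e.g. row.names,
although the sample text provided by the OP does not include a case where row.names
occurs 3 times within the first two sentences to demonstrate this). I also made sure the numbering of the capture groups/columns is exactly what the OP expects, and that there is always a match. My answer really is a refinement of Rizwan's.
Note 1: I assume a "sentence" is delimited by a period/dot followed by at least one horizontal whitespace character.

Note 2: This works with PCRE regex. It is untested in other regex flavors and may need adapting to work there (i.e. the if/else, vertical whitespace and horizontal whitespace tokens).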
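
Note 1 matters because a bare dot also appears inside identifiers such as row.names. A small sketch of the difference, using a made-up string and a hypothetical helper count_matches (neither is from the original answer):

```r
# Hypothetical helper: count PCRE matches of `pattern` in each element of `x`.
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

s <- "Use row.names here. Then stop."

count_matches("\\.", s)     # 3: also counts the dot inside row.names
count_matches("\\.\\h", s)  # 1: only the dot that is followed by a space
```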

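Code


Results
Input
Output
Match 1

Group 1: There is a horror movie running in the iNox theater.
Group 2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 2

Group 1: empty (no match)
Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 3

Group 1: empty (no match)
Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please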


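Explanation

^
Assert position at the start of the string
(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)

(?(?!...)x|y)
If statement that uses a negative lookahead (?!...) as its condition

(?:[^:\v]*?\.\h){3,}
Match the following at least 3 times
[^:\v]*?
Match any character not in the set (neither a colon nor a vertical whitespace character) any number of times, but as few as possible
\.\h
Match a literal dot character followed by a horizontal whitespace character (space or tab)
If statement true: the condition above is met (fewer than three sentence endings come before the first colon), so do the following
([^:\v]*?)\s*:\s*

([^:\v]*?)
Capture into group 1: any character not in the set (neither a colon nor a vertical whitespace character) any number of times, but as few as possible
\s*:\s*
Match any number of whitespace characters, followed by a colon, followed by any number of whitespace characters (note: if a "sentence" may contain ":", you can change * to + to improve the matching)

If statement false: the condition above is not met, so match nothing


(.*)
Capture into group 2: any character (excluding newlines when the s flag is off) any number of times
$
Assert position at the end of the string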
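
The explanation above can be sketched in R as follows. This is only an illustration under stated assumptions: the two input strings and the names patt, txt, g1, g2 and res are made up for the example, and an empty or missing group 1 is mapped to NA because that is what the question asks for.

```r
# Conditional pattern from this answer; it needs PCRE, hence perl = TRUE.
patt <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical inputs: colon after one sentence vs. after three sentences.
txt <- c("One. : two three", "A. B. C. : d")

m  <- regmatches(txt, regexec(patt, txt, perl = TRUE))
g1 <- vapply(m, `[`, character(1), 2)  # capture group 1
g2 <- vapply(m, `[`, character(1), 3)  # capture group 2

# Treat an empty or missing group 1 as NA, as the question asks for.
res <- data.frame(Col1 = ifelse(is.na(g1) | !nzchar(g1), NA_character_, g1),
                  Col2 = g2,
                  stringsAsFactors = FALSE)
res
```

Because the false branch of the conditional is empty, the pattern always matches, so every input yields a row.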
Split the text into sentences, grep for the position of the first occurrence of ":", and use a condition to split the original text:
sp <- strsplit(x, '(?<=\\.)(?=\\s+\\S)', perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] < 3L)
  sub(':\\s+', '$', x) else paste0('$', x)
sp <- gsub('\\v', '', sp, perl = TRUE)

str(read.table(text = sp, sep = '$', col.names = paste0('Col', 1:2), as.is = TRUE))

# 'data.frame': 1 obs. of  2 variables:
#   $ Col1: chr "There is a horror movie running in the iNox theater. "
#   $ Col2: chr "If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names an"| __truncated__
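
To make the intermediate values visible, here is a small sketch of the same two steps on a short made-up string. The string and the names s, sp and idx are assumptions for illustration only.

```r
# Hypothetical short input: the first colon sits in the third piece.
s <- "One. Two. : three"

# Split after each dot that is followed by whitespace and more text.
# The lookahead requires whitespace, so a dot inside row.names would not split.
sp <- strsplit(s, "(?<=\\.)(?=\\s+\\S)", perl = TRUE)[[1L]]
sp   # pieces keep their leading whitespace

# Index of the first piece that contains a colon.
idx <- grep(":", sp)[1L]
idx  # 3
```

Comparing idx with the allowed number of sentences then decides whether the text is split at the colon or kept whole.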

Negative lookaheads are expensive and hard to read. Here is a simpler solution:
library(stringr)

# throw out everything after first :, and count the number of sentences
split = str_count(sub(':.*', '', df$Text), fixed('. ')) < 3

# assemble the required data (you could also avoid ifelse if really needed)
data.frame(col1 = ifelse(split, sub(':.*', '', df$Text), NA),
           col2 = ifelse(split, sub('.*?:', '', df$Text), df$Text))
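
If you would rather not depend on stringr, the same idea can be sketched in base R. This variant swaps str_count() for gregexpr() with fixed = TRUE. The input vector and the names txt, head_part, n_sent and do_split are assumptions made for the example.

```r
# Hypothetical short inputs standing in for df$Text.
txt <- c("One. : two", "A. B. C. D. : e")

# Text before the first colon.
head_part <- sub(":.*", "", txt)

# Count literal ". " occurrences with base R instead of stringr::str_count().
n_sent   <- lengths(regmatches(head_part, gregexpr(". ", head_part, fixed = TRUE)))
do_split <- n_sent < 3

res <- data.frame(col1 = ifelse(do_split, head_part, NA),
                  col2 = ifelse(do_split, sub(".*?:", "", txt), txt),
                  stringsAsFactors = FALSE)
res
```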

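Do you mean something like this? resZ
@MadhuSareen I have updated the answer with some sample source; you can take a look.
Why doesn't it work for Y and Z??? Both are NA and NA. The first column should be NA and t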
library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)

resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors=FALSE)

   processData <- function(a) {
        patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"    
        if(grepl(patt,a,perl=TRUE))
        {
            result<-str_match(a,patt)    
            col1<-result[2]
            col2<-result[3]
        }
        else
        {
            col1<-"NA"
            col2<-a
        }
       return(c(col1,col2))

    }



for (i in 1:nrow(df)){
tmp <- df[i, ]
resDF[nrow(resDF) + 1, ] <- processData(tmp)
}    


print(resDF)

                                                   Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please

^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$

There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

f <- function(text, end_of_sentence = '.', n = 3L, sep = '$') {
  p <- sprintf('(?<=[%s])(?=\\s+\\S)', end_of_sentence)

  sp <- strsplit(text, p, perl = TRUE)[[1L]]
  sp <- if (grep(':', sp)[1L] <= n)
    sub(':\\s+', sep, text) else paste0(sep, text)
  sp <- trimws(gsub('\\v', '', sp, perl = TRUE))

  read.table(text = sp, sep = sep, col.names = paste0('Col', 1:2),
             stringsAsFactors = FALSE)
}

## test
f(x); f(y); f(z)

## vectorize it to work on more than one string
f <- Vectorize(f, SIMPLIFY = FALSE, USE.NAMES = FALSE)

do.call('rbind', f(df$Text))

#   Col1
# 1 There is a horror movie running in the iNox theater. 
# 2                                                  <NA>
# 3                                                  <NA>
#   Col2
# 1 If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
