Extracting a certain number of words or special characters after a string in R

I am trying to extract a certain number of words after a specific string.

library(stringr)

x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))

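It does not matter whether / and - are counted as words, because I can always raise the quantifier (in my case I don't mind extracting more words than I need).


Thanks.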

We can use a range quantifier, {4,8}:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4,8}'))
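The question is: is there a regular expression that includes special characters (or skips over them), so that I can still extract the desired words? I noticed that the same thing happens with other characters (such as -) and with double spaces.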

Side note: this is a kind of XY problem.

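Your problem is not that the regex does not work. The regex works; you are just expecting something different. The pattern selects the 8 words that follow a specific string, but in the first row only 6 words precede the non-word character /, so that row does not match your pattern.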
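The mismatch described above can be checked directly. The following is a minimal sketch, assuming the attempted pattern used a fixed {8} quantifier as this answer describes; it uses base R with perl = TRUE instead of stringr, and the variable names are illustrative only.

```r
# First row of the example data from the question.
s <- "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;"

# Fixed count of 8 "word + optional comma + whitespace" units.
pat8 <- "(?<=source:\\s)(\\w+,?\\s){8}"
# Same pattern, but asking for only 6 units.
pat6 <- "(?<=source:\\s)(\\w+,?\\s){6}"

# The lookbehind needs the PCRE engine, hence perl = TRUE.
has8 <- grepl(pat8, s, perl = TRUE)
six  <- trimws(regmatches(s, regexpr(pat6, s, perl = TRUE)))

print(has8)  # FALSE: "/" is not matched by \w, so 8 units cannot be found
print(six)   # "from animal origin as Vitamin A"
```

Only six units fit before the slash, which is why the whole row comes back empty when eight are required.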

So, to get a real answer, you should first rework your question:

What exactly do you expect?


akrun's solution matches any run of 4 to 8 words, but I doubt that this is what you really need.

Here is one approach using regmatches + gsub:

lapply(regmatches(u <- gsub(".*?source:\\s+?","",x$end),gregexpr("\\w+",u)),`[`,1:4)
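The one-liner above packs three steps together. An unpacked sketch, with illustrative variable names that are not part of the original answer, may make them easier to follow. It swaps in sub() with perl = TRUE and head() in place of `[`; head() returns fewer items instead of padding with NA when a string has fewer than four words.

```r
end <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Step 1: drop everything up to and including "source:" and the whitespace after it.
rest <- sub("^.*?source:\\s+", "", end, perl = TRUE)

# Step 2: collect every run of word characters; "/" and "-" act as separators.
words <- regmatches(rest, gregexpr("\\w+", rest))

# Step 3: keep at most the first four words of each string.
first4 <- lapply(words, head, 4)
print(first4)
```

Because only \w+ runs are collected, punctuation never blocks the extraction, but a token such as all-trans-Retinol is split into three separate words.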

You can rely on the \S shorthand character class, which matches any non-whitespace character:

(?<=source:\s)\S+(?:\s+\S+){3,7}\b
Output:

[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks" 
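To see what the trailing \b contributes, the sketch below runs the pattern with and without it on the second example string. It uses base R's regexpr() with perl = TRUE rather than stringr, which is a choice made for this sketch, and grab() is a hypothetical helper.

```r
s <- "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake"

# Hypothetical helper: return the first match of a PCRE pattern in s.
grab <- function(pat) regmatches(s, regexpr(pat, s, perl = TRUE))

# With \b the match must end at a word boundary, so trailing punctuation is dropped.
with_b <- grab("(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")

# Without \b the last \S+ keeps whatever non-whitespace follows the word.
without_b <- grab("(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}")

print(with_b)     # "Eggs, liver, certain fish species such as sardines"
print(without_b)  # "Eggs, liver, certain fish species such as sardines,"
```

The quantifier {3,7} is greedy, so up to eight whitespace-separated tokens are taken; the \b only trims the end of the last one.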

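Maybe you need "(?

Thanks. Yes, I should have added the expected result to clarify my question (added now). I am trying to skip non-word characters (if a regex can do that), or to extract the special characters and keep extracting what follows. For example, \\w covers any alphabetic character; is there another regex token that covers any alphabetic or special character?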
pat <- sprintf('(?<=source:\\s)(\\w+,?\\s){%d}', c(8, 4))
library(dplyr)
do.call(coalesce, lapply(pat, function(y) trimws(stringr::str_extract(x$end, y))))
#[1] "from animal origin as"     
#[2] "Eggs, liver, certain fish species such as sardines,"
#[3] "Leafy green vegetables such"  
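The same "try 8 words, then fall back to 4" idea can be sketched without dplyr. This is an illustrative base R variant, not the answer's own code; extract_n() is a hypothetical helper, and a plain character vector stands in for x$end.

```r
end <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Hypothetical helper: extract exactly n "word + optional comma + whitespace"
# units after "source: ", returning NA where the pattern does not match.
extract_n <- function(text, n) {
  pat <- sprintf("(?<=source:\\s)(\\w+,?\\s){%d}", n)
  m <- regexpr(pat, text, perl = TRUE)
  hit <- as.integer(m) != -1L
  out <- rep(NA_character_, length(text))
  out[hit] <- trimws(regmatches(text, m))
  out
}

res8 <- extract_n(end, 8)
res4 <- extract_n(end, 4)

# Prefer the 8-word match and fall back to the 4-word match.
result <- ifelse(is.na(res8), res4, res8)
print(result)
```

On the three example strings this gives the same values as the coalesce() call above: only the second row has eight consecutive matching units, so the other two fall back to four words.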
lapply(regmatches(u <- gsub(".*?source:\\s+?","",x$end),gregexpr("\\w+",u)),`[`,1:4)
[[1]]
[1] "from"   "animal" "origin" "as"

[[2]]
[1] "Eggs"    "liver"   "certain" "fish"

[[3]]
[1] "Leafy"      "green"      "vegetables" "such"
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / alltrans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")
[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks"