Regex in R to select sentences ending with a new line


r regex


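My understanding is that R uses extended regular expressions or Perl-like regular expressions. I have searched SO and the web for a solution to this regex problem but came up empty:

In R, I have a vector of text files. Each element consists of several paragraphs. I want to extract a few sentences from each element and create a new vector from this subset of text. The sentences I want to extract follow a predictable pattern: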

text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
          "AND \n \n notes: text text/text.\n \n text text \n text",
          "AND \n \n house: text text/text.\n \n text text \n text")

I can get it to work with this pattern:

\w++\w*++[^\uuw][^:\]*++\\\w

but not in R.

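Note that you tested against a string containing a literal \n (a backslash followed by n), and that you used a PCRE regex (\w++ contains a possessive quantifier). To use such a regex with the base R regex functions, you need to pass perl=TRUE.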
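
Both points can be checked in a few lines. This is a small sketch with made-up sample strings (real_nl and literal_bn are not part of the question): a backslash followed by n is two characters, a real newline is one, and the possessive quantifier is used here together with perl = TRUE.

```r
# Hypothetical sample strings, for illustration only.
real_nl    <- "notes: text\nmore"    # contains an actual newline character
literal_bn <- "notes: text\\nmore"   # contains a backslash followed by n

nchar("\n")    # 1: a single newline character
nchar("\\n")   # 2: backslash + n

# fixed = TRUE searches for the newline character itself, no regex involved.
has_nl <- c(grepl("\n", real_nl, fixed = TRUE),
            grepl("\n", literal_bn, fixed = TRUE))
has_nl         # TRUE FALSE

# \w++ uses a possessive quantifier (PCRE syntax), so perl = TRUE is passed.
first_word <- regmatches(real_nl, regexpr("\\w++", real_nl, perl = TRUE))
first_word     # "notes"
```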
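Since you only want the text from a specific string up to a newline, the best pattern is a group of alternatives followed by a negated character class (matching any character but \n) and a newline: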
> text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
+           "AND \n \n notes: text text/text.\n \n text text \n text",
+           "AND \n \n house: text text/text.\n \n text text \n text")
> 
> pat = "(house( notes)?|notes):[^\n]*\n"
> regmatches(text, gregexpr(pat, text))
[[1]]
[1] "house notes: text text/text.\n"

[[2]]
[1] "notes: text text/text.\n"

[[3]]
[1] "house: text text/text.\n"


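Details:

- (house( notes)?|notes) - a group matching house, house notes, or notes
- : - a colon
- [^\n]* - a negated character class matching zero or more characters other than a newline
- \n - a newline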
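
The question asks for a new vector rather than a list, so the result can be flattened. The following is a minimal sketch, assuming each element contains exactly one matching line; the names matched and notes are chosen here for illustration.

```r
text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
          "AND \n \n notes: text text/text.\n \n text text \n text",
          "AND \n \n house: text text/text.\n \n text text \n text")
pat <- "(house( notes)?|notes):[^\n]*\n"

# regexpr() finds only the first match per element, so regmatches()
# returns a plain character vector instead of a list.
matched <- regmatches(text, regexpr(pat, text))

# Drop the trailing newline that the pattern consumes.
notes <- sub("\n$", "", matched)
notes
```

Elements without any match are dropped by regmatches() in this form, so the result can be shorter than the input. If positions must line up with the input, the gregexpr() version above keeps one list entry per element.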
gsub('\n([^\n]+:[^\n]+)\n|.', '\\1', text)
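
The gsub() call above keeps the captured line through \\1 and replaces every other matched character with nothing. Below is a hedged sketch of how it might be applied to the sample vector. The trimws() step and the name kept are additions made here, not part of the original answer; they strip whatever whitespace is left around the captured text.

```r
text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
          "AND \n \n notes: text text/text.\n \n text text \n text",
          "AND \n \n house: text text/text.\n \n text text \n text")

# First alternative: a line containing a colon, enclosed by newlines (captured).
# Second alternative: any other single character, replaced by an empty \\1.
kept <- gsub('\n([^\n]+:[^\n]+)\n|.', '\\1', text)

# The capture starts with the space that follows the newline; trim it away.
kept <- trimws(kept)
kept
```

Unlike the alternation pattern, this version keys on any line that contains a colon, so it assumes no other line in an element has one.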