R: Getting the type, class, and text from an HTML tag

r regex

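I have a large dataset whose text column contains HTML tags, as shown below. I need to split each tag into its type, class, and text, but I don't know how to do this.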

                                                            text
1 <button type="button" class="btn btn-default">Default</button>
2 <button type="button" class="btn btn-primary">Primary</button>
3 <button type="button" class="btn btn-success">Success</button>

We can use extract from tidyr to capture substrings based on a pattern in the "text" column and split that column into several columns.

library(tidyr)
extract(df1, text, into = c("type", "class", "text"),
         ".*(type=[^ ]+)\\s+([^>]+)>(\\w+).*")
#          type                   class    text
#1 type="button" class="btn btn-default" Default
#2 type="button" class="btn btn-primary" Primary
#3 type="button" class="btn btn-success" Success

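Explanation

.* - any characters (here, the leading "<button ")
(type=[^ ]+) - first capture group: "type=" followed by one or more non-space characters ([^ ]+)
\\s+ - one or more whitespace characters
([^>]+) - second capture group: one or more characters that are not ">"
> - the literal ">" character that closes the opening tag
(\\w+) - third capture group: the word that follows
.* - the remaining characters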
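To see what each capture group picks up without tidyr, the same pattern can be tried in base R. This is only a minimal sketch: the vector x and the names m and groups are made up for illustration, and it assumes every string matches the pattern.

```r
# Two sample strings in the same shape as the "text" column.
x <- c(
  '<button type="button" class="btn btn-default">Default</button>',
  '<button type="button" class="btn btn-primary">Primary</button>'
)
pattern <- ".*(type=[^ ]+)\\s+([^>]+)>(\\w+).*"

# regexec() locates the full match and each capture group;
# regmatches() returns them as one character vector per input string.
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (element 1) and keep the three groups.
# Assumption: every string matched, so each element has length 4.
groups <- t(vapply(m, function(parts) parts[-1], character(3)))
colnames(groups) <- c("type", "class", "text")
print(groups)
```

Each row of groups holds the three pieces that extract would place in the type, class, and text columns.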

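Parsing XML with regular expressions is generally discouraged, but if you need to, {unglue} offers an intuitive solution (using @akrun's data):

library(unglue)
unglue_unnest(df1, text, "<button {type} {class}>{text}</button>")
#>            type                   class    text
#> 1 type="button" class="btn btn-default" Default
#> 2 type="button" class="btn btn-primary" Primary
#> 3 type="button" class="btn btn-success" Success
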
Comments:

Can you explain this part to me: ([^>]+)? It is new to me.
@user14090295 Thanks. I added an explanation. Hope it helps.
It works. But the second group is still not clear enough to me.
@user14090295 It is based on the pattern you showed, i.e. within the second group there is no ">" character, so that is what is used to identify and capture the group.

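A small base R sketch may make the ([^>]+) question in the comments more concrete. The variable names tag, negated, and greedy are made up for illustration; the sketch only contrasts a negated character class with a greedy ".+".

```r
tag <- '<button type="button" class="btn btn-default">Default</button>'

# [^>]+ matches a run of characters that are not ">", so the match
# stops right before the ">" that closes the opening tag.
negated <- regmatches(tag, regexpr("class=[^>]+", tag))

# For contrast, .+ is greedy and runs through the later ">" characters too.
greedy <- regmatches(tag, regexpr("class=.+", tag))

print(negated)
print(greedy)
```

The first result contains only the class attribute, while the second also includes the button text and the closing tag, which is why the answer uses [^>]+ for the second group.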
Data

df1 <- structure(list(text = c("<button type=\"button\" class=\"btn btn-default\">Default</button>", 
"<button type=\"button\" class=\"btn btn-primary\">Primary</button>", 
"<button type=\"button\" class=\"btn btn-success\">Success</button>"
)), class = "data.frame", row.names = c("1", "2", "3"))