How to clean and split HTML tags in R?

Tags: html, r, split

My parser creates a data frame like this:

    name          html
 1  John         <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
 2 Steve         <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>

Regex is possible, but I prefer rvest.

This would be easier with data.table or dplyr, but let's do it in base R (on the small chance that those are all new concepts).

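# example data
df <- ... %>% html_attrs)
# rbind to get a data.frame with all attributes
final
   name html         class
1  John   68 incident-icon
2 Steve   69 incident-icon
  data.minute data.second data.id
1          68          37    8028
2          69           4  132205

Let's remove html so it looks a bit nicer in the viewer: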

> final$html <- NULL
> final
   name         class data.minute data.second data.id
1  John incident-icon          68          37    8028
2 Steve incident-icon          69           4  132205
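
The regex route mentioned at the top of this answer can be sketched in base R alone. This is only a minimal sketch under stated assumptions, not the rvest code from the answer: the data frame is rebuilt from the sample in the question, the helper name extract_attrs is hypothetical, and the pattern assumes every attribute of interest has the form data-<name>="<value>" with double quotes.

```r
# Example data, rebuilt from the data frame shown in the question
df <- data.frame(
  name = c("John", "Steve"),
  html = c(
    '<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>',
    '<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the data-* attributes of one HTML string
# as a named list (names without the "data-" prefix)
extract_attrs <- function(x) {
  # all substrings of the form data-name="value"
  m <- regmatches(x, gregexpr('data-[a-z]+="[^"]*"', x))[[1]]
  keys <- sub('^data-([a-z]+)=.*$', '\\1', m)
  vals <- sub('^[^"]*"([^"]*)"$', '\\1', m)
  setNames(as.list(vals), keys)
}

# One named list per row, then rbind into a single data.frame
attrs <- lapply(df$html, extract_attrs)
attr_df <- do.call(rbind, lapply(attrs, as.data.frame, stringsAsFactors = FALSE))

# Bind the extracted columns to the name column
final <- cbind(df["name"], attr_df)
final
```

All extracted columns are character; wrap them in as.integer() if numbers are needed. A regex like this breaks as soon as the markup varies (single quotes, other attribute names), which is the usual argument for a real HTML parser such as rvest.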

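If you already have the data frame from your question, you could try the following. Your data frame is called mydf here. You can extract all the numbers with stri_extract_all_regex(). Then convert the list to a data frame in the classic way. Finally, assign new column names and bind the result to the name column of the original data frame.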

library(stringi)
library(dplyr)

stri_extract_all_regex(str = mydf$url, pattern = "[0-9]+") %>%
unlist %>%
matrix(ncol = 4, byrow = TRUE) %>%
data.frame %>%
setNames(c("minute", "second", "ID", "data")) %>%
bind_cols(mydf["name"], .)

#   name minute second     ID data
#1  John     68     37   8028   68
#2 Steve     69      4 132205   69
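
The same steps can also be sketched without stringi or dplyr. This is a rough base R equivalent, not the answer's own code. The mydf definition below is an assumption: it reuses the two rows from the question and stores the markup in a column named url, because that is the column the pipeline above reads.

```r
# Assumed reconstruction of mydf: the markup sits in a column named `url`,
# matching the mydf$url reference in the pipeline above
mydf <- data.frame(
  name = c("John", "Steve"),
  url = c(
    '<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>',
    '<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>'
  ),
  stringsAsFactors = FALSE
)

# Extract every run of digits from each string (a list, one element per row)
nums <- regmatches(mydf$url, gregexpr("[0-9]+", mydf$url))

# The reshape below only works if every row yields exactly four numbers
stopifnot(all(lengths(nums) == 4))

# Flatten the list and refill it row by row into a 4-column matrix
mat <- matrix(unlist(nums), ncol = 4, byrow = TRUE)

# Name the columns and bind them to the name column
res <- setNames(as.data.frame(mat, stringsAsFactors = FALSE),
                c("minute", "second", "ID", "data"))
res <- cbind(mydf["name"], res)
res
```

The guard on lengths(nums) matters: matrix() recycles or misaligns values silently when a row contains more or fewer than four numbers, so this approach depends on the markup being completely regular.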
DATA

mydf <- ...

An alternative rvest approach using purrr and dplyr:

library(rvest)
library(purrr)
library(purrrlyr)  # by_row() moved from purrr to purrrlyr in purrr 0.2.3
library(dplyr)

df <- read.table(stringsAsFactors=FALSE, header=TRUE, sep=",", text='name,html
John,<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
Steve,<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>')

by_row(df, .collate="cols", 
       ~read_html(.$html) %>% 
         html_nodes("span:first-of-type") %>% 
         html_attrs() %>% 
         flatten_chr() %>% 
         as.list() %>% 
         flatten_df()) %>% 
  select(-html, -class1) %>% 
  setNames(gsub("^data-|1$", "", colnames(.)))
## # A tibble: 2 × 4
##    name minute second     id
##   <chr>  <chr>  <chr>  <chr>
## 1  John     68     37   8028
## 2 Steve     69      4 132205