Convert HTML character entity encoding in R
Is there a way in R to convert HTML character entity encodings?

html r encoding character-encoding
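I would like to convert HTML character entities, for example &amp; to & or &gt; to >.

Perl has the HTML::Entities package for this, but I could not find anything similar in R. I also tried iconv(), but did not get satisfactory results. There may also be a way to do this with the XML package, but I have not figured it out yet.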
Update: this answer is outdated. Please see the answer below that is based on the newer xml2 package.

Try the following:
# load XML package
library(XML)

# Convenience function to convert html codes
html2txt <- function(str) {
      xpathApply(htmlParse(str, asText=TRUE),
                 "//body//text()", 
                 xmlValue)[[1]] 
}

# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn&apos;t"

# converted string
html2txt(x)
[1] "isn't"

Unescape xml/html values using the xml2 package:
unescape_xml <- function(str){
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}

unescape_html <- function(str){
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

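While unescape_xml does the job, it has the disadvantage that it is not vectorised, so it is slow when applied to a large number of strings. It also only works on a character vector of length one; for longer character vectors you have to use sapply.

To demonstrate this, I first create a large character vector: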
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

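Of course, you need to make sure that the string used to join the elements of str ("#_|" in my example) does not occur anywhere in str. Otherwise you will introduce an error when the large string is split again at the end.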
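
One way to guard against the separator caveat above is to check the input before joining. The following is only a minimal sketch in base R: pick_separator and its list of candidate separators are hypothetical names made up for this illustration and are not part of the original answer.

```r
# Hypothetical helper: return the first candidate separator that occurs in
# none of the input strings, or stop if every candidate collides.
pick_separator <- function(str, candidates = c("#_|", "@@SEP@@", "%%SPLIT%%")) {
  for (sep in candidates) {
    if (!any(grepl(sep, str, fixed = TRUE))) {
      return(sep)
    }
  }
  stop("every candidate separator occurs in the input")
}

# The second element contains "#_|", so that separator must not be used here.
strings <- c("abcd", "a #_| b", "&amp;")
sep <- pick_separator(strings)

# Join and split again with the chosen separator; the round trip is lossless.
joined <- paste0(strings, collapse = sep)
parts <- strsplit(joined, sep, fixed = TRUE)[[1]]
print(sep)
print(identical(parts, strings))
```

The chosen separator could then be used in place of the hard-coded "#_|" in a vectorised function, provided it contains no characters that the HTML or XML parser would interpret.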
Based on that answer, I set out to benchmark the functions:
# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

# then benchmark the functions by Stibu and Jeroen
library(purrr)  # provides map_chr()
bench::mark(
  textutils::HTMLdecode(many_strings),
  map_chr(many_strings, unescape_html),
  unescape_html2(many_strings)
)

# A tibble: 3 x 13
  expression                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
  <bch:expr>                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings)  855.02ms 855.02ms     1.17   329.18MB    10.5      1     9   855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html)    1.09s    1.09s     0.919    6.79MB     5.51     1     6      1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings)           4.85ms   5.13ms   195.     581.48KB     0       98     0   503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

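However, the XML-based variant fails when applied to the many_strings object (probably because read_xml cannot read the euro symbol). I therefore had to try a different way of benchmarking: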
library(tidyverse)
library(rvest)

entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>% 
  html_node(css = "table") %>% 
  html_table() %>% 
  rename(text = X1,
         named = X2,
         hex = X3, 
         dec = X4,
         desc = X5) %>% 
  as_tibble
s2 <- entity_mapping %>% pull(dec) # dec can be replaced by hex or named

bench::mark(
  textutils::HTMLdecode(s2),
  map_chr(s2, unescape_xml),
  map_chr(s2, unescape_html),
  unescape_xml2(s2),
  unescape_html2(s2)
)

# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc    
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s2)   191.7ms  194.9ms      5.16    64.1MB    10.3      3     6      582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml)    73.8ms   80.9ms     11.9   1006.9KB     5.12     7     3      586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html)  162.4ms  163.7ms      5.83  1006.9KB     5.83     3     3      514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2)           459.2µs    473µs   2034.      37.9KB     2.00  1017     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2)            590µs  607.5µs   1591.      37.9KB     2.00   796     1      500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

Here the xml version is even faster than the html version.
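
The benchmark above draws its inputs from the dec column, which is assumed here to hold decimal numeric character references of the form &#38;. For that special case only, a parser is not strictly required. The sketch below uses base R; decode_decimal_refs is a hypothetical name, and the function does not handle named entities such as &amp; or hexadecimal references.

```r
# Hypothetical helper: replace every decimal reference "&#NNN;" in each
# element of x with the character that has that Unicode code point.
decode_decimal_refs <- function(x) {
  matches <- gregexpr("&#[0-9]+;", x)
  regmatches(x, matches) <- lapply(regmatches(x, matches), function(refs) {
    # Strip "&#" and ";" to keep only the digits, then convert to integers.
    codes <- as.integer(gsub("[^0-9]", "", refs))
    vapply(codes, intToUtf8, character(1))
  })
  x
}

decoded <- decode_decimal_refs(c("&#65;&#66;", "no refs", "x &#38; y"))
print(decoded)
```

Because regmatches<- works on the whole vector at once, this sketch is vectorised over x without the join-and-split trick.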
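This throws an error on my machine. An alternative approach can be found by searching R-help.

I got the following error: `XML content does not seem to be XML, nor to identify a file name 'isn&apos;t'`

Great answer! Would you mind explaining why <x> and </x> are needed? If this is not the right place, I am happy to ask a new question.

I only made Jeroen's answer more efficient; the basic idea is his, not mine. If you try the code without <x> and </x>, you will notice that it fails with an error. The reason is that read_html() accepts either a string containing HTML code or the path to an HTML file. If the string contains no markup (that is, not a single HTML tag), the function assumes it has been given a path and tries to read a file that of course does not exist.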
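
The behaviour described in the comment above can be checked with a short sketch, assuming the xml2 package is installed. The variable names are illustrative only, and the exact error raised for the unwrapped string may differ between xml2 versions, so it is caught rather than shown.

```r
library(xml2)

entity_string <- "3 &lt; x &amp; x &gt; 9"

# Without any tag the input is treated as a file path; capture the resulting
# error and return NA instead of stopping.
unwrapped <- tryCatch(
  xml_text(read_html(entity_string)),
  error = function(e) NA_character_
)

# Wrapped in a dummy <x> element, the same input is parsed as markup and the
# entities are decoded.
wrapped <- xml_text(read_html(paste0("<x>", entity_string, "</x>")))

print(unwrapped)
print(wrapped)
```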
unescape_html2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

system.time(res2 <- unescape_html2(many_strings))
##    user  system elapsed 
##   0.011   0.000   0.010 
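# Assumption: res holds the element-wise result, e.g. from
# sapply(many_strings, unescape_html, USE.NAMES = FALSE)
identical(res, res2)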
## [1] TRUE

unescape_xml2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_xml(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

bench::mark(
  # gsubreplace_mapping(s2, entity_mapping),
  # gsubreplace_local(s2),
  textutils::HTMLdecode(s3),
  map_chr(s3, unescape_xml),
  map_chr(s3, unescape_html),
  unescape_xml2(s3),
  unescape_html2(s3)
)

# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc    
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s3)   204.2ms  212.3ms      4.72    64.1MB     7.87     3     5      636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml)    76.4ms   80.2ms     11.8   1006.9KB     5.04     7     3      595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html)  164.6ms  165.3ms      5.80  1006.9KB     5.80     3     3      518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3)           487.4µs  500.5µs   1929.      74.5KB     2.00   965     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3)          611.1µs  627.7µs   1574.      40.4KB     0      788     0      501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.