Convert HTML character entity encoding in R

Is there a way in R to convert HTML character entity encodings?

I would like to convert HTML character entities, for example &amp; to &. For Perl there is the package HTML::Entities, which can do this, but I could not find anything similar in R.

I also tried iconv(), but did not get satisfactory results. There may also be a way using the XML package, but I have not figured it out yet.

Tags: html, r, encoding, character-encoding

Update: this answer is outdated. Please check the answer based on the newer xml2 package.

Try something along these lines:
# load XML package
library(XML)
# Convenience function to convert html codes
html2txt <- function(str) {
  xpathApply(htmlParse(str, asText = TRUE),
             "//body//text()",
             xmlValue)[[1]]
}
# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn&apos;t"
# converted string
html2txt(x)
[1] "isn't"
Use the xml2 package to unescape XML/HTML values:
unescape_xml <- function(str){
  xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}
unescape_html <- function(str){
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
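Although unescape_xml and unescape_html do the job, they have the drawback of not being vectorised, so they are slow when applied to a large number of strings. They also only work on a character vector of length one; for a longer character vector you have to use sapply.
To demonstrate this, I first create a large character vector: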
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)
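The element-wise baseline that the paragraph above describes might look like the following sketch. It assumes the xml2 package is installed; the name res is chosen to match its use in identical(res, res2) elsewhere in this thread, and the exact call is an assumption rather than a quotation from the answer.

```r
# Sketch: element-wise decoding with sapply(), assuming xml2 is installed
unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)

# sapply() runs the parser once per element, which is what makes this slow
system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
```

The timing depends on the machine, so no figure is given here.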
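unescape_html2 avoids this by pasting all elements of str into one large string, so that the parser runs only once, and then splitting the result again with strsplit(). Of course, you need to make sure that the string used to join the elements of str ("#_|" in my example) does not appear anywhere in str. Otherwise you will introduce an error when the large string is split again at the end.

Based on Stibu's answer, I set out to benchmark the functions: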
library(purrr) # for map_chr()
# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 10000, replace = TRUE)
# then benchmark the functions by Stibu and Jeroen
bench::mark(
  textutils::HTMLdecode(many_strings),
  map_chr(many_strings, unescape_html),
  unescape_html2(many_strings)
)
# A tibble: 3 x 13
  expression                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time
  <bch:expr>                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings)  855.02ms 855.02ms     1.17   329.18MB    10.5      1     9   855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html)    1.09s    1.09s     0.919    6.79MB     5.51     1     6      1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings)           4.85ms   5.13ms   195.     581.48KB     0       98     0   503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
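However, the read_xml-based variant fails on the many_strings object (probably because read_xml cannot read the Euro sign). So I had to try another way of benchmarking: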
library(tidyverse)
library(rvest)
entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>%
  html_node(css = "table") %>%
  html_table() %>%
  rename(text = X1,
         named = X2,
         hex = X3,
         dec = X4,
         desc = X5) %>%
  as_tibble()
s2 <- entity_mapping %>% pull(dec) # dec can be replaced by hex or named
bench::mark(
  textutils::HTMLdecode(s2),
  map_chr(s2, unescape_xml),
  map_chr(s2, unescape_html),
  unescape_xml2(s2),
  unescape_html2(s2)
)
# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s2)   191.7ms  194.9ms      5.16    64.1MB    10.3      3     6      582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml)    73.8ms   80.9ms     11.9   1006.9KB     5.12     7     3      586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html)  162.4ms  163.7ms      5.83  1006.9KB     5.83     3     3      514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2)           459.2µs    473µs   2034.      37.9KB     2.00  1017     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2)            590µs  607.5µs   1591.      37.9KB     2.00   796     1      500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
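Here the xml version is even faster than the html version.

This throws an error on my device.

Another approach can be found by searching Rhelp.

I get the following error: `XML content does not seem to be XML, nor to identify a file name 'is&apos;t'`

Nice answer! Would you mind explaining why <x> and </x> are needed? If this is not the right place, I am happy to ask a new question.

I just made Jeroen's answer more efficient; the basic idea is his, not mine. If you try the code without <x> and </x>, you will notice that it fails with an error. The reason is that read_html() accepts either a string containing HTML code or the path to an HTML file. If the string contains no markup (that is, not a single HTML tag), the function assumes it has been given a path and tries to read a file that, of course, does not exist.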
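The behaviour described in this comment can be checked with a small sketch, assuming the xml2 package is installed. The variable names bare and wrapped are illustrative only, and the exact error message is not relied on because it may differ between versions.

```r
# Sketch, assuming xml2 is installed
bare    <- "isn&apos;t"
wrapped <- paste0("<x>", bare, "</x>")

# Without any tag the input is treated as a file path, so reading it fails
bare_result <- tryCatch(xml2::read_html(bare), error = function(e) e)
inherits(bare_result, "error")

# With the wrapper tag the input is parsed as HTML and the entity is decoded
wrapped_result <- xml2::xml_text(xml2::read_html(wrapped))
wrapped_result
```

The wrapper tag name x carries no meaning of its own; it only ensures the input is recognised as markup rather than as a path.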
unescape_html2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
system.time(res2 <- unescape_html2(many_strings))
## user system elapsed
## 0.011 0.000 0.010
identical(res, res2)
## [1] TRUE
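The caveat about the "#_|" separator can be guarded against with a simple check before the strings are joined. The helper below, check_separator, is hypothetical (it is not part of any answer or package) and uses base R only.

```r
# Hypothetical helper: stop early if the separator occurs in any input element
check_separator <- function(str, sep = "#_|") {
  hit <- grepl(sep, str, fixed = TRUE)
  if (any(hit)) {
    stop("separator '", sep, "' occurs in element(s): ",
         paste(which(hit), collapse = ", "))
  }
  invisible(TRUE)
}

check_separator(c("abcd", "&amp; &apos; &gt;"))   # passes silently
safe <- tryCatch(check_separator(c("abcd", "a#_|b")), error = function(e) e)
conditionMessage(safe)
```

This only inspects the raw input. An element that decodes to the separator (for instance one that spells it with numeric character references) would still slip through, so the check is a safeguard rather than a guarantee.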
unescape_xml2 <- function(str){
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_xml(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
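One plausible explanation for the read_xml failure mentioned earlier is that XML itself predefines only five entities (&amp;, &lt;, &gt;, &apos;, &quot;), whereas named HTML entities such as &euro; are only known to the HTML parser. The sketch below assumes the xml2 package is installed; the variable names are illustrative.

```r
# Sketch, assuming xml2 is installed
input <- "<x>&euro; &lt;</x>"

# read_xml() is expected to reject the HTML-only entity &euro;
xml_result <- tryCatch(xml2::xml_text(xml2::read_xml(input)),
                       error = function(e) e)
inherits(xml_result, "error")

# read_html() knows the named HTML entities and decodes both
html_result <- xml2::xml_text(xml2::read_html(input))

# The five predefined XML entities work with read_xml()
predefined <- xml2::xml_text(
  xml2::read_xml("<x>&amp; &lt; &gt; &apos; &quot;</x>")
)
predefined
```

Numeric references such as those in the dec and hex columns are valid in both XML and HTML, which would be consistent with unescape_xml2 working on s2.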
# s3 is not defined in this thread; presumably another column of entity_mapping (hex or named)
bench::mark(
  # gsubreplace_mapping(s2, entity_mapping),
  # gsubreplace_local(s2),
  textutils::HTMLdecode(s3),
  map_chr(s3, unescape_xml),
  map_chr(s3, unescape_html),
  unescape_xml2(s3),
  unescape_html2(s3)
)
# A tibble: 5 x 13
  expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory   time   gc
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>   <list> <list>
1 textutils::HTMLdecode(s3)   204.2ms  212.3ms      4.72    64.1MB     7.87     3     5      636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml)    76.4ms   80.2ms     11.8   1006.9KB     5.04     7     3      595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html)  164.6ms  165.3ms      5.80  1006.9KB     5.80     3     3      518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3)           487.4µs  500.5µs   1929.      74.5KB     2.00   965     1      500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3)          611.1µs  627.7µs   1574.      40.4KB     0      788     0      501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.