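Memory leak when using package XML on Windows

Having read up on this (including linked posts) and on R-help, and given that some time has passed again, I still think this is an unresolved issue that deserves attention, because the package is used widely throughout the R world. So please consider this a follow-up post and/or reference for the problem.

The problem

When parsing XML/HTML documents, C pointers need to be used internally (AFAIU) so that the documents can be searched later. At least on MS Windows (I'm running Windows 8.1, 64-bit), the garbage collector does not correctly recognize these references. As a result, the memory consumed is not freed properly, which makes the R process freeze at some point.

Main findings so far

It seems to me that XML::free and/or gc do not recognize all of the memory involved when XML/HTML documents are parsed via xmlParse or htmlParse and subsequently processed with tools such as xpathApply:

The memory usage reported by the OS task (Rterm.exe) increases very quickly, while the memory "as seen from inside R" (function memory.size) increases only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after the bulk parsing loop below.

In sum, even with everything that has been recommended (free, rm and gc), memory usage still always increases when calling xmlParse and friends. It is only a question of how much. So I think there must still be something that is not working correctly.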
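As a side note on measuring "memory as seen from inside R": memory.size is Windows-only and, as far as I know, was turned into a stub in R 4.2.0, so it may not be usable on current installations. A minimal sketch of a portable stand-in, assuming the summed "(Mb)" used column of gc() is an acceptable proxy (the helper name r_mem_mb is made up for this illustration):

```r
# Hypothetical helper: memory in use by R objects, in Mb, taken from gc().
# Column 2 of the gc() matrix is the "(Mb)" value next to "used";
# the rows are Ncells and Vcells.
r_mem_mb <- function() {
  g <- gc()
  sum(g[, 2])
}

before <- r_mem_mb()
x <- numeric(5e6)      # about 40 MB of doubles
during <- r_mem_mb()
rm(x)
after <- r_mem_mb()    # gc() runs inside the helper, so x is collected here

c(before = before, during = during, after = after)
```

This only covers memory that R's own allocator knows about. Memory held by C libraries such as libxml2 would not show up here, which is exactly the gap between the R-side and OS-side figures described above.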
Illustration
I borrowed the profiling code from Duncan's Omegahat
Some preparations:
Sys.setenv("LANGUAGE"="en")
require("compiler")
require("XML")
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] compiler stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] XML_3.98-1.1
The functions we need:
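## Memory of a process as reported by the Windows task list, in MB
getTaskMemoryByPid <- cmpfun(function(
  pid=Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  ## Column 5 of the CSV output is the memory usage, e.g. "83.492 K"
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  ## Strip the thousands separator ".", whitespace and the unit "K"
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}, options=list(suppressAll=TRUE))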
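One caveat about the parsing step above: the gsub pattern removes "." as the thousands separator, which matches the German locale shown in sessionInfo(). On a system where tasklist prints a comma instead, the result would presumably be NA. A minimal sketch of a locale-tolerant variant, assuming the column always holds a whole number of kilobytes (the helper name and the sample strings are invented for this illustration):

```r
# Hypothetical helper: turn a tasklist memory string into MB,
# regardless of which thousands separator the locale uses.
parse_tasklist_mem_mb <- function(mem) {
  kb <- as.numeric(gsub("[^0-9]", "", mem))  # keep digits only
  kb / 1000
}

# Made-up sample values in German-style and English-style formatting
parse_tasklist_mem_mb(c("83.492 K", "83,492 K"))
```

Both sample strings should come out as the same number of MB, since only the digits are kept before the division.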
memoryLeak <- cmpfun(function(
  x=system.file("exampleData", "mtcars.xml", package="XML"),
  n=10000,
  use_text=FALSE,
  xpath=FALSE,
  free_doc=FALSE,
  clean_up=FALSE,
  detailed=FALSE
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_1 <- memory.profile()
  mem_before <- list(mem_r=mem_r,
    mem_os=mem_os, ratio=mem_os/mem_r)
  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- xmlParse(x, asText=use_text)
    if (xpath) {
      res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile=memory.profile(),
        size=memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }
  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
  }
  ## After //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_2 <- memory.profile()
  mem_after <- list(mem_r=mem_r,
    mem_os=mem_os, ratio=mem_os/mem_r)
  list(
    before=mem_before,
    perrun=mem_perrun,
    gc=mem_gc,
    after=mem_after,
    comparison_r=data.frame(
      before=prof_1,
      after=prof_2,
      increase=round((prof_2/prof_1)-1, 4)
    ),
    increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
    increase_os=(mem_after$mem_os/mem_before$mem_os)-1
  )
}, options=list(suppressAll=TRUE))
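Scenario 2
Quick facts: garbage collection is enabled, free is called explicitly, and the XML document is parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.315249
After: 1.222143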
res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res
$before
$before$mem_r
[1] 63.48
$before$mem_os
[1] 83.492
$before$ratio
[1] 1.315249
$perrun
NULL
$gc
$gc$gc_mb
[1] 69.3
$after
$after$mem_r
[1] 95.92
$after$mem_os
[1] 117.228
$after$ratio
[1] 1.222143
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7454 7454 0.0000
pairlist 392455 592466 0.5096
closure 55104 105104 0.9074
environment 51032 101032 0.9798
promise 105226 205226 0.9503
language 55592 55592 0.0000
special 44 44 0.0000
builtin 648 648 0.0000
char 8847 8848 0.0001
logical 9141 9141 0.0000
integer 23109 23111 0.0001
double 2802 2807 0.0018
complex 1 1 0.0000
character 94775 144781 0.5276
... 0 0 NaN
any 0 0 NaN
list 20174 20177 0.0001
expression 1 1 0.0000
bytecode 16265 16265 0.0000
externalptr 1488 1487 -0.0007
weakref 392 391 -0.0026
raw 393 392 -0.0025
S4 1392 1392 0.0000
$increase_r
[1] 0.5110271
$increase_os
[1] 0.4040627
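For readers who want to check how the summary figures relate to the raw numbers, here is a small sketch that recomputes them from the values printed above, using the same formulas as memoryLeak (OS memory divided by R memory for the ratio, after divided by before minus one for the increase). The variable names are only for this illustration:

```r
# Values copied from the Scenario 2 output above (in MB)
before <- list(mem_r = 63.48, mem_os = 83.492)
after  <- list(mem_r = 95.92, mem_os = 117.228)

# Ratio of OS-reported memory to R-reported memory
ratio_before <- before$mem_os / before$mem_r
ratio_after  <- after$mem_os / after$mem_r

# Relative increase over the run, as seen by R and by the OS
increase_r  <- after$mem_r / before$mem_r - 1
increase_os <- after$mem_os / before$mem_os - 1

round(c(ratio_before = ratio_before, ratio_after = ratio_after,
        increase_r = increase_r, increase_os = increase_os), 4)
```

In this scenario the OS-side increase (about 40 %) stays in the same range as the R-side increase (about 51 %), so the ratio barely moves.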
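I also tried different versions. Well, I tried to ;-)
From source: omegahat.org
FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source works fine)
GitHub
I did not follow the recommendation in the GitHub repo, because it states that the repo only contains the tar.gz of version 3.94-0 (while we are at version 3.98-1.1 on CRAN).
Even though it is stated that the GitHub repo is not in standard R package structure, I tried it with install_github anyway, and failed ;-)
Nothing has happened since I posted this question, so I thought I would bring it to everyone's attention again.
This is the updated version of my investigation
Preliminaries
Functions
Generating additional offline example content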
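Although it is still in its infancy (only a few months old!) and has a few quirks, Hadley Wickham has written an XML parsing library, xml2, which can be found on GitHub. It is limited to reading rather than writing XML, but for parsing XML I have been experimenting with it, and it looks like it will do the job without the memory leaks of the XML package! The functions it provides include:
read_xml() to read an XML file
xml_children() to get the child nodes of a node
xml_text() to get the text within a tag
xml_attrs() to get the attributes of a node
Note that you still need to make sure you rm() the XML node objects once you are done with them, and force a garbage collection with gc(), but the memory does then actually get released to the O/S (disclaimer: only tested on Windows 7, but this seems to be the most "memory leaky" platform).
Hope this helps someone!

Following Matthew Wise's answer above about using xml2, I found that the function that really frees the memory is xml_remove() followed by gc(), rather than rm().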
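To make the two answers above concrete, here is a minimal sketch of the xml2 workflow they describe. It assumes the xml2 package is installed; the inline document is invented so the example runs offline, and the exact cleanup order is my reading of the answers rather than a verified fix for the leak:

```r
library(xml2)

# Hypothetical inline document (made-up sample data)
doc <- read_xml(
  "<cars><car name='Mazda RX4' cyl='6'>21.0</car><car name='Datsun 710' cyl='4'>22.8</car></cars>"
)

nodes     <- xml_children(doc)        # the two <car> nodes
values    <- xml_text(nodes)          # text inside each tag
attrs     <- xml_attrs(nodes)         # list of named character vectors
car_names <- xml_attr(nodes, "name")  # a single attribute per node

# Cleanup along the lines of the second answer: detach the nodes from the
# document with xml_remove(), drop the R references, then collect garbage.
xml_remove(nodes)
n_left <- length(xml_children(doc))   # children remaining after removal
rm(nodes, doc)
invisible(gc())
```

Whether this actually returns memory to the operating system depends on the platform and the xml2 version, so it is worth confirming with an OS-level measurement such as the tasklist helper used in the question.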
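Wouldn't this be better off as a GitHub issue rather than a Stack Overflow question? At least the authors would have a better chance of seeing it.
Well, you have a point ;-) I didn't think that far. But I have already contacted Duncan Temple Lang.
Nice investigation. If the problem gets confirmed and/or resolved, please post an answer. I'd love to know where this ends up.
@tonytonov Thanks, man. Sure, I'll keep you updated.
@Rappster Any progress on your investigation? I tried my own test by stepping through the source code, but I kept running into errors in debug mode, so I didn't get far.
Awesome!!! I stumbled across xml2 in some of Hadley's GitHub issues/comments and silently hoped it meant he was re-implementing an XML parser in R. Glad to hear that is the case, and thanks a lot for the pointer!
Thanks for following up!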
res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
res
$before
$before$mem_r
[1] 95.94
$before$mem_os
[1] 117.088
$before$ratio
[1] 1.220429
$perrun
NULL
$gc
$gc$gc_mb
[1] 93.4
$after
$after$mem_r
[1] 124.64
$after$mem_os
[1] 1639.8
$after$ratio
[1] 13.15629
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7454 7460 0.0008
pairlist 592458 793042 0.3386
closure 105104 155110 0.4758
environment 101032 151032 0.4949
promise 205226 305226 0.4873
language 55592 55882 0.0052
special 44 44 0.0000
builtin 648 648 0.0000
char 8847 8867 0.0023
logical 9142 9162 0.0022
integer 23109 23112 0.0001
double 2802 2832 0.0107
complex 1 1 0.0000
character 144775 194819 0.3457
... 0 0 NaN
any 0 0 NaN
list 20174 20177 0.0001
expression 1 1 0.0000
bytecode 16265 16265 0.0000
externalptr 1488 1487 -0.0007
weakref 392 391 -0.0026
raw 393 392 -0.0025
S4 1392 1392 0.0000
$increase_r
[1] 0.2991453
$increase_os
[1] 13.00485
> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb
* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
The downloaded source packages are in
'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
installation of package 'XML' had non-zero exit status
require("devtools")
> install_github(repo="XML", username="omegahat")
Installing github repo XML/master from omegahat
Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
Installing XML
"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL \
"C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master" \
--library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source \
--install-tests
* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
Error: Command failed (1)
require("rvest")
require("XML")
## Memory of a process as reported by the Windows task list, in MB
getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}
## Snapshot of memory as seen by R and by the OS
getCurrentMemoryStatus <- function() {
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_1 <- memory.profile()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}
memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1 <- memory.profile()
  mem_before <- getCurrentMemoryStatus()
  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else {
      if (file.exists(x)) {
        ## From disk //
        rvest::html(x)
      } else {
        ## From web //
        rvest::html_session(x, user_agent)
      }
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }
  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }
  ## After //
  prof_2 <- memory.profile()
  mem_after <- getCurrentMemoryStatus()
  ## Return value //
  if (detailed) {
    list(
      before = mem_before,
      perrun = mem_perrun,
      gc = mem_gc,
      after = mem_after,
      comparison_r = data.frame(
        before = prof_1,
        after = prof_2,
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}
getCurrentMemoryStatus()
s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")
s <- html_session(
"http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")
getCurrentMemoryStatus()
################
## Mtcars.xml ##
################
res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)
res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)
res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)
###################
## www.had.co.nz ##
###################
## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE,
detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)
## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)
####################
## www.amazon.com ##
####################
## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE,
detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE,
free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)
## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
## File names use 5.x here so the online runs do not overwrite
## the offline amazon.html results saved as 4.x above
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.1.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.2.rdata")
save(res, file = fpath)
res <- memoryLeak(x = .url, n = 50, clean_up = TRUE,
  free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-5.3.rdata")
save(res, file = fpath)