Java web crawler runs out of heap space


I wrote a small crawler and found that it runs out of heap space (even though I currently limit the number of URLs in my list to 300).

Using the Java Memory Analyzer, I found that the main consumer is char[] (45 MB out of 64 MB; if I increase the allowed size it takes even more; it just keeps growing).

The analyzer also shows the contents of the char[]. It contains the HTML pages read by the crawler.

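Analyzing further with different settings of -Xmx[…]m, I found that Java uses almost all of the available space and then runs out of heap as soon as I try to download a 3 MB image.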

When I give Java 16 MB, it uses 14 MB and then fails; when I give it 64 MB, it uses 59 MB and fails when it tries to download a large image.

Reading a page is done with this code (edited to add .close()):

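Another function uses the returned string in a while loop, but as far as I know the space should be freed as soon as the string is overwritten by the next page: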

public void run() {
    boolean stop = false;

    while (stop == false) {
        try {
            Website nextPage = getNextPage();

            String source = visitAndReadPage(nextPage);
            List<Website> links = new LinkExtractor(nextPage).extract(source);
            List<Website> images = new ImageExtractor(nextPage).extract(source);

            // do something with links and images, source is not used anymore
        } catch (CrawlerException e) {
            logger.warning("could not crawl a url");
        }
    }
}

Edit: After testing it with more memory, I found a URL like this in the dominator tree:

Class Name                                                                                                                                                                                                                                                                                              | Shallow Heap | Retained Heap | Percentage

crawling.Website @ 0xa8d28cb0                                                                                                                                                                                                                                                                           |           16 |       759.776 |      0,15%
|- java.net.URL @ 0xa8d289c0  https://www.google.com/recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kN...       |           56 |       759.736 |      0,15%
|  |- char[379486] @ 0xa8c6f4f8  <!DOCTYPE html><html lang="en">  <head>  <meta charset="utf-8">  <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9">  <title>Google Accounts</title><style type="text/css">  html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl,  dt, dd, ol, ul, li, t...    |      758.984 |       758.984 |      0,15%
|  |- java.lang.String @ 0xa8d28a40  /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...|           24 |           624 |      0,00%
|  |  '- char[293] @ 0xa8d28a58  /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...    |          600 |           600 |      0,00%
|  |- java.lang.String @ 0xa8d289f8  c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl6YmMgFC77kWZR7vvZIPkS...|           24 |            24 |      0,00%
|  |- java.lang.String @ 0xa8d28a10  www.google.com                                                                                                                                                                                                                                                     |           24 |            24 |      0,00%
|  |- java.lang.String @ 0xa8d28a28  /recaptcha/api/image                                                                                                                                                                                                                                               |           24 |            24 |      0,00%

Essentially, what I really want to know is: why is the HTML source part of java.net.URL? Does it come from the URL connection I opened?

I would first try closing the reader and the URL connection at the end of the readPage method. It is best to put this logic in a finally clause.

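A connection that stays open uses memory, and depending on the internals, the GC may not be able to reclaim it even if you no longer reference it in your code.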

Update (based on the comments): the connection itself has no close() method; the connection is closed once all of its readers have been closed.
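A minimal sketch of what closing the reader reliably could look like, assuming Java 7 or later, where try-with-resources plays the role of the finally clause. The class name PageReader, the helper readAll, and the UTF-8 assumption are illustrative and not taken from the question; readPage here is only a hypothetical stand-in for the asker's method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

final class PageReader {

    // Reads everything from the reader and closes it, even if reading fails.
    static String readAll(Reader in) throws IOException {
        try (BufferedReader reader = new BufferedReader(in)) {
            StringBuilder page = new StringBuilder();
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                page.append(buffer, 0, read);
            }
            return page.toString();
        }
    }

    // Hypothetical stand-in for the readPage method discussed above.
    static String readPage(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // Assumption: the page is UTF-8; a real crawler would check the Content-Type header.
        return readAll(new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
    }
}
```

Closing the BufferedReader also closes the wrapped InputStreamReader and the underlying input stream, so no separate close call on the connection is needed.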

"When I give Java 16 MB, it uses 14 MB and then fails; when I give it 64 MB, it uses 59 MB and fails when it tries to download a large image."


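That is not surprising, because you are already close to the limit. A 3 MB image can expand to 60 MB or more once it is loaded (decompressed). Can you increase the maximum to 1 GB?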
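To make the size difference concrete, here is a rough estimate of the memory a decoded image needs: width times height times bytes per pixel. The dimensions and the 4 bytes per pixel are assumptions chosen for illustration, not values from the question, and the class name is hypothetical.

```java
final class DecodedImageSize {

    // Bytes needed for an uncompressed raster: every pixel is stored in full.
    static long decodedBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Hypothetical 4000 x 3000 photo held at 4 bytes per pixel (for example ARGB).
        long bytes = decodedBytes(4000, 3000, 4);
        System.out.println(bytes + " bytes once decoded");
    }
}
```

Under these assumptions the decoded image takes 48,000,000 bytes, regardless of how small the compressed file on disk is.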

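I am not sure your information supports the conclusion that garbage collection is not working. You run out of memory while allocating more memory. You say you think there are objects eligible for GC, but the JVM disagrees. I am pretty sure I would trust the JVM over a guess.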
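One way to replace the guess with numbers is to log heap usage at fixed points, for example after each crawled page. The sketch below uses only the standard Runtime methods; the class name HeapLog and the labels are hypothetical.

```java
final class HeapLog {

    // Heap currently in use: what the JVM has claimed minus what is free inside it.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // The ceiling set by -Xmx (or the JVM default).
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    static String describe(String label) {
        return String.format("%s: used %d KB of max %d KB",
                label, usedBytes() / 1024, maxBytes() / 1024);
    }

    public static void main(String[] args) {
        System.out.println(describe("before"));
        char[] page = new char[1_000_000]; // stands in for one downloaded page
        System.out.println(describe("holding one page"));
        page = null;  // drop the only reference
        System.gc();  // only a hint; the JVM may ignore it
        System.out.println(describe("after dropping it"));
    }
}
```

If the used figure keeps climbing from page to page even after collections, something is still referencing the old pages; if it drops back, the collector is doing its job and the heap is simply too small for the largest allocation.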


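There is a memory leak somewhere in your application. Somewhere, in some object, you are keeping a reference to the entire content of a web page. That is what fills up your free memory.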
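One way such a reference can arise, stated here as an assumption rather than a diagnosis: on JVMs older than Java 7 update 6, String.substring returned a string that shared the character array of the original, so a short link cut out of a page could keep the whole page's char[] alive. The sketch below stores a fresh String for every extracted link. The class name DetachedLinks and the regular expression are hypothetical and are not the asker's LinkExtractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DetachedLinks {

    // Simplistic pattern for illustration only; real HTML needs a proper parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> extract(String source) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(source);
        while (m.find()) {
            // On old JVMs group() could share the page's char[]; wrapping it in
            // new String(...) made the stored link hold only its own characters.
            // On current JVMs substrings are already independent copies.
            links.add(new String(m.group(1)));
        }
        return links;
    }
}
```

On a current JVM the extra new String call is redundant but harmless; it only matters where substrings share storage with their source.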

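Most likely a reference is being held somewhere that prevents garbage collection. Fixing this always takes some poking around. I usually start with a profiler that offers heap analysis. If possible, write a small test program that loads a page and does not do much else. It can simply go through a list of 3-4 URLs that contain some large pictures. If p...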
Class Name                                                                                                                                                                                                                                                                                   | Shallow Heap | Retained Heap | Percentage
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
char[60750] @ 0xb02c3ee0  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.512 |       121.512 |      1,06%
char[60716] @ 0xb017c9b8  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.448 |       121.448 |      1,06%
char[60686] @ 0xb01f3c88  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.384 |       121.384 |      1,06%
char[60670] @ 0xb015ec48  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.352 |       121.352 |      1,06%
char[60655] @ 0xb01d5d08  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.328 |       121.328 |      1,06%
char[60651] @ 0xb009d9c0  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.320 |       121.320 |      1,06%
char[60637] @ 0xb022f418  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.288 |       121.288 |      1,06%